Time Sensitive Computation of Aggregate Functions over Distributed Imprecise Data Qi Han, Matthew Ba Nguyen Sandy Irani, Nalini Venkatasubramanian Distributed Systems Middleware Group http://www.ics.uci.edu/~dsm School of Information and Computer Science University of California-Irvine Motivation Real time applications often make decisions based on timely results aggregated over distributed data Example: real-time fault localization and problem diagnosis How many nodes are overloaded? (count) What is the total latency along the path N1N2 Nk? (sum) What is the bottleneck (minimum link bandwidth) along a path N1N2 Nk? (min) Challenges Continuous stream of fast changing source data Diverse user requirements in terms of data

accuracy and service timeliness Effective utilization of underlying computation, communication and storage resources competing goals of timeliness, accuracy and cost-effectiveness This paper Previous research Tradeoff between accuracy and cost storing data in ranges Tradeoffs between accuracy, cost and timeliness (single data item) RTSS2003 This paper addresses Tradeoffs between accuracy, cost and timeliness for aggregate functions (count, sum, min) Probing part of the sources might be sufficient How the server selects an appropriate subset of sources to probe so that the overall probing cost is minimized without violating accuracy

and timeliness requirements of aggregate queries Problem Characterization Queries: f(s1,s2,,si,sn), A, D Objective : minimize ci A: accuracy constraint D: time constraint Example: min(s1,s2,sn), 3, 1 minute iP (1) f (r1P , , rnP ) A subject to ( 2)T f D server c 2 /d 2 1 /d 1 /d cn di c i/

c Database [L1,U1],[L2,U2] [Ln,Un] n s1: E1 [L1,U1] s2: E2 [L2,U2] source 1 source 2 ... si: Ei [Li,Ui] source i ... Sn: En [Ln,Un] source n Time-sensitive Computation of Aggregate Functions Compute the function based on stored values in

the database If the answer meets accuracy constraint: done Otherwise: select probing set Only consider SD: the subset of sources whose diD Batch selection The entire set of sources to probe is selected before the probes actually occur Iterative selection The source is probed one at a time Batch Selection of Source Probing Set For Count (Batch_COUNT) Problem: calculate the number of source values that fall inside the range r=[l,u]: fcount=|{si|si[l,u]}| l u

r s1 s2 s3 s4 s5 s6 Inside | I | f count | I | | U | Outside Uncertain e.g. f count [1,4] Algorithm: if |U|>A: we must probe |U|-A sources to determine the function within the desired accuracy If |USD||U|-A: order the sources in |USD| according to increasing cost and probe the first |U|-A sources in this ordering o/w: cannot determine the the function to within the desired accuracy Batch Selection of Source

Probing Set for Sum (BATCH_SUM) n Problem : f sum si i 1 e.g. s1 [1,5] before probing : f sum li , ui f sum (ui li ) si s D si sD si s D s2 [2,4] f sum [3,9], f sum 6 probing si P ui li 0 P : probing set (u i li ) A, where P S P si P can be reduced to 0 / 1 knapsack problem The objective is to maximize ci si P Batch Selection of Source

Probing Set for Min (BATCH_MIN) n Problem : f min min si i 1 Algorithm : 1. identify upper and lower bound of the min function before probing : u min ui , l min li ; iS 2.if u-l A, decide subset L : L i | li u A ; 3. probe each source in L S D A Example: s1 s2 s3 u l s4 L s1 , s2 , s3 iS Issues in Computing min In the case of computing count or sum Know in advance exactly the benefit of probing any particular source, thus can decide in advance which sources to probe

count: probing any source decreases the uncertainty by 1 sum: probing source si decreases the uncertainty by ui-li In the case of computing min The number of probes required may vary depending on the values of the sources Example: s1=[0,5], s2=[1,6], A=1 If probe s1 and e1=2, then min=[1,2], done If probe s1 and e1=5, then min=[1,5], so s2 must be probed Iterative Selection of Source Probing Set for Min define current upper bound : u min min ui iS after probing si , the new upper bound will be min ei , u min Assuming ei is uniformly distributed over the range li , ui , expected amount by which the range for the minimum decreases when si is probed : u min ben( si ) li (li umin ) 2 umin x dx ui li

2(ui li ) probe sources that maximizes BCR ben( si ) ci d i Performance Evaluation Baseline policies compared with GREEDY: probe all LAZY: probe none RANDOM Performance metrics Cost ci iP Accuracy ratio (a/A): measures how close the answer interval a matches the accuracy constraint A Latency ratio (d/D): measure how close the time d spent answering the query matches the time constraint D

Accuracy satisfaction ratio: the percentage of queries with their accuracy constraints met Deadline satisfaction ratio: the percentage of queries with their time constraints met Basic Performance Results for Computing Count count 450 latency ratio 400 350 cost 300 250 200 150 100 50 0 RANDOM BATCH LAZY 4 10 9 8 7 6 5

4 3 2 1 0 3.5 accuracy ratio 500 GREEDY count count 3 2.5 2 1.5 1 0.5 0 GREEDY RANDOM BATCH LAZY GREEDY RANDOM BATCH

LAZY Basic Performance Results for Computing Min min min 1.2 1 accuracy satisfaction ratio deadline satisfaction ratio 1.2 0.8 0.6 0.4 0.2 0 1 0.8 0.6 0.4 0.2 0 GREEDY RANDOM BATCH ITERATIVE

LAZY GREEDY RANDOM BATCH ITERATIVE LAZY Performance of Computing Count under varying accuracy constraints count 400 350 4 3.5 300 250 200 150 3 2.5 accuracy ratio cost count

100 50 0 2 1.5 1 0.5 0 5 10 15 20 accuracy constraints 25 30 5 10 15 20 accuracy constraints 25 30

Performance of Computing Count under varying time constraints count count 3.5 250 3 accuracy ratio 300 cost 200 150 100 2.5 2 1.5 1 50 0.5 0 0 1 3 5

7 deadline 9 11 1 3 5 7 deadline 9 11 Conclusions The worst case analysis (batch selection algorithms) provides a bound on the cost to satisfy queries regardless of the exact values of sources More sophisticated models such as Gaussian distribution can be used to capture the change of source values Also interesting to conduct competitive analysis of algorithms for answering aggregate queries