Measure what is measurable and make measurable what is not so. – Galileo
These days many teams monitor their systems with Graphite or a similar tool. Such tools gather time-series data, store it, aggregate it and expose computed values via an API or a dashboard.
One particularly interesting piece of information is the response time, often used as the main performance indicator in web applications or as the metric against which an SLA is verified. We look at the dashboard, see that the average response time is below a certain threshold and sleep well. Or – do we?
Average is tempting – after all, even if we measure various response time percentiles we end up with a bunch of numbers – the 99th percentile of response time for each application instance on each server in every data center, every 5 seconds. Wouldn’t it be nice to have just one? One number to rule them all… But there is an inherent problem with averaging these numbers – the result lies. It is completely incorrect. And yet it is used, over and over again.
The average of 1000 99%’ile measurements has as much to do with the 99%’ile behavior of a system as the current temperature on the surface of Mars does. Maybe less. – Gil Tene
“If you had the Maximum value recorded for each 5 second interval over a 5000 second run, and want to deduce the Maximum value for the whole run, would an average of the 1000 max values produce anything useful? No”. See this great post by Gil Tene if you don’t believe me yet.
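To make the point concrete, here is a minimal sketch with made-up numbers: nine quiet intervals and one interval with a latency spike. The `percentile` helper and the nearest-rank definition it uses are assumptions for illustration, not what any particular tool does.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical data: response times in milliseconds, 100 requests per interval.
quiet = [10] * 100
spiky = [10] * 50 + [2000] * 50      # half of the requests take 2 seconds
intervals = [quiet] * 9 + [spiky]

per_interval_p99 = [percentile(interval, 99) for interval in intervals]
average_of_p99 = sum(per_interval_p99) / len(per_interval_p99)
true_p99 = percentile([v for interval in intervals for v in interval], 99)

print(average_of_p99)  # 209.0
print(true_p99)        # 2000
```

In this data set the averaged value understates the real 99th percentile by roughly a factor of ten.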
We can’t use the average, but the need for a single number is real. When looking into metrics one wants to see the big picture, without having to focus on each server or each 5 second time period – we would like to see a single value for the past 10 minutes, 30 minutes, or a day. When speaking about response time, the ideal situation would be if we could calculate a given percentile for a (distributed) component covering a certain time period – an aggregated value. It would be even better if we could calculate all percentiles – effectively obtaining a histogram.
How to aggregate properly?
It’s worth mentioning that the distribution of a performance metric has some specific needs, which might not appear when analysing distributions of other variables. In particular, the higher the percentile, the higher the required accuracy. When looking at the 50%’ile an error of one percentile might be acceptable, but the same error would make the 99%’ile useless. Some algorithms reflect this need nicely and are especially valuable for performance-related metrics (t-digest, HdrHistogram). The problem of tracking quantiles is well known and is listed as #2 on the List of Open Problems in Sublinear Algorithms.
Upper bound
The simplest aggregate which works is the maximum. The maximum of all 99th percentiles is accurate and has a reasonable meaning – the real 99th percentile is not worse than the calculated value. But it is important to remember that a value aggregated this way is not the 99th percentile of the whole data set, it is an upper bound on it.
Despite that, there are some nice advantages to this approach – it’s very simple and has O(1) space complexity.
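A small sketch of how loose this bound can be, again on made-up numbers. The `percentile` helper (nearest-rank definition) is an assumption for illustration.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical data: only 2 slow requests out of 1000, all in the same interval.
quiet = [10] * 100
noisy = [10] * 98 + [2000] * 2
intervals = [quiet] * 9 + [noisy]

# Taking the max needs only one stored number, no matter how many intervals arrive.
upper_bound = max(percentile(interval, 99) for interval in intervals)
true_p99 = percentile([v for interval in intervals for v in interval], 99)

print(upper_bound)  # 2000
print(true_p99)     # 10
```

The maximum never reports a value better than reality, but as shown here it can be far more pessimistic than the real 99th percentile.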
Know it all
On the other side of the spectrum we can find selection algorithms. These algorithms operate on the whole data set and provide the exact value. An example of this approach is sorting n values and selecting the k-th value, where k/n equals the requested percentile.
Space complexity for this class of algorithms is O(n), which often makes them infeasible in practice when a large number of values is involved.
Buckets
Another approach is to divide the domain into multiple buckets and for each value/occurrence within the bucket boundaries increment this bucket’s count by one. This approach leads to a very efficient implementation and precise results for small domains as not many buckets are required, but for larger domains (e.g. values between 1 ns and 1 day) there is a need for some kind of varying bucket size. In this approach space complexity depends on the bucket size (accuracy) and the domain size, but not on the size of the data set.
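The property that makes buckets attractive for aggregation is that two histograms can be merged by adding their counts. The following is a minimal sketch with fixed-width buckets; the class and method names are hypothetical and do not come from any library.

```python
import math
from collections import Counter

class BucketHistogram:
    """Fixed-width bucket histogram for non-negative integer values."""

    def __init__(self, bucket_width):
        self.bucket_width = bucket_width
        self.counts = Counter()          # bucket index -> number of values

    def record(self, value):
        self.counts[value // self.bucket_width] += 1

    def merge(self, other):
        # Merging is just adding counts, so it works across instances and time windows.
        self.counts.update(other.counts)

    def percentile(self, p):
        """Upper edge of the bucket that contains the p-th percentile (nearest rank)."""
        total = sum(self.counts.values())
        target = math.ceil(p * total / 100)
        seen = 0
        for bucket in sorted(self.counts):
            seen += self.counts[bucket]
            if seen >= target:
                return (bucket + 1) * self.bucket_width
        raise ValueError("empty histogram")

# Hypothetical measurements (ms) from two application instances.
server_a = BucketHistogram(bucket_width=5)
server_b = BucketHistogram(bucket_width=5)
for v in [10] * 100:
    server_a.record(v)
for v in [10] * 50 + [2000] * 50:
    server_b.record(v)

server_a.merge(server_b)
print(server_a.percentile(50))  # 15
print(server_a.percentile(99))  # 2005
```

The answers are accurate to within one bucket width, and memory depends on the number of distinct buckets rather than on the number of recorded values.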
Reservoir sampling
Reservoir sampling is a family of algorithms for choosing a small sample of items from a list containing a very large or unknown number of items. A good example of this type is Vitter’s Algorithm R, which is described in more detail later. The required space in this case is proportional to the size of the reservoir.
Other approximate
Use of buckets or reservoir sampling leads to approximate results, but there are more advanced approximate algorithms specifically designed to be fast and require limited space. These algorithms employ sophisticated data structures and are an object of ongoing studies. The most interesting for our purpose of aggregating response time are q-digest and t-digest. Both of them require logarithmic space (in terms of the domain size). On the other hand, the Greenwald-Khanna algorithm provides bounds in terms of the size of the data set. See Quantiles over Data Streams: An Experimental Study for a great overview.
Frugal streaming
A separate category worth mentioning is “frugal streaming algorithms”, described in the paper Frugal Streaming for Estimating Quantiles. What is especially interesting about them is that they use only one unit of memory per group to compute a quantile for each group. The version computing the median is short enough to be included here in full:
Frugal1UMedian
Input: Data stream S, 1 unit of memory m
Output: m
Initialization: m = 0
for each s in S do
    if s > m then
        m = m + 1;
    else if s < m then
        m = m − 1;
    end if
end for
Note that the result is an estimation and not an exact value (the accuracy guarantees for frugal algorithms are rather weak).
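The pseudocode above translates almost line by line into Python. The function name and the `initial` parameter are hypothetical, and the sketch assumes values measured in integer units so that a step of 1 makes sense.

```python
def frugal_1u_median(stream, initial=0):
    """Estimate the median of a stream using a single number of state."""
    m = initial
    for s in stream:
        if s > m:
            m += 1      # estimate is too low, nudge it up
        elif s < m:
            m -= 1      # estimate is too high, nudge it down
    return m

print(frugal_1u_median([5] * 10))      # 5
print(frugal_1u_median([3, 1, 3, 1]))  # 1
```

Because the estimate moves by one unit per observation, it needs a stream long enough to walk from the initial value to the neighbourhood of the real median.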
Please do not treat the above list of categories as an attempt at a formal classification. That has been covered perfectly in Quantiles over Data Streams: An Experimental Study.
Aggregation in practice
Below, some of the most popular performance-related tools are listed along with short descriptions showing how (or whether) they implement quantile aggregation. For more details about these projects please refer to their documentation.
The scenario which is considered here is that one wants to capture performance metrics (execution/response time percentiles) related to some application. The application is distributed and it is running on multiple servers (there are multiple instances of the application).
Statsd
A very popular solution to metric aggregation is statsd – “a network daemon that runs on the Node.js platform and listens for statistics, like counters and timers, sent over UDP or TCP and sends aggregates to one or more pluggable backend services (e.g., Graphite)”. Metrics sent to statsd are stored inside buckets, periodically aggregated and sent to the backend service (flushed).
Whenever a new metric value is received it is assigned to a specific bucket and (in our case of measuring percentiles) stored there. Once the flush interval passes, aggregation happens. Percentile calculation is done using a naive (but precise) selection algorithm:
 at the beginning the array containing all the values is sorted
 during iteration over the array percentiles are recorded
Statsd nicely solves the problem of aggregation across application instances – every occurrence of an event is sent to the daemon where a single aggregation step computes the correct numbers. Of course this works as long as all occurrences can be handled by a single daemon instance – the space complexity is O(n), time complexity is the same as sort in Node.js – O(n log n), where n is the number of metric values.
Unfortunately aggregation over time is not covered by this solution – percentile values are sent to a backend system for later use and their form prevents any further aggregation. If the flush interval is 10 seconds then we have to look at the data with 10 second precision whenever we are analysing it. Attempts to “rescale” to a lower precision (wider intervals) of 1 hour or 1 day introduce errors and the data loses its meaning. This is likely to happen, as the downstream system is often…
Graphite
“Graphite is a highly scalable real-time graphing system. As a user, you write an application that collects numeric time-series data that you are interested in graphing, and send it to Graphite’s processing backend, carbon, which stores the data in Graphite’s specialized database. The data can then be visualized through graphite’s web interfaces”.
The Graphite backend has three main built-in mechanisms for aggregation:
 aggregation to lower-precision retentions (downsampling) – this mechanism allows defining a list of retention periods and related frequencies along with an algorithm used for aggregation. E.g. it’s possible to state that for 4 days the metric should be stored in ‘per 15 second’ units, for the next 14 days in ‘per minute’ units and after that the unit should be 1 hour. Legal aggregation methods are: average, sum, min, max, and last. The default is average.
 carbon-aggregator – “Defines a time interval (in seconds) and aggregation function (sum or average) for incoming metrics matching a certain pattern. At the end of each interval, the values received are aggregated and published to carbon-cache.py as a single metric”. The available aggregation methods are sum and average.
 aggregation during query – graphite-web (and other interfaces, e.g. Grafana) allows using a rich set of transformations to alter or combine metrics. Besides the regular average, sum, min and max, more advanced aggregating functions are provided, e.g. percentile of series, which basically computes a percentile of percentiles. Unfortunately, like the regular average, it does not guarantee any accuracy, although it might seem more reasonable.
As described earlier, averaging percentiles is a really bad approach, so none of the above methods is able to handle percentile values correctly. In addition, the default downsampling method is the average, which might result in very misleading reports without the user’s knowledge.
Graphite, unlike statsd, is not able to calculate percentile values by itself, and the proper way to work with metrics representing percentiles is to avoid downsampling and the use of carbon-aggregator.
Dropwizard Metrics
Previously known under the names ‘Yammer metrics’ and ‘Coda Hale metrics’. A great Java library for adding monitoring to applications, and to microservices in particular. It’s available by default in Dropwizard, but there are integrations with many existing tools and frameworks. It nicely separates the API for recording metrics from the actual implementation, and the implementations are switchable.
 The default implementation for histograms (and timers in particular) is an exponentially decaying reservoir. It would be great if… it worked. There are many sources describing issues in this implementation. This quote nicely underlines them:
Could you please explain what important information Metrics can miss?
Pretty much all of it. The Max, Min, the average, the 90%’ile, the 99%’ile. They are all wrong unless latency behavior happens to meet your wish that it look like a normal distribution. I don’t know of any information in there that is right… Since it’s a lossy sampling filter, the more “rare” the metric is (i.e. max and 99.9% as opposed to median), the higher the likelihood is that it is off by an order of magnitude or more.
 Another available implementation is UniformReservoir, which uses Vitter’s Algorithm R. Note – it also has the problems of the exponentially decaying reservoir mentioned above.
Algorithm R works as follows:
a. take the first n values V_{1}…V_{n} and put them into an array A
b. for each new value V_{n+i} pick a random number j between 1 and n+i
c. if j ≤ n then replace A[j] with V_{n+i}
d. repeat b–c for every subsequent value
This simple algorithm replaces elements in the array with gradually decreasing probability leading to a reservoir (the array) containing a random sample of the entire data set.
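The steps of Algorithm R can be sketched in a few lines of Python. This is an illustration of the algorithm as described here, not the UniformReservoir source code; the function name and the `rng` parameter are hypothetical.

```python
import random

def algorithm_r(stream, n, rng=random):
    """Keep a uniform random sample of at most n items from a stream of unknown length."""
    reservoir = []
    for count, value in enumerate(stream, start=1):
        if count <= n:
            reservoir.append(value)        # the first n values fill the array
        else:
            j = rng.randint(1, count)      # uniform over 1..count, both ends inclusive
            if j <= n:
                reservoir[j - 1] = value   # happens with probability n / count
    return reservoir

sample = algorithm_r(range(100_000), n=100, rng=random.Random(42))
print(len(sample))  # 100
```

Every value in the stream ends up in the reservoir with the same probability, which is exactly why rare high-latency outliers are likely to be missed by a small reservoir.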

As the above algorithms are rather poor for percentile calculation, the suggested solution is to use a Metrics Reservoir implementation backed by HdrHistogram.
HdrHistogram
High Dynamic Range Histogram – A great library by Gil Tene. Currently implementations exist for Java, C, C#, Go and Erlang. It is battle tested and provides certain desirable qualities.
Essentially HdrHistogram divides the entire domain (a predefined range of values) into multiple buckets of varying size. The bucket size depends on the order of magnitude of the stored values and ensures the requested precision (e.g. 2 significant digits).
The best overview of this library is a quote from the documentation:
[The main feature is the fact that] precision is expressed as the number of significant digits in the value recording, and provides control over value quantization behavior across the value range and the subsequent value resolution at any given level.
For example, a Histogram could be configured to track the counts of observed integer values between 0 and 3,600,000,000 while maintaining a value precision of 3 significant digits across that range. (…) This example Histogram could be used to track and analyze the counts of observed response times ranging between 1 microsecond and 1 hour in magnitude, while maintaining a value resolution of 1 microsecond up to 1 millisecond, a resolution of 1 millisecond (or better) up to one second, and a resolution of 1 second (or better) up to 1,000 seconds. At its maximum tracked value (1 hour), it would still maintain a resolution of 3.6 seconds (or better).
This post contains an interesting overview of hdrhistogram.
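The idea of buckets whose width grows with the magnitude of the value can be illustrated with a toy sketch. This is not how the library lays out its buckets internally; `quantize` is a hypothetical helper that simply keeps the leading decimal digits of a non-negative integer.

```python
from collections import Counter

def quantize(value, significant_digits=2):
    """Round a non-negative integer down so that only its leading digits are kept."""
    if value < 10 ** significant_digits:
        return value                     # small values are stored exactly
    magnitude = 10 ** (len(str(value)) - significant_digits)
    return value // magnitude * magnitude

# Hypothetical latencies in microseconds, spanning several orders of magnitude.
histogram = Counter()
for latency_us in (7, 42, 1234, 1299, 123456, 3599999):
    histogram[quantize(latency_us)] += 1

print(sorted(histogram.items()))
# [(7, 1), (42, 1), (1200, 2), (120000, 1), (3500000, 1)]
```

Values near 1 millisecond share buckets 100 microseconds wide, while values near 3.6 seconds share buckets 100 milliseconds wide, so the number of buckets grows with the number of orders of magnitude covered rather than with the size of the range.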
Turbine + Hystrix Dashboard
Turbine is a tool which provides low latency event stream aggregation. It plays well and is best used together with Hystrix Dashboard. Both tools are really great, but the dashboard shows aggregated percentile values and this triggers some suspicions.
So how exactly does it work?
 Hystrix uses a circular buffer of K buckets which implements a bucketed sliding time window. Each bucket stores the N latest measurements. Percentiles are calculated on a sample of N*K measurements.
(The red flag here is that all measurements except the latest N in each bucket are lost in a coordinated way, but let’s focus on aggregation.)
 Each application instance exposes a Server-Sent Events stream emitting “snapshots” of the metrics every second. A single event contains already calculated percentile values:
"latencyExecute":{"0":0,"25":0,"50":3,"75":12,"90":17,"95":24,"99":48,"99.5":363,"100":390}
That’s another red flag – the percentiles are calculated for this specific application instance and, as described at the beginning of this post, it is not possible to aggregate them correctly across multiple instances.
 Turbine consumes event streams from all application instances and sums them.
 Hystrix dashboard consumes the stream exposed by Turbine and divides the values by the number of application instances.
In the end we are getting an (incorrect) averaging of percentile values which is possibly very misleading.
AppDynamics
AppDynamics is one of two leading APM (Application Performance Management) tools – the other one is NewRelic. Using provided agent libraries it is possible to instrument applications to publish captured performance statistics to the central repository.
Response time percentiles are among the metrics which can be captured.
AppDynamics states in its documentation that the percentile calculation / aggregation algorithm can be chosen. Two are available:
 P² (or P-square)
 q-digest
P-square is a simple approximate algorithm proposed in 1985 by Raj Jain and Imrich Chlamtac. The idea behind it is to divide all values into 4 buckets with boundaries being the current estimates of: min, 1st quartile, median, 3rd quartile and max. Those 5 markers are updated appropriately after every new value is processed.
While P-square is fast, it is much less accurate than the second algorithm – q-digest.
Q-digest was introduced in 2004 and until recently was the state of the art for small domains (which is usually the case when measuring response time). In essence q-digest is a binary tree over the whole data set, but additionally values with low frequencies can propagate to parent nodes. This post is a very good description of the algorithm.
Elasticsearch
Elasticsearch (a highly scalable open-source full-text search and analytics engine) provides quite a sophisticated set of aggregations. Two of them are of interest to us – percentiles aggregation and percentile ranks aggregation.
Percentiles aggregation uses the t-digest algorithm introduced in 2013 by Ted Dunning and Otmar Ertl. T-digest extends Q-digest to values in ℝ (the set of real numbers). The authors claim that: “The t-digest provides accurate on-line estimates of a variety of rank-based statistics including quantiles and trimmed mean. The algorithm is simple and empirically demonstrated to exhibit accuracy as predicted on theoretical grounds. It is also suitable for parallel applications. Moreover, the t-digest algorithm is inherently on-line while the Q-digest is an on-line adaptation of an off-line algorithm”. Another well-known project using t-digest is Apache Mahout.
The second feature (percentile ranks aggregation) is the inverse function of percentiles aggregation – given a value it finds which percentile is closest to it. E.g. if the 95th percentile of a given distribution is 30, then the percentile rank of 30 is 0.95.
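The inverse relationship between the two functions can be shown with an exact computation on a small sorted list. This sketch does not use t-digest or the Elasticsearch API; both helper functions are hypothetical and assume the nearest-rank definition of a percentile.

```python
import bisect
import math

def percentile(ordered, p):
    """Nearest-rank percentile of an already sorted list."""
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def percentile_rank(ordered, value):
    """Fraction of values in an already sorted list that are <= value."""
    return bisect.bisect_right(ordered, value) / len(ordered)

data = list(range(1, 101))           # hypothetical response times 1..100 ms

p95 = percentile(data, 95)
print(p95)                           # 95
print(percentile_rank(data, p95))    # 0.95
```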
NewRelic
NewRelic is the second leading APM tool. It works in a fashion similar to AppDynamics.
The algorithm used for percentile calculation and aggregation is called “Random” by the authors. One of its main qualities is simplicity:
 the data set is divided into fixed-size buckets
 whenever two buckets exist on the same level L_{n} they are merged into a bucket on level L_{n+1}. The merge operation consists of the following steps:
  gather values from both buckets and sort them
  randomly choose even or odd
  put all even or all odd items (depending on the result of the previous step) into the new bucket
As a result, a bucket on level L_{n} is created by merging, and represents, 2^{n}*bucket_size original values.
Querying of such a structure is done indirectly. First we need a way to calculate the rank of an element (the percentile corresponding to a given value). rank(x) is equal to the sum of rank(x) within each existing bucket multiplied by 2^{level} of that bucket:
rank(x) = Σ_{b} 2^{level(b)} * rank_{b}(x)
The value for a requested percentile can then be found via binary search.
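The merge and rank steps described above can be sketched as follows. This is an illustration built only from the description in this post, not NewRelic’s code; the class name, method names and the integer-domain binary search are all assumptions.

```python
import bisect
import random

class RandomSketch:
    """Toy version of the "Random" algorithm for integer values."""

    def __init__(self, bucket_size, rng=None):
        self.bucket_size = bucket_size
        self.rng = rng or random.Random()
        self.current = []      # level-0 bucket that is still being filled
        self.levels = {}       # level -> one full, sorted bucket

    def add(self, value):
        self.current.append(value)
        if len(self.current) == self.bucket_size:
            self._carry(sorted(self.current), 0)
            self.current = []

    def _carry(self, bucket, level):
        # Works like binary addition: two buckets on one level become one on the next.
        while level in self.levels:
            merged = sorted(self.levels.pop(level) + bucket)
            offset = self.rng.randint(0, 1)    # randomly keep the even or the odd items
            bucket = merged[offset::2]
            level += 1
        self.levels[level] = bucket

    def count(self):
        return len(self.current) + sum(2 ** lvl * len(b) for lvl, b in self.levels.items())

    def rank(self, x):
        """Estimated number of recorded values that are <= x."""
        total = bisect.bisect_right(sorted(self.current), x)
        for level, bucket in self.levels.items():
            total += 2 ** level * bisect.bisect_right(bucket, x)
        return total

    def percentile(self, p, lo, hi):
        """Binary search for the smallest integer in [lo, hi] whose rank covers p percent."""
        target = p * self.count() / 100
        while lo < hi:
            mid = (lo + hi) // 2
            if self.rank(mid) >= target:
                hi = mid
            else:
                lo = mid + 1
        return lo

sketch = RandomSketch(bucket_size=8, rng=random.Random(1))
for value in range(64):
    sketch.add(value)
print(sketch.count())     # 64
print(sketch.rank(63))    # 64
```

After 64 values only a single level-3 bucket of 8 items remains, each item standing in for 8 original values. Which items survive depends on the random even/odd choices, so percentile answers are estimates.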
Summary
You’ve reached the end of this brief overview of quantile aggregation algorithms and their implementations. I hope that some of these details will help you better understand the data shown on various performance dashboards and look more critically at those presenting aggregated values for response time.