Measure what is measurable and make measurable what is not so. – Galileo
These days many teams monitor their systems with Graphite or something similar. Such tools gather time-series data, store it, aggregate it and expose computed values via an API or a dashboard.
One particularly interesting piece of information is the response time, often used as the main performance indicator in web applications or a metric against which SLA is verified. We look at this dashboard, see that the average response time is below a certain threshold and sleep well. Or – do we?
The average is tempting. After all, even if we measure various response time percentiles we end up with a bunch of numbers: the 99th percentile of response time for each application instance, for each server, in every data center, every 5 seconds. Wouldn’t it be nice to have just one? One number to rule them all… But there is an inherent problem with averaging response time percentiles: the result lies. It is completely incorrect. And yet it is used, over and over again.
The average of 1000 99%’ile measurements has as much to do with the 99%’ile behavior of a system as the current temperature on the surface of Mars does. Maybe less. – Gil Tene
“If you had the Maximum value recorded for each 5 second interval over a 5000 second run, and want to deduce the Maximum value for the whole run, would an average of the 1000 max values produce anything useful? No”. See this great post by Gil Tene if you don’t believe me yet.
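A tiny worked example makes the problem concrete. The sketch below is only an illustration with made-up numbers: the `percentile` helper, the interval sizes and the latencies are all assumptions, not data from a real system. Nine quiet intervals are followed by one busy, slow interval.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) in integer math
    return ordered[rank - 1]


# Hypothetical data: nine quiet 5-second intervals with 100 requests of 10 ms
# each, and one busy interval with 1000 requests of 500 ms each.
intervals = [[10] * 100 for _ in range(9)] + [[500] * 1000]

per_interval_p99 = [percentile(interval, 99) for interval in intervals]
average_of_p99s = sum(per_interval_p99) / len(per_interval_p99)

all_values = [value for interval in intervals for value in interval]
true_p99 = percentile(all_values, 99)

print(average_of_p99s)  # 59.0
print(true_p99)         # 500
```

The averaged value treats every interval as equally important, although more than half of all requests happened in the slow one.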
We can’t use the average, but the need for a single number is real. When looking into metrics one wants to see the big picture, without having to focus on each server or each 5 second time period. We would like to see a single value for the past 10 minutes, 30 minutes, or a day. When speaking about response time, the ideal situation would be if we could calculate a given percentile for a (distributed) component covering a certain time period: an aggregated value. It would be even better if we could calculate all percentiles, effectively obtaining a histogram.
It’s worth mentioning that the distribution of a performance metric has some specific needs, which might not appear when analysing the distribution of other variables. In particular, the higher the percentile, the higher the accuracy that is required. When looking at the 50%’ile, an error of one percentile might be acceptable, but the same error would make the 99%’ile useless. Some algorithms nicely reflect this need and are especially valuable for performance-related metrics (t-digest, hdrhistogram). The problem of tracking quantiles is well known and is listed as #2 on the List of Open Problems in Sublinear Algorithms.
The simplest aggregate which works is the maximum. The maximum of all 99th percentiles is accurate and has a reasonable meaning: the real 99th percentile is not worse than the calculated value. But it is important to remember that a value aggregated this way is not the 99th percentile of the whole data set; it is an upper bound on it.
Despite that there are some nice advantages of this approach – it’s very simple and has O(1) space complexity.
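The upper-bound property can be checked with a short sketch. The data here is synthetic (exponentially distributed latencies from a seeded generator, purely an assumption for illustration), and `percentile` is a hypothetical nearest-rank helper. Note that only a single running value has to be kept, which is where the O(1) space comes from.

```python
import random


def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    return ordered[(p * len(ordered) + 99) // 100 - 1]


rng = random.Random(42)
# Hypothetical data: 20 intervals, 200 response times (ms) in each.
intervals = [[rng.expovariate(1 / 50) for _ in range(200)] for _ in range(20)]

upper_bound = float("-inf")
for interval in intervals:
    # One number of state, no matter how many intervals we have seen.
    upper_bound = max(upper_bound, percentile(interval, 99))

true_p99 = percentile([v for interval in intervals for v in interval], 99)
print(upper_bound >= true_p99)  # True
```

With the nearest-rank definition the inequality always holds: in every interval at least 99% of the values are at or below that interval's 99th percentile, so at least 99% of all values are at or below the maximum of those percentiles.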
On the other side of the spectrum we can find selection algorithms. These algorithms operate on the whole dataset and provide the exact value. An example of this approach is sorting n values and selecting the k-th number, where k/n equals the requested percentile.
Space complexity for this class of algorithms is O(n), which often makes them infeasible in practice when a large number of values is involved.
Another approach is to divide the domain into multiple buckets and for each value/occurrence within the bucket boundaries increment this bucket’s count by one. This approach leads to a very efficient implementation and precise results for small domains as not many buckets are required, but for larger domains (e.g. values between 1 ns and 1 day) there is a need for some kind of varying bucket size. In this approach space complexity depends on the bucket size (accuracy) and the domain size, but not on the size of the data set.
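A minimal sketch of the bucket approach, assuming fixed-width buckets over a known range (the class name `BucketHistogram` and its parameters are made up for this example). Because only counts are stored, two histograms with the same layout can be merged by adding the counts, which is exactly what we need for aggregation across servers or time periods.

```python
import math


class BucketHistogram:
    """Fixed-width buckets over [lowest, highest); values outside are clamped."""

    def __init__(self, lowest, highest, bucket_width):
        self.lowest = lowest
        self.bucket_width = bucket_width
        self.counts = [0] * math.ceil((highest - lowest) / bucket_width)
        self.total = 0

    def record(self, value):
        index = int((value - self.lowest) // self.bucket_width)
        index = min(max(index, 0), len(self.counts) - 1)
        self.counts[index] += 1
        self.total += 1

    def merge(self, other):
        # Only valid when both histograms use the same bucket layout.
        for index, count in enumerate(other.counts):
            self.counts[index] += count
        self.total += other.total

    def percentile(self, p):
        """Upper edge of the bucket containing the nearest-rank percentile."""
        if self.total == 0:
            raise ValueError("histogram is empty")
        rank = (p * self.total + 99) // 100
        seen = 0
        for index, count in enumerate(self.counts):
            seen += count
            if seen >= rank:
                return self.lowest + (index + 1) * self.bucket_width


# Two hypothetical servers, each seeing half of the values 0..999.
server_a = BucketHistogram(0, 1000, 10)
server_b = BucketHistogram(0, 1000, 10)
for value in range(0, 1000, 2):
    server_a.record(value)
for value in range(1, 1000, 2):
    server_b.record(value)

server_a.merge(server_b)
print(server_a.percentile(99))  # 990 (the exact nearest-rank value is 989)
```

The answer is off by at most one bucket width, and the memory used is the number of buckets, regardless of how many values were recorded.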
Reservoir sampling is a family of algorithms for choosing a small sample of items from a list containing a very large or unknown number of items. A good example of this type is Vitter’s Algorithm R, which is described in more detail later. The required space in this case is proportional to the size of the reservoir.
Use of buckets or reservoir sampling leads to approximate results, but there are more advanced approximate algorithms specifically designed to be fast and require limited space. These algorithms employ sophisticated data structures and are the subject of ongoing studies. The most interesting for our purpose of aggregating response time are q-digest and t-digest. Both of them require logarithmic space (in terms of the domain size). On the other hand, the Greenwald-Khanna algorithm provides bounds in terms of the size of the data set. See Quantiles over Data Streams: An Experimental Study for a great overview.
A separate category worth mentioning are “frugal streaming algorithms” described in the paper Frugal Streaming for Estimating Quantiles. What is especially interesting about them is that they use only one unit of memory per group to compute a quantile for each group. The version computing the median is sufficiently short to be included here fully:
Frugal-1U-Median
Input: Data stream S, 1 unit of memory m
Output: m
Initialization m = 0
for each s in S do
    if s > m then
        m = m + 1;
    else if s < m then
        m = m − 1;
    end if
end for
Note that the result is an estimation and not an exact value (the accuracy guarantees for frugal algorithms are rather weak).
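For those who prefer running code, here is a direct Python transcription of the pseudocode above. The function name and the sample stream are my own choices for the example; the step size of 1 means the stream is assumed to contain values on an integer-like scale.

```python
import random


def frugal_1u_median(stream):
    """Estimate the median of a stream using a single number of state."""
    m = 0
    for s in stream:
        if s > m:
            m += 1
        elif s < m:
            m -= 1
    return m


# Hypothetical stream: 10,000 response times (ms) drawn uniformly from 1..99.
rng = random.Random(1)
stream = (rng.randint(1, 99) for _ in range(10_000))
print(frugal_1u_median(stream))  # an estimate that drifts around the true median of 50
```

Since the estimate moves by only one unit per item, it needs a number of items proportional to the distance from 0 to the median before it gets anywhere near the right value.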
Please do not treat the above list of categories as an attempt at a formal classification. This has been covered perfectly in Quantiles over Data Streams: An Experimental Study.
Below, some of the most popular performance-related tools are listed along with short descriptions showing how they implement quantile aggregation (or whether they implement it at all). For more details about these projects please refer to their documentation.
The scenario which is considered here is that one wants to capture performance metrics (execution/response time percentiles) related to some application. The application is distributed and it is running on multiple servers (there are multiple instances of the application).
A very popular solution to metric aggregation is statsd – “a network daemon that runs on the Node.js platform and listens for statistics, like counters and timers, sent over UDP or TCP and sends aggregates to one or more pluggable backend services (e.g., Graphite)”. Metrics sent to statsd are stored inside buckets, periodically aggregated and sent to the backend service (flushed).
Whenever a new metric value is received it is assigned to a specific bucket and (in our case of measuring percentiles) stored there. Once the flush interval passes, aggregation happens. Percentile calculation is done using a naive (but precise) selection algorithm:
- at the beginning the array containing all the values is sorted
- during iteration over the array percentiles are recorded
Statsd nicely solves the problem of aggregation across application instances – every occurrence of an event is sent to the daemon where a single aggregation step computes the correct numbers. Of course this works as long as all occurrences can be handled by a single daemon instance – the space complexity is O(n), time complexity is the same as sort in Node.js – O(n log n), where n is the number of metric values.
Unfortunately aggregation over time is not covered by this solution. Percentile values are sent to a backend system for later use and their form prevents any further aggregation. If the flush interval is 10 seconds then we have to look at the data with 10 second precision whenever we are analysing it. Attempts to “rescale” to a lower precision (wider intervals) of 1 hour or 1 day introduce errors and the data loses its meaning. This is likely to happen, as the downstream system is often…
“Graphite is a highly scalable real-time graphing system. As a user, you write an application that collects numeric time-series data that you are interested in graphing, and send it to Graphite’s processing backend, carbon, which stores the data in Graphite’s specialized database. The data can then be visualized through graphite’s web interfaces”.
The Graphite backend has three main built-in mechanisms for aggregation:
- aggregation to lower-precision retentions (downsampling) – this mechanism allows defining a list of retention periods and related frequencies along with an algorithm used for aggregation. E.g. it’s possible to state that for 4 days the metric should be stored in ‘per 15 second’ units, for the next 14 days in ‘per minute’ units, and after that the unit should be 1 hour. Legal aggregation methods are: average, sum, min, max, and last. The default is average.
- carbon-aggregator – “Defines a time interval (in seconds) and aggregation function (sum or average) for incoming metrics matching a certain pattern. At the end of each interval, the values received are aggregated and published to carbon-cache.py as a single metric”. The available aggregation methods are sum and average.
- aggregation during query – graphite-web (and other interfaces, e.g. Grafana) allows using a rich set of transformations to alter or combine metrics. Besides the regular average, sum, min or max, more advanced aggregating functions are also provided, e.g. percentile of series, which basically computes a percentile of percentiles. Unfortunately, like the regular average, it does not guarantee any accuracy, although it might seem to be more reasonable.
As described earlier, averaging percentiles is a really bad approach, so none of the above methods is able to handle percentile values correctly. In addition, the default downsampling method is the average, which might result in very misleading reports without the user’s knowledge.
Graphite, unlike statsd, is not able to calculate percentile values by itself, and the proper way to work with metrics representing percentiles is to avoid both downsampling and the use of carbon-aggregator.
Dropwizard Metrics was previously known under the name ‘Yammer metrics’ or ‘Coda Hale metrics’. It is a great Java library for adding monitoring to applications, and to microservices in particular. It’s available by default in Dropwizard, and there are integrations with many existing tools and frameworks. It nicely separates the API for recording metrics from the actual implementation, and the implementations are switchable.
- The default implementation for histograms (and timers in particular) is an exponentially decaying reservoir. It would be great if… it worked. There are many sources describing issues in this implementation. This quote nicely underlines them:
Could you please explain what important information Metrics can miss?
Pretty much all of it. The Max, Min, the average, the 90%’ile, the 99%’ile. They are all wrong unless latency behavior happens to meet your wish that it look like a normal distribution. I don’t know of any information in there that is right… Since it’s a lossy sampling filter, the more “rare” the metric is (i.e. max and 99.9% as opposed to median), the higher the likelihood is that it is off by an order of magnitude or more.
- Another available implementation is UniformReservoir, which uses Vitter’s Algorithm R. Note that it also has the problems of the exponentially decaying reservoir mentioned above.
Algorithm R works as follows:
take the first n values V1…Vn and put them into an array A
for each new value Vn+i pick a random number j between 1 and n+i (inclusive)
if j ≤ n then replace element A[j] with Vn+i
This simple algorithm replaces elements in the array with gradually decreasing probability leading to a reservoir (the array) containing a random sample of the entire data set.
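Algorithm R fits in a dozen lines of Python. This is a sketch of the textbook algorithm, not the code used by the Metrics library; the function name and the optional `rng` parameter are assumptions made for the example.

```python
import random


def algorithm_r(stream, n, rng=None):
    """Return a uniform random sample of up to n items from the stream."""
    rng = rng or random.Random()
    reservoir = []
    for count, value in enumerate(stream, start=1):
        if count <= n:
            # The first n values fill the reservoir.
            reservoir.append(value)
        else:
            # Keep the new value with probability n / count.
            j = rng.randint(1, count)  # uniform in 1..count, both ends included
            if j <= n:
                reservoir[j - 1] = value
    return reservoir


sample = algorithm_r(range(10_000), 100, random.Random(1))
print(len(sample))  # 100
```

Percentiles are then computed on the sample only. With 100 slots just one of them, on average, holds a value from the top 1% of the data, which hints at why high percentiles estimated this way are unreliable.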
As the above algorithms are rather poor for percentile calculation, the suggested solution is to use a Metrics Reservoir implementation backed by HdrHistogram.
High Dynamic Range Histogram – A great library by Gil Tene. Currently implementations exist for Java, C, C#, Go and Erlang. It is battle tested and provides certain desirable qualities.
Essentially hdrhistogram divides the entire domain (a predefined range of values) into multiple buckets of varying size. The bucket size depends on the order of magnitude of the values stored and ensures proper precision (e.g. of 2 significant digits).
The best overview of this library is a quote from the documentation:
[The main feature is the fact that] precision is expressed as the number of significant digits in the value recording, and provides control over value quantization behavior across the value range and the subsequent value resolution at any given level.
For example, a Histogram could be configured to track the counts of observed integer values between 0 and 3,600,000,000 while maintaining a value precision of 3 significant digits across that range. (…) This example Histogram could be used to track and analyze the counts of observed response times ranging between 1 microsecond and 1 hour in magnitude, while maintaining a value resolution of 1 microsecond up to 1 millisecond, a resolution of 1 millisecond (or better) up to one second, and a resolution of 1 second (or better) up to 1,000 seconds. At its maximum tracked value (1 hour), it would still maintain a resolution of 3.6 seconds (or better).
This post contains an interesting overview of hdrhistogram.
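To get a feeling for buckets whose width grows with the magnitude of the value, here is a deliberately naive sketch that rounds values down to a given number of significant decimal digits. It is my own illustration of the idea, not how HdrHistogram is implemented (the real library uses its own, far more efficient layout and gives stronger resolution guarantees); all names below are made up.

```python
from collections import Counter


def bucket_lower_bound(value, significant_digits):
    """Round a non-negative integer down to the given number of significant digits."""
    if value < 10 ** significant_digits:
        return value
    scale = 10 ** (len(str(value)) - significant_digits)
    return value - value % scale


class SignificantDigitsHistogram:
    def __init__(self, significant_digits):
        self.significant_digits = significant_digits
        self.counts = Counter()  # bucket lower bound -> number of values

    def record(self, value):
        self.counts[bucket_lower_bound(value, self.significant_digits)] += 1

    def percentile(self, p):
        """Lower bound of the bucket holding the nearest-rank percentile."""
        total = sum(self.counts.values())
        if total == 0:
            raise ValueError("histogram is empty")
        rank = (p * total + 99) // 100
        seen = 0
        for lower in sorted(self.counts):
            seen += self.counts[lower]
            if seen >= rank:
                return lower


print(bucket_lower_bound(123_456, 3))  # 123000

histogram = SignificantDigitsHistogram(significant_digits=2)
for microseconds in range(1, 100_001):
    histogram.record(microseconds)
print(histogram.percentile(99))    # 99000
print(len(histogram.counts))       # far fewer buckets than recorded values
```

Small values keep their exact value while large ones share wide buckets, so the relative error stays roughly constant across the whole range.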
Turbine is a tool which provides low latency event stream aggregation. It plays well and is best used together with Hystrix Dashboard. Both tools are really great, but the dashboard shows aggregated percentile values and this triggers some suspicions.
So how exactly does it work?
- Hystrix uses a circular buffer of K buckets which implements a bucketed sliding time window. Each bucket stores the N latest measurements. Percentiles are calculated on a sample of N*K measurements.
(The red flag here is that all measurements except N are lost in a coordinated way, but let’s focus on aggregation)
- Each application instance exposes a Server-Sent Events stream emitting “snapshots” of the metrics every second. A single event contains already calculated percentile values.
That’s another red flag – the percentiles are calculated for this specific application instance and as described at the beginning of this post it is not possible to aggregate them correctly across multiple instances.
- Turbine consumes event streams from all application instances and sums them.
- Hystrix dashboard consumes the stream exposed by Turbine and divides the value by the number of application instances
In the end we are getting an (incorrect) averaging of percentile values which is possibly very misleading.
AppDynamics is one of two leading APM (Application Performance Management) tools – the other one is NewRelic. Using provided agent libraries it is possible to instrument applications to publish captured performance statistics to the central repository.
One of the metrics which can be captured are response time percentiles.
AppDynamics states in its documentation that the percentile calculation / aggregation algorithm can be chosen; two are available:
- P2 (or P-square)
P-square is a simple approximate algorithm proposed in 1985 by Raj Jain and Imrich Chlamtac. The idea behind it is to divide all values into 4 buckets with boundaries being the current estimates of: min, 1st quartile, median, 3rd quartile and max. Those 5 markers are updated appropriately after every new value is processed.
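The sketch below follows the marker-update rules from the original P-square paper as I understand them; it is not AppDynamics code, and the class and method names are invented for the example. Five marker heights `q` and their positions `n` are kept, and after every value the three inner markers are nudged towards their desired positions using a parabolic (or, as a fallback, linear) interpolation.

```python
import random


class PSquare:
    """P-square estimator of a single quantile p, with 0 < p < 1."""

    def __init__(self, p):
        self.q = []                      # marker heights (estimated values)
        self.n = [1, 2, 3, 4, 5]         # actual marker positions
        self.desired = [1, 1 + 2 * p, 1 + 4 * p, 3 + 2 * p, 5]
        self.step = [0, p / 2, p, (1 + p) / 2, 1]

    def add(self, x):
        q, n = self.q, self.n
        if len(q) < 5:                   # the first five values seed the markers
            q.append(x)
            q.sort()
            return
        if x < q[0]:
            q[0], k = x, 0
        elif x >= q[4]:
            q[4], k = x, 3
        else:
            k = max(i for i in range(4) if q[i] <= x)
        for i in range(k + 1, 5):        # markers above x move one position up
            n[i] += 1
        for i in range(5):
            self.desired[i] += self.step[i]
        for i in (1, 2, 3):
            d = self.desired[i] - n[i]
            if (d >= 1 and n[i + 1] - n[i] > 1) or (d <= -1 and n[i - 1] - n[i] < -1):
                d = 1 if d > 0 else -1
                candidate = self._parabolic(i, d)
                if not q[i - 1] < candidate < q[i + 1]:
                    # Fall back to linear interpolation towards the neighbour.
                    candidate = q[i] + d * (q[i + d] - q[i]) / (n[i + d] - n[i])
                q[i] = candidate
                n[i] += d

    def _parabolic(self, i, d):
        q, n = self.q, self.n
        return q[i] + d / (n[i + 1] - n[i - 1]) * (
            (n[i] - n[i - 1] + d) * (q[i + 1] - q[i]) / (n[i + 1] - n[i])
            + (n[i + 1] - n[i] - d) * (q[i] - q[i - 1]) / (n[i] - n[i - 1])
        )

    def estimate(self):
        if len(self.q) < 5:
            raise ValueError("need at least five observations")
        return self.q[2]                 # the middle marker tracks the quantile


rng = random.Random(3)
median = PSquare(0.5)
for _ in range(10_000):
    median.add(rng.random())
print(median.estimate())  # close to 0.5 for this uniform sample
```

For p = 0.5 the five markers correspond to the min, 1st quartile, median, 3rd quartile and max mentioned above; for other quantiles the inner markers sit at p/2, p and (1+p)/2.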
While P-square is fast, it is much less accurate than the second algorithm – q-digest.
Q-digest was introduced in 2004 and until recently was the state of the art for small domains (which is usually the case when measuring response time). In essence q-digest is a binary tree over the whole data set, but additionally values with low frequencies can propagate to parent nodes. This post is a very good description of the algorithm.
Elasticsearch (a highly scalable open-source full-text search and analytics engine) provides a quite sophisticated set of aggregations. Two of them are of interest to us – the percentiles aggregation and the percentile ranks aggregation.
The percentiles aggregation uses the t-digest algorithm introduced in 2013 by Ted Dunning and Otmar Ertl. T-digest extends Q-digest to values in ℝ (the set of real numbers). The authors claim that: “The t-digest provides accurate on-line estimates of a variety of rank-based statistics including quantiles and trimmed mean. The algorithm is simple and empirically demonstrated to exhibit accuracy as predicted on theoretical grounds. It is also suitable for parallel applications. Moreover, the t-digest algorithm is inherently on-line while the Q-digest is an on-line adaptation of an off-line algorithm”. Another well-known project using t-digest is Apache Mahout.
The second mentioned feature (the percentile ranks aggregation) is the inverse function of the percentiles aggregation – given a value it finds which percentile is closest to it. E.g. if the 95th percentile of a given distribution is 30, then the percentile rank of 30 is 0.95.
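The inverse relationship is easy to see on exact data. The snippet below is a plain-Python illustration on a sorted list, not the t-digest based computation Elasticsearch performs; `percentile_rank` is a hypothetical helper.

```python
from bisect import bisect_right


def percentile_rank(sorted_values, x):
    """Fraction of values that are less than or equal to x."""
    return bisect_right(sorted_values, x) / len(sorted_values)


values = list(range(1, 101))     # 1, 2, ..., 100
p95 = values[95 - 1]             # nearest-rank 95th percentile of 100 values
print(p95)                       # 95
print(percentile_rank(values, p95))  # 0.95
```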
NewRelic is the second leading APM tool. It works in a fashion similar to AppDynamics.
The algorithm used for percentile calculation and aggregation is called “Random” by the authors. One of its main qualities is simplicity:
- the data set is divided into fixed-size buckets
- whenever two buckets exist on the same level Ln they are merged into a bucket on the level Ln+1. The merge operation consists of the following steps:
gather values from both buckets and sort them
randomly choose even or odd
put all even or all odd items (depending on the result of previous step) into new bucket
As a result, a bucket on the level Ln is created by merging 2^n * bucket_size values and represents them.
Querying such a structure is done indirectly. First we need a way to calculate the rank of an element (a percentile corresponding to the given value). The rank(x) is equal to the sum of the rank(x) within each existing bucket multiplied by 2^level of the bucket: rank(x) = sum over all buckets b of 2^level(b) * rank_b(x).
The exact percentile can be found via binary search.
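Putting the merge and query steps together gives a short sketch like the one below. It is my own reading of the description above, not NewRelic's code: the class name, the handling of the partially filled level-0 bucket and the integer search range passed to `percentile` are all assumptions.

```python
import random
from bisect import bisect_right


class RandomSketch:
    def __init__(self, bucket_size, rng=None):
        self.bucket_size = bucket_size
        self.rng = rng or random.Random()
        self.current = []   # level-0 bucket that is still being filled
        self.full = {}      # level -> sorted full bucket waiting for a partner

    def add(self, value):
        self.current.append(value)
        if len(self.current) == self.bucket_size:
            self._carry(sorted(self.current), 0)
            self.current = []

    def _carry(self, bucket, level):
        # Merge upwards for as long as a bucket already exists on this level.
        while level in self.full:
            merged = sorted(self.full.pop(level) + bucket)
            offset = self.rng.randint(0, 1)  # randomly choose even or odd
            bucket = merged[offset::2]       # keep every second item
            level += 1
        self.full[level] = bucket

    def count(self):
        stored = sum(len(b) * 2 ** level for level, b in self.full.items())
        return stored + len(self.current)

    def rank(self, x):
        """Estimated number of recorded values that are <= x."""
        estimate = sum(1 for value in self.current if value <= x)
        for level, bucket in self.full.items():
            estimate += bisect_right(bucket, x) * 2 ** level
        return estimate

    def percentile(self, p, low, high):
        """Binary search over the integer range [low, high] using rank()."""
        target = p / 100 * self.count()
        while low < high:
            mid = (low + high) // 2
            if self.rank(mid) >= target:
                high = mid
            else:
                low = mid + 1
        return low


rng = random.Random(7)
values = list(range(1, 1025))
rng.shuffle(values)

sketch = RandomSketch(bucket_size=64, rng=rng)
for value in values:
    sketch.add(value)

print(sketch.count())                  # 1024
print(sketch.percentile(50, 0, 1024))  # an estimate near the true median of 512
```

After 1024 values only a single bucket of 64 items is left, each of them standing for 16 original values. Every merge can shift a rank by at most the weight of one item, which is why the estimate stays close to the real value.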
You’ve reached the end of this brief overview of quantile aggregation algorithms and their implementations. I hope that some of these details will help you better understand the data shown on various performance dashboards and look more critically at those presenting aggregated values for response time.