When talking about statistics and monitoring percentiles immediately come to mind. What are those and why should you care?

What is a percentile?

Maybe you have not heard about percentiles, buy you definitely know what median is.

Suppose, you have 8 siblings and each of your family members has the following heights:

Martin Jeremy Anna Mom Mark You Bob Dad Simon Jenny
1m 1.1m 1.5m 1.5m 1.6m 1.6m 1.7m 1.7m 1.8m 1.9m

You and your brother Mark have median height in your family. It means that you are separating the lower half of your family (like your mum) from the higher half (represented by your dad).

According to Wikipedia:

Median is the number separating the higher half of a data sample from the lower half

Actually, median is a special case of a percentile called 50th percentile. Lets again consult Wiki on that:

Percentile is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall

If we take Simon as an example, he is the 90th percentile of your family. Only 10% of your family is higher than him (actually it is only Jenny).

Percentiles in monitoring

Now that we have definition out of our way lets discuss how can we use percentiles in practice. Percentiles are exceptionally useful when talking about latency, however they are also a great tool for explaining decomposition of any other metric.

As an example lets take a look at the following chart:

Percentile chart

If somebody asked you to describe the latency of requests served by your system a chart like that proves to be invaluable. It is quite straightforward. You can explain that 95% of your traffic is served with at most 240ms. 95% of requests are handled even better: 60ms tops.

Percentiles misuse

Percentiles are very useful. No one will argue with that. Unfortunately, they can be also very misleading and are often (over/mis)used. And actually, same goes for all statistical measures.

Teenage pregnancy drops after age of 25

If you would like to learn more about percentiles I strongly suggest to watch the presentation How NOT to Measure Latency by Gil Tene. You will not regret it.

Max vs 99th percentile

Going back to our example it is most likely that our maximum values are much greater than any 99th’ish percentile:

Percentile chart

If you have not measured max values you could have been unintentionally lying about the latency the whole time. It feels bad but at the same time quite enlightening that you were misinformed (and probably also misinforming others) about the performance of your system.

TLDR;