Latency

I’m going to tell you a sad story in 4 sentences.

This story is as old as the startup world.

  1. You write a little app with one Redis as a KV store, and everything works fine.
  2. You blink…
  3. … and you now have 10 of them.
  4. … and you access them all sequentially, one by one, on every request.

So let’s talk about how we measure latency, why sequential calls are not great, and what we can do about it.


How do we measure latency

When we measure latency to a backend, including Redis, we will see a chart similar to this one: a828c859fafd4552be5330a509673c81.png

We often simplify things by saying that it’s a “lognormal” distribution, but if you try to fit a lognormal distribution, you will find that the real-world distribution has a very unpleasant tail that doesn’t really want to be fitted.

However, the lognormal distribution sure is much prettier to look at: 7e0703669ce812f3f7b8acab6b2c3fee.png

Since we started speaking of tails, how do we measure one?

Easy! We can sort all our measurements, take the lowest 50% of latencies, and ask: what is the highest latency in this group? We usually want this number for the 50th, 95th, and 99th percentiles, and sometimes the 99.9th. For short, we call them p50, p95, p99, and p999.

e9819ceb2c8349143a72ce8eed1feeae.png
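
A minimal sketch of that definition in Python (nearest-rank method; the measurement values below are made up):

```python
import math

def percentile(latencies_ms, p):
    """Highest latency among the lowest p% of measurements (nearest-rank method)."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(len(ordered) * p / 100)  # how many samples fall into the lowest p%
    return ordered[max(rank, 1) - 1]

# Made-up measurements in milliseconds.
measurements = [12.1, 15.3, 11.8, 250.0, 14.2, 13.9, 90.5, 12.7, 16.0, 13.1]
for p in (50, 95, 99):
    print(f"p{p} = {percentile(measurements, p)} ms")
```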

The poorly fitted, but much prettier, chart for the lognormal distribution looks like this: 78efccb0cb5e031e3e08cfd673420a3f.png

Overall p99 latency

When you plot p99 over time for different backends, it will most likely look something like this if the servers have different loads:

2664c3b4e73d537fa3fa82bf3408cc24.png

When there are a lot of lines in there, fight the urge to average over p99. It sure looks prettier: 1982ba8fb66cf5d28708cd8ae159d683.png

However, it destroys information about load balancing among your servers.

One of the “right ways” to calculate the system’s p99 latency is to use a sample of latencies from across the whole system. But this can be burdensome.

Maybe the answer is not to calculate it at all.

A good reason not to use the average is to think about a couple of situations where averaging gives very wrong results:

High p99 in part of the system

| Node   | p99 latency | RPS  |
|--------|-------------|------|
| Node 1 | 100 ms      | 1000 |
| Node 2 | 200 ms      | 1000 |

If you sort the combined latencies, the second node’s values will most likely sit so far to the right of the first node’s that the overall tail is dominated by the second node. So the “true” p99 would be close to 200 ms, while the average is just 150 ms. Not great.
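
A quick simulation makes this concrete (a sketch with made-up lognormal parameters chosen to roughly match the table):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two equal-traffic nodes with synthetic latencies; parameters are made up.
node1 = rng.lognormal(mean=3.91, sigma=0.30, size=100_000)      # p99 around 100 ms
node2 = 2 * rng.lognormal(mean=3.91, sigma=0.30, size=100_000)  # twice as slow, p99 around 200 ms

avg_of_p99s = (np.percentile(node1, 99) + np.percentile(node2, 99)) / 2
true_p99 = np.percentile(np.concatenate([node1, node2]), 99)

print(f"average of per-node p99s: {avg_of_p99s:.0f} ms")
print(f"p99 of the combined traffic: {true_p99:.0f} ms")  # dominated by the slow node's tail
```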

Unbalanced load

The second case is an unbalanced load profile, where one (or some) of your nodes handles a higher load. That can easily happen in a heterogeneous system with bare-metal servers.

Consider two nodes from our distribution:

| Node   | p99 latency | RPS   |
|--------|-------------|-------|
| Node 1 | 151 ms      | 1000  |
| Node 2 | 78 ms       | 10000 |

The average of that is (151 + 78) / 2 = 114.5 ms. But in reality, the combined p99 can be just around 80 ms, because Node 2, with its much higher RPS, “dominates” the distribution.
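
The same kind of sketch for the unbalanced case: pool per-node samples in proportion to each node’s RPS instead of averaging their p99s (made-up parameters, chosen to roughly match the table):

```python
import numpy as np

rng = np.random.default_rng(1)

# Sample counts are proportional to each node's RPS (1000 vs 10000).
node1 = rng.lognormal(mean=3.39, sigma=0.70, size=10_000)    # slow node, p99 around 151 ms
node2 = rng.lognormal(mean=3.19, sigma=0.50, size=100_000)   # busy node, p99 around 78 ms

avg_of_p99s = (np.percentile(node1, 99) + np.percentile(node2, 99)) / 2
weighted_p99 = np.percentile(np.concatenate([node1, node2]), 99)

print(f"average of per-node p99s: {avg_of_p99s:.0f} ms")
print(f"p99 over RPS-weighted samples: {weighted_p99:.0f} ms")  # pulled toward the busy node
```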

Latency: from linear growth to exponential decay with number of backends

We’ve already discussed how we measure latency and that we use p95, p99, and sometimes p999.

Let’s look at what happens when we have multiple backends. For simplicity, I’m going to use the same latency profile for all backends.

And you can think of these backends as Redis instances.

Sequential access is not great.

Imagine that when a request comes in, we handle it like this:

  1. Some logic
  2. Access microservice
  3. Some extra logic
  …
  N. Response

eca140fdef760cf451bbd3312db61426.png

In this case, tail latencies will stack up in an expected linear manner.

dea0d741d7b4735c9ea55f70b190870e.png
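
A Monte Carlo sketch of this (one synthetic lognormal latency profile for every backend, made-up parameters): with sequential access the request’s IO time is the sum of all backend calls, and its p99 grows roughly linearly with the number of backends.

```python
import numpy as np

rng = np.random.default_rng(2)
n_requests = 100_000

for n_backends in (1, 2, 4, 8):
    # Sequential access: the request's IO time is the SUM of all backend calls.
    calls = rng.lognormal(mean=3.0, sigma=0.5, size=(n_backends, n_requests))
    total = calls.sum(axis=0)
    print(f"{n_backends} sequential backend(s) -> p99 = {np.percentile(total, 99):.0f} ms")
```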

Can we do anything about it?

Yes, we can! If we refactor our app to call all downstream microservices in parallel, our IO latency is capped by the slowest response of a single service!

20cb95c377d371c3dcb250515bbcd2a4.png

In that case, tail latency would grow roughly logarithmically with the number of backends:

a4fcf903c449269d5ffbbc0e7e0d9f92.png
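
The same sketch for the parallel case: the request’s IO time is now the maximum (slowest) of the backend calls, and its p99 grows much more slowly.

```python
import numpy as np

rng = np.random.default_rng(3)
n_requests = 100_000

for n_backends in (1, 2, 4, 8):
    # Parallel access: the request's IO time is the MAX (slowest) of all backend calls.
    calls = rng.lognormal(mean=3.0, sigma=0.5, size=(n_backends, n_requests))
    total = calls.max(axis=0)
    print(f"{n_backends} parallel backend(s) -> p99 = {np.percentile(total, 99):.0f} ms")
```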

If we look at what sequential IO gives us, the picture becomes clearer: 5e942b66546fadbc6dc1d23b2d360e29.png

It’s still slower than having just one microservice to access, but it’s a very significant positive change.

Can we do something more about it?

Parallel microservices access

Yes, we can! If we can use multiple replicas, we can make parallel calls to all of them and, as soon as we receive the fastest answer, cancel all other in-flight requests. That way we get a very nice p99:

b4e03fcf8c2e2639915c8e6437c8afcc.png

Please also note that we get most of the gains from having just a few (about 3) read replicas.
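
A minimal asyncio sketch of this “race the replicas” pattern; fetch_from_replica is a hypothetical stand-in for your real Redis/microservice call:

```python
import asyncio
import random

async def fetch_from_replica(replica, key):
    # Hypothetical stand-in for a real call: simulate a lognormal-ish latency.
    await asyncio.sleep(random.lognormvariate(-3.5, 0.5))  # ~30 ms median, made up
    return f"value of {key} from {replica}"

async def race_replicas(replicas, key):
    """Send the same request to every replica, take the first answer, cancel the rest."""
    tasks = [asyncio.create_task(fetch_from_replica(r, key)) for r in replicas]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # cancel the slower in-flight requests
    await asyncio.gather(*pending, return_exceptions=True)
    return next(iter(done)).result()  # the fastest reply wins

print(asyncio.run(race_replicas(["replica-1", "replica-2", "replica-3"], "user:42")))
```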

Improving sequential microservices access

So for cases when we can’t change the sequential nature of our code, we can add read replicas and expect some improvement: we send each request to all read replicas in parallel, and as soon as we get the first (i.e. the fastest) response, we move on.

9fd2e11c22ba295137ea154a1c7cebb4.png

In general, it looks like for practical purposes a single backend does quite well with 2 read replicas.

But it’s not great.

70c9b10ead33b3d4afa03b1763b1a54f.png
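
Extending the same Monte Carlo sketch: each sequential backend call now takes the minimum over its read replicas, and the request’s IO time is still the sum over the backends (made-up parameters, 8 backends).

```python
import numpy as np

rng = np.random.default_rng(4)
n_requests, n_backends = 100_000, 8

for n_replicas in (1, 2, 3):
    # Each call races n_replicas read replicas and keeps the fastest (min) response.
    calls = rng.lognormal(mean=3.0, sigma=0.5,
                          size=(n_backends, n_replicas, n_requests)).min(axis=1)
    total = calls.sum(axis=0)  # sequential: the calls still add up
    print(f"{n_replicas} replica(s) per backend -> p99 = {np.percentile(total, 99):.0f} ms")
```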

Improving parallel microservices access

If you also race multiple read replicas for each of the parallel calls, you get a remarkable improvement in terms of latency.

6f6a2905100b60581705119b2ab17595.png
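
And the same sketch for the parallel case: race the replicas of each backend (min), then wait for the slowest backend (max).

```python
import numpy as np

rng = np.random.default_rng(5)
n_requests, n_backends = 100_000, 8

for n_replicas in (1, 2, 3):
    # Race n_replicas per backend (min), then wait for the slowest backend (max).
    calls = rng.lognormal(mean=3.0, sigma=0.5,
                          size=(n_backends, n_replicas, n_requests)).min(axis=1)
    total = calls.max(axis=0)  # parallel: only the slowest backend call matters
    print(f"{n_replicas} replica(s) per backend -> p99 = {np.percentile(total, 99):.0f} ms")
```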

There are a few points to consider:

  1. We are already fast with parallel calls. Do we need to be faster? Is it worth it?
  2. We likely have replicas for fault tolerance. Can we use them as read-only replicas?
  3. Network overhead. If you are close to network stack saturation on your nodes, does it make sense to double or triple the load?

020bbcd177cf2405973b25bdfcabb690.png