Finding tail latency bottlenecks

Finding bottlenecks for tail request latency (the 99th percentile, or P99) of a system is difficult.

One of the reasons is that we need to find out what happened during the time frame (often milliseconds) of each of those 1% slow requests. Profiling tools that show CPU/IO give you the average case, the overall macro picture of the system. To find the root cause, however, we need to focus on exactly those milliseconds when the system showed high latency.

Remember that if high latency occurs randomly for 100 msec out of every 10 seconds during a 1-hour test run, that is 1% of the time, so it impacts your P99 or even your P95. Here I would like to share my experience debugging exactly this kind of problem.

Let's take the example of read latency of a database. I had written a read performance test that issued many read requests to the database and computed the latency. I scheduled multiple test runs. Most of the runs had a P99 under 1 msec, while in a few runs it was as high as 7 msec.
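The test harness itself is not shown in this post, but as a rough idea, a minimal sketch of such a read test might look like the following (the request count, key space, and ReadFromDatabaseAsync stub are my assumptions, not the actual test code):

```csharp
// Hypothetical sketch of the read test: measure per-request latency
// and compute P99. ReadFromDatabaseAsync is a stand-in for the real
// database read call, which is not part of this post.
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class ReadLatencyTest
{
    static async Task Main()
    {
        const int requestCount = 100_000;
        var latenciesMsec = new double[requestCount];

        for (int i = 0; i < requestCount; i++)
        {
            var sw = Stopwatch.StartNew();
            await ReadFromDatabaseAsync("key-" + (i % 10_000));
            sw.Stop();
            latenciesMsec[i] = sw.Elapsed.TotalMilliseconds;
        }

        // Nearest-rank P99: 99% of requests were at least this fast.
        Array.Sort(latenciesMsec);
        double p99 = latenciesMsec[(int)Math.Ceiling(requestCount * 0.99) - 1];
        Console.WriteLine($"P99 read latency: {p99:F2} msec");
    }

    // Placeholder for the actual database read; returns immediately here.
    static Task ReadFromDatabaseAsync(string key) => Task.CompletedTask;
}
```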

For some background, let's talk about the database (db) itself. It has an in-memory cache of key-value pairs. On a cache miss, we read from the db file. The database is written in C#, and the cache is a C# ConcurrentDictionary. That's it. :)
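In code, the read path looks roughly like this (a simplified sketch; the class name, file name, and on-disk lookup are made up for illustration, and the real implementation is more involved):

```csharp
// Simplified sketch of the read path: check the in-memory
// ConcurrentDictionary cache, fall back to the db file on a miss.
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class KeyValueStore
{
    private readonly ConcurrentDictionary<string, byte[]> _cache =
        new ConcurrentDictionary<string, byte[]>();
    private readonly string _dbFilePath = "data.db";   // illustrative path

    public async Task<byte[]> ReadAsync(string key)
    {
        // Cache hit: no I/O needed.
        if (_cache.TryGetValue(key, out var cached))
            return cached;

        // Cache miss: read the value from the db file and populate the cache.
        byte[] value = await ReadValueFromFileAsync(key);
        _cache.TryAdd(key, value);
        return value;
    }

    // Placeholder for the real on-disk lookup: a real store would seek to the
    // record's offset in the file rather than reading the whole file.
    private async Task<byte[]> ReadValueFromFileAsync(string key)
    {
        using (var fs = new FileStream(_dbFilePath, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, bufferSize: 4096, useAsync: true))
        {
            var buffer = new byte[fs.Length];
            int read = 0;
            while (read < buffer.Length)
                read += await fs.ReadAsync(buffer, read, buffer.Length - read);
            return buffer;
        }
    }
}
```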

At this point, I profiled the app using PerfView (my favorite performance analysis tool on Windows). I looked at the overall CPU profile, which shows how much time different functions took. CPU time was nicely distributed among the functions, so I didn't spend time optimizing individual functions (remember, profilers show overall CPU work).

I knew file I/O could be a bottleneck. So I looked at the "File I/O Stacks" view, and seeing a high "Inc %" of time spent in the file system, I convinced myself that file I/O was the bottleneck.
At that point, it seemed like this was the best we could do.

[Screenshot: file I/O suspected as causing the high latency]

I talked about this with my team members, and they found my conclusions unconvincing. We next wanted to look into what happened during the ~10 msec of the slowest requests.

PerfView has a Start/Stop Activity feature that can track the progress of a request across async Tasks (more on it in a future blog). It looked useful for what I was trying to do.

So, my plan was:

  1. Use the Start/Stop Activity feature to track down the slow requests.
  2. Find out why they are slow from the profile by looking at the CPU during those requests... (fingers crossed)

To find the slow requests, I looked at the "Events" view in PerfView, selected my custom request start/stop ETW events, put "DURATION_MSEC, ActivityID" in "Columns To Display", and sorted by DURATION_MSEC. The "Time Msec" column gives the end time of a request at its "Read/Stop" event; the start time is Time Msec minus DURATION_MSEC. Please see the snapshot below.

[Screenshot: Start/Stop Activity feature in PerfView showing the start time of a request, the duration of the request, and the activity ID]
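The post does not show how these custom Read/Start and Read/Stop events are emitted. One common way to produce such paired ETW events, so that PerfView's Events view and Start/Stop Activity tracking can correlate them, is an EventSource with Start/Stop opcodes. The provider name, event names, and the DURATION_MSEC payload field below are my assumptions, not the actual event definitions from the test:

```csharp
// Illustrative EventSource emitting paired Read/Start and Read/Stop ETW
// events, with the request duration carried on the Stop event so that the
// PerfView Events view can sort requests by DURATION_MSEC.
using System.Diagnostics.Tracing;

[EventSource(Name = "MyDb-ReadRequests")]   // assumed provider name
sealed class RequestEventSource : EventSource
{
    public static readonly RequestEventSource Log = new RequestEventSource();

    // Nested Tasks class is the EventSource convention for defining tasks.
    public static class Tasks
    {
        public const EventTask Read = (EventTask)1;
    }

    [Event(1, Task = Tasks.Read, Opcode = EventOpcode.Start)]
    public void ReadStart(string key) { WriteEvent(1, key); }

    [Event(2, Task = Tasks.Read, Opcode = EventOpcode.Stop)]
    public void ReadStop(string key, double DURATION_MSEC)
    {
        WriteEvent(2, key, DURATION_MSEC);
    }
}
```

Each read is then wrapped in `Log.ReadStart(key)` / `Log.ReadStop(key, elapsedMsec)` calls around the timed section, and the DURATION_MSEC payload is what shows up in the "Columns To Display" box above.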

I then looked at what was running on the CPU during some of the slow requests by restricting PerfView's CPU sample view to the start and end time of each request.

It turned out to be Garbage Collection (GC). [The overall CPU profile did not show GC functions among the top 40 functions, which is the main takeaway of this blog.]

For example, for one 9 msec slow read request, GC was running on the CPU most of the time:

[Screenshot: garbage collection running on the CPU during the 9 msec of a slow request]

I experimented with different GC settings, and we were able to reduce the P99 of our test app from 7 msec to less than 1 msec.

The app was using Workstation GC (notice WKS in the stack trace above). We switched to Server GC and the P99 latency came down to 3 msec. We then disabled GC completely during the test run, and latency dropped below 1 msec.
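The post doesn't list the exact switches we used. For reference, Server GC is typically enabled through configuration (e.g. `<gcServer enabled="true"/>` in App.config for .NET Framework, or `<ServerGarbageCollection>true</ServerGarbageCollection>` in the project file for .NET Core), and one way to keep GC out of a measurement window is the no-GC-region API. The sketch below is illustrative, not the actual test harness:

```csharp
// Sketch: suppress GC during the measured section of a test run using a
// no-GC region. The 64 MB budget and RunReadTest are assumptions; the call
// throws if the requested budget exceeds what the runtime can reserve.
using System;
using System.Runtime;

class GcControlSketch
{
    static void Main()
    {
        bool noGc = GC.TryStartNoGCRegion(64 * 1024 * 1024);

        try
        {
            RunReadTest();   // hypothetical: issue reads and record latencies
        }
        finally
        {
            // Only end the region if we actually entered it and it is still
            // active (allocating past the budget would have ended it already).
            if (noGc && GCSettings.LatencyMode == GCLatencyMode.NoGCRegion)
                GC.EndNoGCRegion();
        }
    }

    static void RunReadTest() { /* issue reads, record latencies */ }
}
```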

This convinced us that there are no bottlenecks in the read path other than Garbage Collection. Maoni Stephens from the .NET GC team has great documentation on tuning apps for GC impact.

Final numbers:

| Server GC | GC running during test run | P99      |
|-----------|----------------------------|----------|
| Disabled  | Yes                        | 7 msec   |
| Enabled   | Yes                        | 3 msec   |
| Enabled   | No                         | < 1 msec |


Please share your thoughts and any tips on profiling for tail latencies.

In future blogs, we will look into the Start/Stop Activity feature in PerfView and other techniques for improving latency percentiles.
