14 October 2021

The Five Lies Analysis

The Five Whys analysis is a popular root cause investigation technique with a simple premise that is asking why five times can yield the answer, which is the root cause. While doing so could be a helpful exercise, blindly applying the technique often leads to a suboptimal result. In this article, I’ll explore some of its tradeoffs with a fictional story of an incident in production at Acme Corp.

crew

I’m sorry. The words made sense, but the sarcastic tone did not.

— Lana Kane

Meet the crew

Let’s start by setting the scene. A team of engineers is working on a new Acme News website. Suddenly, an incident occurs, which gets resolved quickly. A lengthy investigation then follows the incident to understand what happened and how similar issues can be prevented in the future.

The standard process of achieving this would be to write a postmortem, which can outline the story and propose a set of action items. A postmortem template can simplify the process by including all necessary steps to identify the root cause and might suggest using an analysis, for example, the five whys analysis. The question of who is ultimately responsible for authoring the postmortem is tricky, and the answer depends on many factors. Let’s take a look at what everyone on the team would write if they were the author.

Meet Alice

alice

As we know, Alice is an on-call engineer. Her official title is SRE, and her job involves a deep understanding of the service state based on the available metrics. She has a solid understanding of service health indicators and a good grasp of the monitoring infrastructure. When she started the investigation, she noticed that the application was overloaded due to high CPU usage. As recommended by the postmortem template, she began the five whys analysis.

Five Whys analysis:

  • Why were the users not able to open the website? Because the application was overloaded, and it started returning errors.

  • Why was it overloaded? Because the CPU went to 100%, and the instances didn’t have enough capacity to process the requests.

  • Why did the CPU usage jump so high? Because we didn’t notice it earlier when it climbed to 90% and didn’t fix the issue in time.

  • Why did we not notice it? Because there were no alerts for 90% CPU usage.

  • Why were there no alerts? Because the alerts were removed two years ago after a migration to a new monitoring system.

As the Five Whys analysis was completed, the root cause was "identified". The main action item in the postmortem was to re-instantiate high utilisation - 90% CPU alerts to catch similar situations earlier and mitigate the issue before users are effect. Now that all of the questions are answered, and the lessons are learned, Alice is off to another incident.

Meet Bob

bob

Bob is the backend engineer who implemented the feature in the first place and flipped the flag. He has the most context around the change, so it would make sense for Bob to write a postmortem. Bob can use the Five Whys analysis to complete the postmortem and understand how to roll out all future changes smoother.

Five Whys analysis:

  • Why were the users not able to open the website? Because the application was overloaded.

  • Why was it overloaded? Because CPU jumped to 100%.

  • Why did it jump? Because the application went into an infinite GC loop.

  • Why did GC start consuming all of the CPU? Because the application ran out of heap memory.

  • Why did it run out of heap memory? Because a new request-level cache was added with unlimited size.

The new feature that Bob implemented enabled users to see article recommendations on the Acme News website. To reduce the load to the recommendation service, Bob added a cache. However, Bob forgot to add a maximum size parameter. Bob then started thinking about how this could be prevented in the future. He decided to change the Cache interface to ensure that max size always has to be provided. This task was added as the main action item to the postmortem.

Meet Charlie

charlie

Charlie is a frontend engineer who implemented the feature on the frontend side and is responsible for Acme Corp News overall look. Charlie is also a product owner of the new Acme Corp Recommendations ™, who cares a lot about the product’s availability, so it might make sense for her to start a postmortem.

Five Whys analysis:

  • Why were the users not able to open the website? Because the application was overloaded.

  • Why was it overloaded? Because there was an influx of requests that caused high CPU usage.

  • Why was there an influx? Because the users started sending a lot of requests.

  • Why did they do this? Because each request was retried up to 10 times.

  • Why were there so many retries at the same time? Because there was no exponential backoff.

Charlie’s mental model of the application is primarily based around the request flows. When looking at the application logs, she noticed many retries from the same set of users. When looking at the code that fetches the recommendations, she saw that all of the requests were retried immediately without any exponential backoff or jitter. Hence, she put it as an action item with the plan to implement it by Friday.

Meet Dave

dave

Dave is an engineer on the cloud infrastructure team, and he is responsible for the overall health of the Acme News Corp cluster. Dave cares a lot about reliability, so he might volunteer to perform the investigation and complete the postmortem.

Five Whys analysis:

  • Why were the users not able to open the website? Because the application was overloaded.

  • Why was it overloaded? Because the load on every instance increased sharply.

  • Why was the load per instance so high? Because we didn’t have enough instance capacity to distribute the load.

  • Why not? Because our autoscaling policies didn’t scale fast enough and only added a few more instances.

  • Why? Because the autoscaling cooldown time was too large.

For Dave, each task in the cluster is essentially a black box that can handle a certain number of requests. For each type of task, Dave manages policies that define how the applications can scale up or down. After looking at the graphs, Dave notices that even though a few instances joined the fleet when the load increased sharply, the cooldown prevented autoscaling from bringing up even more tasks to handle the load. Dave writes down a task to update autoscaling policies to allow for very aggressive scale-ups to handle the load as the main action item.

Meet Erin

erin

Erin is an engineer on the core libraries team, and she is working on a shared set of core libraries. When she first heard about the incident and that it was related to the cache library, she volunteered to write a postmortem as the owner who doesn’t shy away from responsibility.

Five Whys analysis:

  • Why were the users not able to open the website? Because the application was overloaded.

  • Why was it overloaded? Because it was out of memory

  • Why was the application out of memory? Because the cache took too much memory.

  • Why did the cache take up that much memory? Because our cache libraries are not optimised for memory usage.

  • Why is the cache library not optimised for memory usage? Because we always valued throughput above memory usage.

Erin wrote the original cache library. The library has been a great success as it is incredibly optimised for Acme Corp’s use cases. Nonetheless, so far, the focus was the throughput and not the memory usage. After looking at the library from a different perspective, Erin noticed a few opportunities to share the underlying data structures and wrote down an action item to provide a version of the cache explicitly optimised for memory usage.

The analysis

Let’s zoom out and take a look at the results of the exercise. There are five people and five different stories. The key takeaway here is the results of the why the Five Whys analysis are not repeatable. The outcome heavily depends on the angle at which the person performing the analysis looks at the incident. Dave’s point of view on what happened is very different to Charlie’s point of view as they work on very different parts of the system day-to-day, which heavily influences the analysis.

But even looking at a single investigation, we can notice that the number “five” itself is interesting. It’s catchy, it’s easy to remember, but it unnecessarily limits the depth of the investigation. Maybe Alice should’ve done some code archeology to figure out why the alerts were removed after the migration. Bob could have dug deeper to collect some of the heap dumps to understand the distribution of data in the cache. Why stop at five if it makes sense to go further?

The root cause

As we saw, the Five Whys analysis has downsides like any other approach. Yet, it can be used successfully to dive deeper! So are these issues big enough to invalidate the approach?

The real problem reveals itself when the technique becomes a part of a template. Once it’s in the template, the shape of the analysis becomes solidified. Whether it’s a part of an engineering postmortem template or even a government worksheet - the template restricts the analysis by forcing its limits onto the investigator. The downsides of the analysis become the downsides of the postmortem.

Well, if not using five whys, then how would we find the root cause? First, it’s important to understand that rarely there is such thing as a root cause. Often there would be multiple causes that can be seen when looking at the incident from different angles.

Second, and more importantly, it’s not about finding the root cause at all. It’s about getting an inside out understanding of what happened — and then, based on this understanding, defining what actions can be taken to ensure that the incident doesn’t repeat in the future. Often, these action items can be very distant from the root cause. For instance, maybe Acme News could restrict the blast radius and ensure that the page loads even if recommendations are not available. These mitigations might not always come up when searching for a root cause. This poses a question - if not the Five Whys analysis, then what should be in the template? It can’t be empty after all!

Before looking for a replacement, it’s important to understand that the template should encourage building an understanding and a detailed story rather than searching for a root cause. In The Infinite Hows, John Allspaw highlights that "Learning is the goal. Any prevention depends on that learning." and proposes using Debriefing Facilitation Prompts from The Field Guide To Understanding Human Error, by Sidney Dekker.

Having the prompts could be a great starting point as it asks you to take a look at the system from multiple angles and understand how each individual part of the system behaved during the incident. Moreover, the approach could further be expanded with the questions from the environment - the questions that can help build the story and facilitate learning. There could even be different prompts for different components that were affected. If Alice or Bob, or any other member of the crew were to use the prompts, they would immediately have to consider how the system behaved from multiple angles giving the investigation the necessary depth.

Conclusion

Five Whys analysis is a useful technique that is easy to remember and a helpful reminder to always look deeper. However, using it directly would only yield a cause, not the root cause, as rarely there is such thing as the root cause. The approach could still be used to deepen the scope of searching for action items. Still, structuring this tool into a template as the primary driver of a root cause analysis can cause more harm than good.

Instead, the template could facilitate learning and help build a story by asking questions that can help understand the whole story deeply. Then, a comprehensive list of action items can stem from the story preventing not only incidents with a similar "root cause" but a whole class of failures.

Thank you to

  • Paul Tune for reviewing the article.

  • You for reading the article.

Share this article


Tags: postmortems