On January 31st, between 17:30 UTC and 00:30 UTC, GitLab experienced a database incident in which 300GB of customer data was lost without any option of recovery and the service was unavailable for 18 hours. On Feb. 10, GitLab published a post-mortem analysis detailing the stream of events leading to the failure, as well as their root-cause analysis. The analysis is candid and highlights most of the variables that could have caused the incident. It is also humble, which is a core value for a successful analysis. GitLab should be commended, too, for their overall transparency during the event - it was reported in real time in a public Google Doc and a live YouTube stream.
That being said, the analysis skips some of the key variables which led to the event, and therefore fails to uncover the true root-cause of the incident. That, in turn, leads to an incomplete plan for preventing it from happening again. We attribute these flaws to common cognitive biases we very often see in root-cause analysis processes (not only in the tech industry!). These biases primarily affect the analysis of the human errors which played a part in the incident: they lead the analysis to emphasize the errors which occurred during the event or directly led to it (the direct-cause), while neglecting the background environment which allowed the event to happen (the root-cause). The biases are Outcome Bias, Confirmation Bias, and Mechanistic Reasoning. In this article, we’ll explore what they mean, how they affected GitLab’s analysis, what the deeper root-cause of the outage is, and how it could be prevented in the future.
The root-cause of the database outage can be attributed to three kinds of factors - human failure, machine failure, and the mutual effect of each kind of failure on the other. Together, these make up what we call the ‘System’ in our analysis.
The GitLab analysis thoroughly described which machine components of the system failed, at what time and the reasons for those failures:
- The database was overloaded because of spam and a background job to delete an employee’s user, which in turn led to a WAL (Write-Ahead Logging) replication failure between the primary database and the secondary database.
- The background job to delete the employee’s user was a result of an abuse report submitted by a troll, which the system reacted to as a legitimate report.
- The daily backup using pg_dump was failing on every run because the version of the command used by default was different from the version of the database the system was running.
- The backup was failing silently because the error alerts from that process were sent by email without the DMARC authentication that the notification service required in order to deliver them.
In addition, GitLab described two main human failures which led to the outage:
- An engineer deleted the data directory from the wrong database, which is considered a slip (a failure due to lack of attention).
- There was no validation of the different database backup procedures or documentation detailing the correct and incorrect behaviors, which is considered a mistake (a failure due to rules, procedures, and prior knowledge).
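The silent backup failure and the missing validation described above suggest checking the backup artifact itself rather than relying on error emails. The following is a minimal sketch of such a check, not GitLab's procedure: the function name, the directory layout, and the age and size thresholds are all assumptions made for illustration.

```python
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class BackupStatus:
    ok: bool
    reason: str


def check_latest_backup(
    backup_dir: str,
    max_age_hours: float = 24.0,        # assumed threshold: one backup per day
    min_size_bytes: int = 1_000_000,    # assumed threshold: a real dump is not tiny
    now: Optional[float] = None,
) -> BackupStatus:
    """Inspect the newest file in backup_dir and report whether it looks usable."""
    now = time.time() if now is None else now
    directory = Path(backup_dir)
    files = [p for p in directory.iterdir() if p.is_file()] if directory.is_dir() else []
    if not files:
        return BackupStatus(False, "no backup files found")

    # The newest file by modification time is treated as the latest backup.
    latest = max(files, key=lambda p: p.stat().st_mtime)
    stat = latest.stat()

    age_hours = (now - stat.st_mtime) / 3600
    if age_hours > max_age_hours:
        return BackupStatus(False, f"{latest.name} is {age_hours:.1f}h old")
    if stat.st_size < min_size_bytes:
        return BackupStatus(False, f"{latest.name} is only {stat.st_size} bytes")
    return BackupStatus(True, f"{latest.name} looks healthy")
```

A check like this only shows that a recent, plausibly sized file exists. Validating that a backup can actually be restored would still require a periodic test restore.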
When analyzing failures, and especially when looking for ways to prevent them in the future, we need to assume that the failures that occurred were logical behaviors of the System given its configuration. Such failures arise from the mutual effects that machine failures and human failures have on each other. Not doing so, i.e. assuming that the failures were the result of malfunctioning components of the system, is called Mechanistic Reasoning. It is logical that the WAL replication between the databases failed, because the configuration was not designed for the overload the server faced. The reporting system behaved logically in response to an abuse report against the GitLab employee. And it is logical that the engineer deleted the data directory on the wrong database, because nothing indicated that he was working on the primary one, which would have helped prevent the human error.
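To make the missing indicator concrete, here is a minimal sketch of a guard that shows the operator which host and role a destructive command is about to run on, and proceeds only if the hostname is retyped. The function name, the role labels, and the example hostnames are hypothetical, and the `ask` parameter exists only so the prompt can be replaced in tests.

```python
import socket
from typing import Callable, Optional


def confirm_destructive_action(
    action: str,
    role: str,
    hostname: Optional[str] = None,
    ask: Callable[[str], str] = input,
) -> bool:
    """Return True only if the operator retypes the host the action will run on."""
    host = hostname or socket.gethostname()
    prompt = (
        f"About to {action} on {host} (role: {role.upper()}).\n"
        "Type the hostname to continue: "
    )
    # Retyping the hostname forces the operator to read which machine this is.
    return ask(prompt).strip() == host


if __name__ == "__main__":
    # Hypothetical usage: the role would come from the host's own configuration.
    if confirm_destructive_action("delete the data directory", role="primary"):
        print("Confirmed.")
    else:
        print("Aborted: hostname did not match.")
```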
Under the assumption that the System behaved logically given its configuration, all the direct causes of the outage are true, but they derive from a deeper cause: the environment (including methodologies, practices, and perceptions) in which the system was configured. In this environment, DevOps engineers face a constant fear of “Alert Flooding” by different machine behaviors (some of which are critical, but most of which are inconsequential), which leads them to devise ways to pre-screen alerts. This fear is a direct result of the tools available to DevOps engineers today, which mainly support log management by way of pre-screening and thus limit the range of solutions engineers can imagine. In addition, this environment includes the lack of validation for the different backup procedures, which is why the backup and server configurations were as they were when the event occurred.
Outcome Bias and Confirmation Bias
As stated above, GitLab described the different reasons for the problem, and concluded that the root-cause was, on one hand, the lack of ownership for the backup procedures and, on the other, the abuse report system scheduling an employee’s user for removal. But what solution is proposed? GitLab proposed 14 different actions it would take to prevent a similar event from happening in the future. However, all these solutions, as well as the definition of the root-cause, fall prey to Outcome Bias, a cognitive bias in which the quality of a decision is judged by its outcome. They deal only with preventing the specific actions which led to this database outage, not with the system as a whole, and so they would not prevent a database outage caused by different actions in the system. Moreover, they fall prey to Confirmation Bias, a cognitive bias which is the tendency to search for, interpret, favor, and recall information in a way that confirms one's preexisting beliefs or hypotheses, because they offer only solutions that are available with existing and known tools and methods.
We offer four simple solutions that can be implemented technologically, that address the root-cause of the GitLab data outage, and that can prevent such outages from happening in the future:
- Make the erroneous human decisions that currently make sense while interacting with a machine stop making sense in the system, thus preventing the human from making a wrong decision (e.g. a different color scheme for each environment, a prompt reminding the human which server they are on or which file they are about to delete, etc.).
- The documentation of knowledge from different employees regarding the procedures and behaviors of the machine needs to be integrated into the system, thus making it more accessible to the whole organization (in order to prevent dependency on a single engineer who knows the meaning of the different machine behaviors).
- Address the engineers’ “Fear of Alert Flooding” by improving the monitoring stack and, just as importantly, by tracking and controlling the rate and quality of alerts. We recommend taking a look at what the people at HumanOps have to say about this.
- Monitor when good behaviors stop. This runs counter to conventional system monitoring, which usually alerts only when bad behaviors occur. However, the absence of a good behavior is often a better indicator that some error has occurred. One way to do so is by using Health Dashboards, but they only deal with system metrics. Another option is using a smart log analysis tool.
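The last recommendation, monitoring when good behaviors stop, can be sketched as a simple heartbeat check: each job reports its successes, and the monitor flags any job that has been silent for longer than expected. This is an illustrative sketch only; the class name, the job names, and the silence limits are assumptions, not part of any specific monitoring product.

```python
import time
from typing import Dict, List, Optional


class HeartbeatMonitor:
    """Flags jobs whose last reported success is older than their allowed silence."""

    def __init__(self) -> None:
        self._max_silence: Dict[str, float] = {}  # job name -> allowed seconds of silence
        self._last_success: Dict[str, float] = {}  # job name -> timestamp of last success

    def expect(self, name: str, max_silence_seconds: float, now: Optional[float] = None) -> None:
        """Register a job; the clock for its first heartbeat starts immediately."""
        self._max_silence[name] = max_silence_seconds
        self._last_success[name] = time.time() if now is None else now

    def record_success(self, name: str, now: Optional[float] = None) -> None:
        """Called by the job itself each time it completes successfully."""
        self._last_success[name] = time.time() if now is None else now

    def overdue(self, now: Optional[float] = None) -> List[str]:
        """Return the registered jobs that have been silent for too long."""
        current = time.time() if now is None else now
        return sorted(
            name
            for name, limit in self._max_silence.items()
            if current - self._last_success[name] > limit
        )
```

For example, a nightly backup registered with a 26-hour limit would appear in `overdue()` if it stopped completing, even if no error message was ever delivered.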
Root-cause analysis is a perplexing and arduous endeavor, and as such is only rarely carried out correctly. We’ve been studying how different Operations teams undertake this process for years, and have applied many of the lessons learned and much of the common sense gathered by these teams to the solution we build - Loom.