Root cause analysis in software development: Why there are no project manager and pilot errors

The pilots were to blame for the crash.

When a plane crashes, the media very often write about pilot error as the cause of a tragic accident shortly after the event - as was the case with the crash of the Auntie Ju 52 on 4 August 2018 near Flims (GR) with 20 fatalities. When the final report of the Swiss Transportation Safety Investigation Board (STSB) on the crash was published at the beginning of 2021, I was amazed by media headlines such as "Pilot error led to 'Auntie Ju' crash of 2018". (SRF)

Explanatory video on the Ju 52 accident of 4 August 2018 from the Swiss Transportation Safety Investigation Board STSB

In its mandate for prevention, the STSB must comment on all risks and hazards that have had an impact on an investigated incident and should be avoided in the future, according to the final report on page 75. The STSB explicitly mentions that determining the causes and factors that led to the accident is by no means an assignment of blame. This is an enormously important statement and task when investigating and analyzing events, which is unfortunately almost always ignored by the media.

Interested readers can find the causes and contributing factors of the accident in detail in Final Report No. 2370 of the Swiss Transportation Safety Investigation Board (STSB).

Reasons for the crash
The direct causes of the crash
- The flight crew operated the aircraft in a high-risk manner by flying it into a narrow valley at low altitude and without the possibility of an alternative flight path
- The flight crew chose a dangerously low flight speed in relation to the flight path
Direct contributing factors to the crash
- The flight crew was accustomed to not complying with recognized rules for safe flight operations and to taking high risks
- The aircraft involved in the accident was operated with a centre of gravity that was outside the rear limit, which facilitated the loss of control
Systemic causes of the crash
- The conditions for commercial air transport operations of the aircraft weren't met against the background of the legal bases applicable at the time of the accident.
Systemic contributing factors for the crash
- The mass and centre of gravity calculation of the Ju 52 of the flight operations company could only be carried out incorrectly due to inadequate work equipment
- In particular, the flight crews of the flight operations company, who were trained as air force pilots, were in the habit of systematically failing to comply with recognized aviation rules and taking high risks when flying the Ju 52
- The flight operations company was unable to recognize or prevent the defects and risks that occurred during operations and the frequent breaches of rules committed by its flight crews
- Numerous incidents, including several serious incidents, weren't reported to the responsible bodies and authorities, which meant that they were unable to take any safety-improving measures
- The supervisory authority was sometimes unable to recognize the numerous operational deficiencies and risks or to take corrective action
Other risks that contributed to the crash

The STSB report identifies the following factors that didn't have a direct impact on the accident but whose elimination would improve flight safety (so-called "factors to risk").
- The aircraft wasn't in a proper technical condition
- The aircraft no longer achieved the flight performance originally demonstrated
- The maintenance of the aircraft of the flight operations company wasn't organized in a targeted manner
- The training of the flight crews with regard to the specific requirements of flight operations and in crew resource management was inadequate
- The flight crews weren't familiarized with all critical situations with regard to the behavior of the aircraft in the event of a stall
- The supervisory authority didn't recognize or correct numerous technical deficiencies
- The expertise of the persons deployed by the flight operations company, the maintenance organizations and the supervisory authority was inadequate in some cases

At first glance, all the causes and factors, especially those relating to the direct causes of the accident, clearly point to pilot error. However, this raises the all-important question: why was it possible for an aircraft crew to take so many risks and ignore the very strict rules and legal principles of aviation for years without the authorities intervening? Is the fault possibly in the system?

Finger pointing: There are no pilot and project manager errors

Problems are very often obstacles in projects that need to be removed as quickly as possible in order to reach a harmonious target situation. However, in order to effectively solve a problem, it must first be recognized. This is only possible if an organization cultivates and lives an open problem and error culture and is willing to learn from discovered errors in order to improve in the long term. The aim is never to determine who's to blame through finger pointing, but rather to identify the errors in the system, analyze them and work together to find sustainable solutions that prevent the errors or problems from occurring again.

At Apps with love, the project managers are the pilots of the projects. They plan, coordinate, manage and communicate all project-related activities and are very often the direct point of contact for our customers. Project managers bear a great deal of responsibility for delivering projects on time, within budget, within scope and to a high quality standard. It's also important to offer customers a good experience and ensure their satisfaction. All of this is a fine art and, as I know from my personal experience, very demanding.

The idea for this blog post came from two directions. On the one hand, there was the published STSB report and the finger pointing in the media. On the other hand, at around the same time, several software errors were reported to us for an application that we had developed and that was in operation: around a month after the production release of a larger project, first one and then several critical software errors (so-called incidents) were discovered by the customer and reported to us. You can read about the individual steps involved in analyzing and resolving such incidents below or in detail on the support page.

Individual steps in analyzing and resolving an incident
- The software error or application malfunction is reported to us
- The help desk records the software error in the support and issue tracking tools
- The software error is then analyzed and reproduced
- The severity of the software error is determined
- The developer responsible is scheduled to rectify the software error
- The software error is reassessed and new specifications, designs or interfaces are required depending on the impact of the error; the software bug is fixed on the test environment
- The software bug is fixed and tested on a staging environment
- Tests and regression tests are carried out
- Depending on the type of software bug, new app releases are initiated for the app stores
- The new increment with the fixed software bug is delivered and accepted
- The bug fix is released on the production environment
- Tests and regression tests are carried out again
- Done!
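
The steps above can be read as a fixed sequence of stages that an incident passes through. The following is a minimal sketch of that idea, not a description of any real support tool: the stage names are hypothetical and condensed from the list, and the only rule enforced is that no stage may be skipped, which is exactly what tends to happen under time pressure.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Hypothetical stage names, condensed from the steps listed above."""
    REPORTED = 1
    RECORDED = 2
    ANALYZED = 3
    SEVERITY_SET = 4
    SCHEDULED = 5
    FIXED_ON_TEST = 6
    VERIFIED_ON_STAGING = 7
    ACCEPTED = 8
    RELEASED = 9
    DONE = 10


class Incident:
    def __init__(self, title: str) -> None:
        self.title = title
        self.stage = Stage.REPORTED
        self.history = [Stage.REPORTED]

    def advance(self, to: Stage) -> None:
        # Only the directly following stage is allowed, so no step can be skipped.
        if to != self.stage + 1:
            raise ValueError(f"cannot go from {self.stage.name} to {to.name}")
        self.stage = to
        self.history.append(to)


incident = Incident("Illustrative incident")
incident.advance(Stage.RECORDED)
incident.advance(Stage.ANALYZED)
print([stage.name for stage in incident.history])
```

Trying to jump straight from ANALYZED to RELEASED raises a ValueError, so the shortcut becomes visible instead of silent.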

Due to the often prevailing time pressure during the elimination of a software error in a productive system and the lack of time to systematically uncover the actual problem, often only the symptoms are eliminated. The actual cause of the problem remains, with the high probability that the same or a similar software error will occur again soon.

In our case, this is exactly what happened. A short time after we had patched and redelivered the software, a new, relatively similar error was reported to our support team. The game started all over again.

In order to find the cause of the problem, we decided to carry out a comprehensive root cause analysis. The intention behind this was to improve the quality of our projects and our organization in the long term (continuous improvement). A rather time-consuming, complex and sometimes emotional process. Maud Cottier, my former QM colleague, and I set about applying the methodology we both knew from our student days to a "real" project.

What's a root cause analysis (RCA)?

A root cause analysis is a methodical process for identifying the underlying problems of an event and therefore serves as a retrospective analysis of an incident. It analyzes when, how and why a problem arose. It's assumed that problems are always part of a cause-and-effect chain and that a problem can't be solved until the root cause has been eliminated.

This raises the following fundamental questions that should be answered by carrying out an RCA:

What were the causes of the problem(s) at hand?
What's the sequence and relationship between the causes?
Why was there an error in the first place?

Methods for carrying out a root cause analysis

There are various methods to choose from when carrying out a root cause analysis. The most common method is the 5-why or 5-why-question method supplemented with a visual representation of the causes in a cause-and-effect diagram.

5-Why-Method

The 5-Why method is a popular procedure for root cause analysis and a tool for analyzing problems and their causes. Together with protagonists, a localized problem is examined by asking "Why did this happen?" five or more times. The "Why?" questions (there can be more than five, the number is to be understood symbolically) should ultimately lead to the cause of the problem.

The 5 Why method has its origins in production control and is a quality assurance tool for analyzing the causes of problems in processes and systems, which can very often be linear, causal and complicated. The method was invented by Toyoda Sakichi and is now an integral part of the entire lean philosophy.

The 5 Why method is easy to apply and can be learned very quickly. The first step is to formulate the problem or issue in simple terms. The problem is then explained to the protagonists and the first "Why?" question is asked. The "Why?" questions are repeated until the root cause, i.e. the origin or root of the initial problem, has been identified.
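
The procedure just described can be sketched as a small data structure: a problem statement plus an ordered list of answers, where each "Why?" refers to the previous answer. This is only an illustration under assumptions. The class name `FiveWhys`, its methods and the example answers are made up and aren't taken from the incident discussed in this post.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FiveWhys:
    problem: str
    answers: list[str] = field(default_factory=list)

    def ask_why(self, answer: str) -> None:
        """Record the answer to the next "Why?" question."""
        self.answers.append(answer)

    def root_cause(self) -> str | None:
        # The most recent answer is treated as the current root cause candidate.
        return self.answers[-1] if self.answers else None

    def chain(self) -> list[str]:
        # Each question asks about the previous answer (or the problem itself).
        subjects = [self.problem] + self.answers[:-1]
        return [f"Why? {s} -> {a}" for s, a in zip(subjects, self.answers)]


# Made-up example, for illustration only.
session = FiveWhys("The app shows an empty list after the update")
session.ask_why("The API response no longer contains the expected field")
session.ask_why("The interface was changed without updating the app")
session.ask_why("The interface change wasn't part of the specification")
for line in session.chain():
    print(line)
print("Root cause candidate:", session.root_cause())
```

The list can grow to any length, which matches the point that the number five is symbolic.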

Visual representation of the causes with the cause-effect diagram

The cause-and-effect diagram can be used to cluster, structure and visualize potential causes for a problem that has arisen. The causes are identified and interrelated. One of the most frequently used forms of visualizing the causes of quality problems in quality management is the Ishikawa diagram, also known as the fishbone diagram, which can be used to present the causes and the associated relationships between them simply and clearly.

In our case, the content of the fishbone diagram is derived from the analysis of the incident and the results of the 5 Why method. The results of the analysis, i.e. the causes that may have led to the problem, are assigned to the individual categories (material, method, management, environment, machine, human) with the help of arrows. This results in small branches within these categories, which are also called fish bones.
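
Assigning causes to the six categories can also be sketched in code. The following is a minimal sketch that groups causes by category and prints a plain-text outline instead of drawing a diagram. The function names `build_fishbone` and `render` and the example causes are assumptions made for illustration. Only the six category names come from the text above.

```python
CATEGORIES = ("material", "method", "management", "environment", "machine", "human")


def build_fishbone(problem, causes):
    """Group (category, cause) pairs under the six fixed categories."""
    bones = {category: [] for category in CATEGORIES}
    for category, cause in causes:
        if category not in bones:
            raise ValueError(f"unknown category: {category}")
        bones[category].append(cause)
    return {"problem": problem, "bones": bones}


def render(fishbone):
    """Return a plain-text outline; empty categories are left out."""
    lines = [f"Problem: {fishbone['problem']}"]
    for category, causes in fishbone["bones"].items():
        if causes:
            lines.append(f"  {category}")
            lines.extend(f"    - {cause}" for cause in causes)
    return "\n".join(lines)


# Made-up causes, for illustration only.
fishbone = build_fishbone(
    "Critical software errors after release",
    [
        ("method", "Regression tests didn't cover the affected feature"),
        ("management", "Release date fixed before testing was complete"),
        ("method", "Interface change wasn't documented"),
    ],
)
print(render(fishbone))
```

Rejecting unknown categories forces every cause to be placed on one of the agreed bones, which keeps the clustering comparable across workshops.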

Goals of the root cause analysis of the incidents that have occurred

Before we started with the root cause analysis, we asked ourselves once again what exactly we wanted to achieve by carrying out the RCA. To do this, we took another look at the initial situation: Around a month after the productive release of a major project, first one and then several critical software errors were discovered by the customer and reported to us.

We asked ourselves the following questions and wanted to answer them:

What factors led to Apps with love delivering a non-functioning software increment?
What can Apps with love improve in the long term to prevent such an incident?
Can we reduce the risk for us as a company in the long term through root cause analysis?

Procedure and instructions for carrying out a root cause analysis

So how did we go about carrying out the root cause analysis in our example?

Step 1: Decision to carry out the RCA. A joint decision was made to carry out an RCA with the aim of continuously improving the organization and establishing unusual methods for improvement. The RCA project team and the assignment were defined.

Step 2: Defining the RCA project remit. It was important to define the project remit for carrying out the RCA, set clear objectives, set a time frame, differentiate it from other projects running at the same time and carry out the RCA as independently as possible. The team and the protagonists, the tools and instruments for analyzing and the method to be used were defined.

Step 3: Development of the data basis and chronological documentation of all information. Step 3 was the most time-consuming step, but also the most important. All data had to be documented for further analysis.

So here's what we did:

- Chronological documentation of the reported incidents. In some cases, missing information on the incidents had to be reconstructed and supplemented through discussions with individual project employees. Once again, we realized how important continuous documentation is for traceability.
- Analyzing the tasks carried out and developed and comparing them with the requirements for the product
- Analyzing discussions and comments in our communication tool Slack
- Reverse engineering and analyzing software errors that had already been fixed and their subsequent documentation
- Analyzing and chronologically documenting the software releases and deployments across the project phases and the various development environments
- Analyzing the project retrospectives and project debriefings

The reappraisal helped us to isolate and delimit the problems.
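
The core of this step is bringing information from several sources into one chronological order. The following is a minimal sketch of that idea. The `Event` type, the source names and all timestamps are hypothetical and only show the shape of such a timeline, not the data we actually collected.

```python
from datetime import datetime
from typing import NamedTuple


class Event(NamedTuple):
    timestamp: datetime
    source: str
    summary: str


def build_timeline(*sources):
    """Merge events from several sources into one chronological list."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event.timestamp)


# Made-up events, for illustration only.
incidents = [Event(datetime(2000, 1, 20, 9, 0), "support", "Incident reported")]
deployments = [
    Event(datetime(2000, 1, 10, 14, 0), "deployment", "Release to production"),
    Event(datetime(2000, 1, 21, 16, 30), "deployment", "Patch to production"),
]
chat = [Event(datetime(2000, 1, 9, 11, 15), "chat", "Open question about an interface")]

for event in build_timeline(incidents, deployments, chat):
    print(event.timestamp.isoformat(), event.source, event.summary)
```

Seeing a chat message, a release and an incident report side by side in one list is often what makes the relationship between them visible in the first place.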

Step 4: Conducting workshops with individual protagonists of the project and jointly analyzing the causes using the 5 Why method. In step 3, we were able to categorize the problems thanks to the detailed data basis. This was the basis for the workshops with the individual protagonists. The 5 Why method was used to go through the localized problems individually and to describe and document the causes from the perspective of the protagonists at a meta level. It's important to note that the 5 Why method doesn't say that the cause is clear after the fifth "Why?". It may well be that 7 or more "Why?" questions are needed.

Step 5: Clustering all causes using the fishbone diagram. After the workshops with the protagonists, the causes could be clustered at the meta level and visualized using the fishbone diagram in order to obtain a holistic overview of the causes that contributed to the events.

Step 6: Deriving recommendations and possible measures for improvement. In the next step, the first recommendations and measures could be formulated from the documentation of the 5-Why workshops, the recognized causes and the visual representation of these. It was exciting to see that the first measures were already initiated and tackled by the project team during the implementation of an RCA, before the RCA was actually finalized. Something that seems to happen more often when RCAs are carried out.

Step 7: Presentation of the measures to the project team and then to the individual specialist departments. An important part of the RCA is the presentation of the recommendations and possible measures to the individual specialist departments and teams and their handover for further processing by them.

Step 8: Review implementation of recommendations and measures. It's important to review the implementation of recommendations and measures at regular intervals, enquire about the status and continuously document solutions for improvement. Processes are very often improved and adapted based on the recommendations of an RCA.

Conclusion: Continuous improvement of software development processes with root cause analysis

In order to improve quality within an organization, the real causes of a problem must be identified, analyzed and addressed. Techniques such as the 5-Why method and the fishbone diagram are used to sustainably eliminate errors and optimize processes and workflows.

The decision to carry out a root cause analysis after an incident must be well thought out. It isn't always worth the effort, especially in software development, where many things are always completely new. In our case, the decision to carry out an RCA was quickly clear, as further software errors were reported after the initial rectification of the reported software errors.

By carrying out the RCA, we were not only able to establish a new methodology within the company, but also to identify some of the causes that led to the software errors. In a further step, these causes are now leading to some internal processes, associated workflows and process documentation being reconsidered and adapted. So the work isn't yet done after the RCA 😉.

Martin Mattli

Head of Operations & Quality Management
