Machine learning and AI applications are advancing into increasingly critical domains such as medicine, aviation, banking, and finance. These applications are shaping not only the way industries operate, but also how people interact with and use their platforms and technologies.
Given that, it is fundamentally important that the engineering culture in ML/AI incorporate and adapt principles from other fields of engineering regarding both the reliability and the robustness of the solutions it develops. Understanding causal aspects helps to explain failures and can reduce the risk of unavailability or misbehavior in ML systems.
There is a myriad of information on virtually any technical area on the internet. With all the hype around ML and its increasing adoption, this information materializes in the form of tutorials, blog posts, Slack forums, MOOCs, Twitter, and other sources.
However, a more attentive reader may notice a certain pattern in many of these stories: most of the time they are cases of something that (a) worked extremely well, (b) generated revenue for the company, (c) saved X% in terms of efficiency, and/or (d) was one of the greatest technical wonders ever built.
This yields claps on Medium, posts on Hacker News, articles on major technology portals, technical blog posts, papers upon papers on arXiv, talks at conferences, and so on.
Before going further, I want to say that I am very enthusiastic about the idea that “intelligent people learn from their own mistakes, and wise people learn from others’ mistakes”.
But, after all, what does all this have to do with the failures that happen, and why is it important to understand their contributing factors?
The answer is: because your team/solution first has to survive catastrophic situations for the success story to exist at all. And with survival as a motivation for increasing the reliability of teams and systems, understanding errors becomes an attractive way of learning.
And when there are scenarios of minor violations, suppression of errors, lack of procedures, malpractice, recklessness or negligence, things go spectacularly wrong, as in the examples below:
- Amazon: The data on a load balancer was deleted, causing an outage across virtually an entire AWS region at the time;
- Gitlab: The deletion of a production database led to 18 hours of unavailability and the loss of customer data;
- Knight Capital: The lack of a code-review culture allowed an engineer to deploy code containing a business rule that had been obsolete for 8 years, and this led the company to lose $172,222 per second for 45 minutes (about US$465 million in total). The final investigation can be found on the SEC’s website;
- European Space Agency: The conversion of a 64-bit floating-point number to a 16-bit integer caused an overflow in the rocket’s steering system, triggering a chain of events that destroyed the rocket at a loss of more than $370 million; and
- NASA: A degradation from an engineering culture to a design/political culture, together with O-ring problems, led to a catastrophic failure that not only cost billions of dollars but also claimed the lives of the crew. This cultural degradation is examined in Diane Vaughan’s excellent book, The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA, Enlarged Edition.
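To make the Ariane failure mode concrete, here is a minimal Python sketch of how narrowing a 64-bit floating-point value into a 16-bit signed integer can silently wrap around. The variable names and the value are illustrative, not taken from the flight software (in the actual Ada code, the out-of-range conversion raised an unhandled exception rather than wrapping):

```python
# Hypothetical horizontal-velocity value; a 64-bit float easily exceeds
# the 16-bit signed range of -32768..32767.
velocity = 40000.0

# Emulate the narrowing conversion: truncate to an integer, then keep only
# the low 16 bits, interpreted as two's complement.
truncated = int(velocity)
wrapped = ((truncated + 0x8000) & 0xFFFF) - 0x8000

print(wrapped)  # a large negative number: the value has silently wrapped
```

Whether the conversion wraps, saturates, or raises depends on the language and its runtime checks; each of those behaviors is a different hole in a different slice of cheese.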
The Swiss Cheese Model
In aviation, for each catastrophic event that occurs, there is a thorough investigation to understand what happened and then address the contributing and determining factors, so that the same catastrophic event never happens again.
In this way, aviation ensures that, by applying what was learned from each catastrophic event, the entire system becomes more reliable. It is no accident that, even with the increase in the number of flights (39 million in 2019), the number of fatalities has been falling year after year.
One of the most used tools in the investigation of aircraft accidents for the analysis of risks and causal aspects is the Swiss Cheese Model.
This model was created by James Reason in the article “The contribution of latent human failures to the breakdown of complex systems”, where the framework was built (though without direct reference to the term). Only in the later paper “Human error: models and management” does the model appear more explicitly.
A way to visualize this alignment can be seen in the graph below:
That is, each slice of Swiss cheese is a line of defense made up of engineered layers (e.g., monitoring, alarms, locks on pushing code to production, etc.) and/or procedural layers that involve people (e.g., cultural aspects, training and qualification of committers to the repository, rollback mechanisms, unit and integration tests, etc.).
Still, according to the author, each hole in one of the cheese slices arises from two factors: active failures and latent conditions, where:
- Latent conditions are situations intrinsically resident within the system; they are consequences of design and engineering decisions, of whoever wrote the rules or procedures, and even of the highest hierarchical levels in an organization. Latent conditions can lead to two types of adverse effects: error-provoking situations and the creation of vulnerabilities. In other words, the solution has a design that increases the likelihood of high-negative-impact events, which amounts to a contributing causal factor.
- Active failures are unsafe acts or minor transgressions committed by people in direct contact with the system; these acts can be mistakes, lapses, distortions, omissions, errors, and procedural violations.
In our Swiss cheese, each slice is a layer or line of defense comprising aspects such as systems architecture and engineering, the technology stack, specific development procedures, the company’s engineering culture, and, finally, people as the last safeguard.
The holes, in turn, are the flawed elements in each of these layers of defense, which can be active failures (e.g., committing directly to master because there is no code review) or latent conditions (e.g., flaws in an ML library, lack of monitoring and alerting).
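A back-of-the-envelope way to see why stacking slices helps is to treat each layer as an independent filter. This is an illustration of the model, not something from Reason's papers, and the probabilities below are made-up assumptions:

```python
from functools import reduce

# Hypothetical probability that each defense layer has a "hole" that lets
# a given hazard pass through (all values are illustrative assumptions).
hole_probability = {
    "code_review": 0.05,
    "unit_and_integration_tests": 0.10,
    "monitoring_and_alerting": 0.20,
    "rollback_procedures": 0.15,
}

# If the layers fail independently, a catastrophic event requires the holes
# in every slice to line up, so the probabilities multiply.
p_all_holes_align = reduce(lambda acc, p: acc * p, hole_probability.values(), 1.0)

# Removing any single layer raises the risk by more than an order of magnitude.
p_without_review = p_all_holes_align / hole_probability["code_review"]
```

The independence assumption rarely holds in practice: a latent condition such as a weak engineering culture punches holes in several slices at once, which is exactly what makes latent conditions so dangerous.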
In an ideal situation, after an event of unavailability, all latent conditions and active failures would be addressed, and there would be an action plan to solve the problems so that the same event would never happen again.
Of course, there is no panacea in risk management: some risks and problems can be tolerated, and often the time and resources needed to apply the necessary fixes simply do not exist.
Understanding the contributing and determining factors in failure events can help eliminate or minimize potential risks and, consequently, reduce the impact of the chain of consequences of these events.
To understand more about the inherent risks with ML operations and get some practical and actionable insights to mitigate them, consider attending my talk at the ODSC Europe Session from September 17th – 19th, “Machine Learning Operations: Latent Conditions and Active Failures.”
Reason, James. “The contribution of latent human failures to the breakdown of complex systems.” Philosophical Transactions of the Royal Society of London. B, Biological Sciences 327.1241 (1990): 475-484.
Alahdab, Mohannad, and Gül Çalıklı. “Empirical Analysis of Hidden Technical Debt Patterns in Machine Learning Software.” International Conference on Product-Focused Software Process Improvement. Springer, Cham, 2019.
About the author: Flavio Clesio
Flavio works in information technology and has acquired experience at companies across different businesses, with most of these roles performed in parallel over the course of his career. He currently works as head machine learning engineer in a core machine learning chapter, building embedded algorithms for telecommunication business platforms for mobile products.