All human endeavors face uncertainty. When operating a complex enterprise, we face uncertainty in economic forces, equipment reliability, policy, weather, and innumerable other areas. For engineering systems, manufacturing operations, supply chains, and automated digital processes, people have found that categorizing the likelihood of various events, and the severity of their outcomes, sheds light on business risks, safety concerns, and even fundamental strategic flaws of the business. One popular and effective method of gaining these insights is the Failure Modes, Effects, and Criticality Analysis (FMECA), often performed in its simpler form as an FMEA (the same analysis, but without the criticality step).
The FMECA was born in 1969 when the U.S. Department of Defense published MIL-STD-785, “Reliability Program for Systems and Equipment: Development and Production.” This standard defined the FMECA and other methods of evaluating and quantifying equipment reliability. The description of the FMECA's purpose is remarkably simple: to identify potential design weaknesses through systematic, documented consideration of the following: all likely ways in which a component or equipment can fail; causes for each mode; and the effects of each failure.
Since 1969, a standard dedicated to performing FMECAs has been published (MIL-STD-1629, originally published in 1974 and updated in 1980 as MIL-STD-1629A). The purpose of the FMECA remains unchanged, and the concept can be implemented very simply. It begins with an FMEA (i.e., without the criticality), whose goal is to identify the failure modes, their causes, mitigating provisions, and the severity of the outcomes of each failure. The greatest aid in performing the FMEA provided in MIL-STD-1629A comes in the form of…well, a form:
The criticality analysis (CA) then takes outputs from the FMEA and combines the severity of the outcomes with the likelihood of their occurrence. The performance of the criticality analysis is where I would like to focus.
This concept of criticality, coming from a combination of severity and likelihood, is not unique to activities formally defined as FMECAs. It is simple, easy to communicate graphically, can be applied almost anywhere, and is very effective in highlighting risks. Many business and government activities use the same type of thinking for operational risk management decisions. Many of them also do it incorrectly! Here is an example of the concept in active use, the basic risk assessment matrix from the U.S. Navy’s Operational Risk Management instruction:
This example gives a simple way of interpreting the interaction of probability and severity. It is ambiguous in terms of the magnitudes of severity and probability because it is intended for adaptation to a wide range of operational risk management activities (including the development of FMECAs).
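To make the idea of combining severity and probability concrete, here is a minimal sketch of a risk matrix as a lookup. The category names, the rank-multiplication rule, and the thresholds are all hypothetical choices for illustration; they are not taken from the Navy instruction or from any standard, and multiplying ordinal ranks is a simplification that a real program would replace with its own agreed matrix.

```python
# Hypothetical categories, ordered from lowest to highest.
# These labels are illustrative only, not the Navy's.
SEVERITY = ["negligible", "moderate", "critical", "catastrophic"]
PROBABILITY = ["unlikely", "seldom", "likely", "frequent"]


def risk_level(severity: str, probability: str) -> str:
    """Combine a severity and a probability category into a coarse risk level.

    Each category is converted to a rank from 1 to 4, and the ranks are
    multiplied. The thresholds below are assumptions chosen for the sketch.
    """
    severity_rank = SEVERITY.index(severity) + 1
    probability_rank = PROBABILITY.index(probability) + 1
    score = severity_rank * probability_rank  # ranges from 1 to 16

    if score >= 12:
        return "high"
    if score >= 6:
        return "medium"
    return "low"


if __name__ == "__main__":
    for sev in SEVERITY:
        row = [risk_level(sev, prob) for prob in PROBABILITY]
        print(f"{sev:>12}: {row}")
```

Note that the lookup says nothing about what "likely" or "seldom" means in practice. That gap is the subject of the rest of this article.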
The most frequent error I encounter with FMEAs and criticality analyses is a mischaracterization of event likelihood (and the example above, in the wrong hands, can lead to it). Organizations want to provide a simple means of gauging risk, so they ask those producing or interpreting FMECAs to use a severity and likelihood scale that looks something like the Navy’s chart above. Often, to aid them further, they provide something more detailed, which may look like the example below:
This can help a great deal because operators can more easily gauge severity when they have reference points like these. Experienced professionals have likely seen enough during their time on the job to make a judgment about the likely results of a certain type of issue, and they can appropriately place the outcome in a severity scale if they are given these reference points. A similar scale for probability often looks like this:
An operator will approach the probability scale similarly, drawing on experience to estimate whether an event described in the FMEA is likely or not. However, with the probability scale provided above, we have allowed the operator to make a serious logical error: we have let them ignore the fact that the likelihood scale is not defined per any unit of time. As a result, each evaluator adopts whatever time frame makes sense to them (perhaps the length of a career, perhaps the length of a shift, perhaps something else, like the length of a busy season). Daniel Kahneman shows, in Chapter 11 of Thinking, Fast and Slow, how anchors can influence the outcome of a probability assessment made with a scale like this. Whatever the evaluator thinks is “unlikely” will probably affect the way they assess the probability of other events. If, instead, the likelihood ratings were provided as probability per unit time, or probability per operational event, we could have greater confidence in the relative importance of the FMECA’s conclusions, since these give concrete ways of assessing how often something actually occurs. Without defining the probabilities in this way, we really do not know how the team will assess the risks, and as a result we really do not know what our FMECA is telling us.
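A short calculation shows how much the unstated time frame matters. The sketch below assumes a constant occurrence rate (a Poisson model), which is a simplifying assumption of mine and not something the scales above specify. The event rate of once every five years, the three time frames, and the function name are all hypothetical.

```python
import math

HOURS_PER_YEAR = 8760.0


def prob_at_least_one(rate_per_year: float, window_hours: float) -> float:
    """Probability of at least one occurrence within the window.

    Assumes occurrences arrive at a constant average rate (Poisson model),
    so P = 1 - exp(-rate * time).
    """
    expected_events = rate_per_year * window_hours / HOURS_PER_YEAR
    return 1.0 - math.exp(-expected_events)


# Hypothetical event: happens on average once every five years.
rate = 1.0 / 5.0

# Three time frames an evaluator might silently adopt.
windows = {
    "8-hour shift": 8.0,
    "90-day busy season": 90 * 24.0,
    "30-year career": 30 * HOURS_PER_YEAR,
}

for label, hours in windows.items():
    print(f"{label:>20}: {prob_at_least_one(rate, hours):.4%}")
```

Under these assumptions the same event comes out at roughly 0.02 percent over a shift, about 5 percent over a busy season, and more than 99 percent over a career. Three evaluators could honestly rate it "unlikely", "seldom", and "frequent" without disagreeing about anything except the time frame.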
And this is not a new concept. The original incarnation of the FMECA made the same exhortation I do today. MIL-STD-785 gives the following guide for assessing likelihood:
For every probability level, the estimate is explicitly defined over the item operating time. For a system in continuous operation, one could also define likelihood of failure per day, per year, or per event. But if you leave it open-ended, you imply that we are estimating the likelihood of something happening between now and eternity…for which the probability of anything is very near 100 percent!
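The "now and eternity" point can be made precise under one simple assumption that is mine rather than the standard's: that failures occur at a constant average rate, written here as a hypothetical symbol \lambda. The probability of at least one failure within a time window of length t is then:

```latex
P(\text{at least one failure in } [0, t]) = 1 - e^{-\lambda t},
\qquad
\lim_{t \to \infty} \left( 1 - e^{-\lambda t} \right) = 1
\quad \text{for any } \lambda > 0 .
```

However small the rate, the probability approaches 1 as the window grows, which is why a likelihood rating only carries information once t is fixed.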
Appreciating this concept is essential to the completion of the FMECA. The entire purpose of the exercise is to assess the potential impacts with appropriate weighting for their severity and their likelihood. It is only natural that the most severe events will be less likely. If that were not true, you would go find a different business to operate! Higher severity makes accurate prediction of low likelihoods more critical to the quality of the insights derived from the FMECA. Kahneman also shows (as does Nassim Nicholas Taleb in The Black Swan) that humans are notoriously guilty of under-appreciating rare events. So, for these higher-severity events, it is even more important that the FMEA/FMECA developer be given a good scale for probabilities, to increase the chances of a good assessment.
In closing, by giving a set of suggested probability values on a per-unit-time basis, we are more likely to get the FMEA/FMECA developer to seek out impartial knowledge of probabilities, or at least to assess them more carefully. Even an experienced operations manager is unlikely to know how frequently a certain type of issue causes an outage or results in equipment damage. To faithfully perform the FMEA/FMECA, that manager would need to research historical failures or find industry data. Organizations can often learn of important systemic vulnerabilities through FMEA/FMECAs, which often do not require a great investment to develop. But to get the most out of them, they have to be done right, and that starts with providing tools that make sense, including a properly defined probability scale.