Nancy Leveson is professor of Aeronautics and astronautics at MIT. She is one of the worlds’ leading researchers on safety, a very serious researcher. I’m using some of the techniques she has developed analyzing complex systems for safety. Her papers are often interesting, but the title of her latest paper blew my mind when I read it:
I was born in 1969 and grew up during the cold war. One of the dangers we feared was that a mistake would happen, a bomb would detonate over Russia, Europe or the US, and uncontrolled retaliation would end the world. Dr. Strangelove, the movie, immortalized this scenario.
Take a deep breath if you watched the trailer above before reading on.
Leveson is not fearful. She has produced the paper for a workshop on systems and strategy stability and she looks back at why this horror scenario didn’t occur:
“The most successful complex systems in the past were simple and used rigorous, straightforward processes. Prevention of accidental detonation of nuclear bombs, for example, used a brilliant approach involving three positive measures […] and reliance on simple mechanical systems that could provide ultra-high assurance. Although there were a few incidents over a long period […] inadvertent detonation did not occur in those cases.”
The question she raises in her paper is whether we are still safe? Well, things are changing:
“The more recently introduced software-intensive systems have been much less reliable. […] More recently, our ability to provide highly trustworthy systems has been compromised by gratuitous complexity in their design and inadequate development and maintenance processes. For example, arguments are commonly made for using development approaches like X-treme Programming and Agile that eschew the specification of requirements before design begins.”
Yes, Leveson is a critic of the popular, modern development paradigms most of us has learned to love. Have we mistakenly stopped worrying?
I met her at a conference on STAMP/STPA in Iceland in 2017, and during a conversation which I was so fortunate to have with her in the lobby of the Iceland University she made herself very clear about her skepticisms towards Agile. But Agile is not the only problem:
“Providing high security is even more problematic. Again, only the most basic security techniques, such as providing an air gap to isolate critical systems, have been highly successful. The number of intrusions in today’s systems is appalling and unacceptable. Clearly what we are doing is not working.”
Leveson suggests a paradigm shift and suggests what the shift can look like. In the paper she discusses systems theory, and how approaches like those she describes in her 2010 book Engineering a Safer World can be useful.
At the core of innovation in IT is someone getting the idea of connecting existing services and data in new ways to create new and better services. The old wisdom behind it is this:
The Whole is Greater than the Sum of its parts
There is a flipside to this type of innovation that the opposite is also true: The whole can become more problematic than the negative sums of all the known risks.
My experience as a tester and test manager is that projects generally manage risks in individual subsystems and components quite well.
But I have on occasions found that we have difficulty imagining and properly taking care of things that might go wrong when a new system is connected to the infrastructure, subjected to real production data and actual business processes, and exposed to the dynamics of real users and the environment.
Safety, Accidents and Software Testing
Some years ago, I researched and came across the works of Dr. Nancy Leveson and found them very interesting. She is approaching the problem of making complex systems safe in a different way than most.
Leveson is professor of aeronautical engineering at MIT and author of Safeware (1994) and Engineering A Safer World (2011).
In the 2011 book, she describes her Systems-Theoretic Accident Model and Process – STAMP. STAMP gives up the idea that accidents are causal events and instead perceives safety as an emergent property of a system.
I read the book a while ago, but has only recently managed to begin the transformation of her ideas to software testing.
It actually took a tutorial and some conversations with both Dr. Leveson and her colleague Dr. John Thomas at the 5th European STAMP/STPA workshop in Reykjavik, Iceland in September to completely wrap my head around these ideas.
I’m now working on an actual case and an article, but have decided to write this blog as a teaser for other testers to look into Leveson’s work. There are quality resources freely available which can help testers (I list them at the end of this blog).
The part of STAMP I’m looking at is the STPA technique for hazard analysis.
According to Leveson, hazard analysis can be described as “investigating an accident before it occurs”. Hazards can be thought of as a specific type of bug, one with potentially hazardous consequences.
STPA is interesting to me as a tester for a few reasons:
As an analysis technique, STPA helps identify potential causes of complex problems before business, human, and societal assets are damaged.
One can analyze a system and figure out how individual parts need to behave for the whole system to be safe.
This means that we can test parts for total systems safety.
It works top-down and does not require access to knowledge of all implementation details.
Rather, it can even work on incomplete models of a system that’s in the process of being built.
To work, STPA requires a few assumptions to be made:
The complete system of human and automated processes can be modeled as a “control model”.
A control model consists of interconnected processes that issue control actions and receive feedback/input.
Safety is an emergent property of the actual system including users and operators, it is not something that is “hardwired” into the system.
I’d like to talk a bit about the processes and the control model. In IT we might think of the elements in the control model as user stories consisting of descriptions of actors controlling or triggering “something” which in turn produce some kind of output. The output is fed as input either to other processes or back to the actor.
The actual implementation details should be left out initially. The control structure is a mainly a model of interconnections between user stories.
Given the control model sufficiently developed, the STPA analysis itself is a two step activity where one iterates through each user story in the control structure to figure out exactly what is required from them individually to make the whole system safe. I won’t go into details here about how it works, but I can say that it’s actually surprisingly simple – once you get the hang of it.
Dr. John Thomas presented an inspiring tutorial on STPA at the conference.
Safety in IT
I have mentioned Knight Capital Group’s new trading algorithm on this blog before as it’s a good example of a “black swan project” (thanks to Bernie Berger for facilitating the discussion about it at the first WOTBLACK workshop).
Knight was one of the more aggressive investment companies in Wall Street. In 2012 they developed a new trading algorithm which was tested using a simulation engine. However, the deployment of the algorithm to the production environment turned out to be unsafe: Although only to be used in testing, the simulation engine was deployed and started in production resulting in fake data being fed to the trading algorithm. After 45 minutes of running this system on the market (without any kind of monitoring), Knight Capital Group was bankrupt. Although no persons were harmed, the losses were massive.
Commonly only some IT systems are considered “safety critical” because they have potential to cause harm to someone or something. Cases like that of Knight Capital indicate to me that we need to expand this perspective and consider safety a property of all systems that are considered critical to a business, society, the environment or individuals.
Safety is a relevant to consider whenever there are risks that significant business, environmental, human, personal or societal assets can be damaged by actions performed by a system.
STAMP/STPA and the Future of Testing
So, STPA offers a way to analyze systems. Let’s get this back to testing.
Software testing relies fundamentally on testers’ critical thinking abilities to imagine scenarios and generate test ideas using systematic and exploratory approaches.
This type of testing is challenged at the moment by
Growing complexity of systems
Limited time to test
Problems performing in-depth, good coverage end-to-end testing
DevOps and CD (continuous delivery) attempts to address these issues, but they also amplify the challenges.
I find we’re as professional testers more and more often finding ourselves trapped into frustrating “races against the clock” because of the innovation of new and more complex designs.
Rapid Software Testing seems the only sustainable testing methodology out there that can deal with it, but we still need to get a good grip on the complexity of the systems we’re testing.
Cynefin is a set of theories which are already helping testers embrace new levels of complexity in both projects and products. I’m actively using Cynefin myself.
STAMP is another set of theories that I think are worth looking closely at. Compared to Cynefin, STAMP embraces a systems theoretical perspective and offers processes for analyzing systems and identify component level requirements that are necessary for safety. If phrased appropriately, these requirements are direct equivalents of test ideas.
STAMP/STPA has been around for more than a decade and is already in wide use in engineering. It is solid material from one of the worlds’ leading engineering universities.
At the Vrije Universiteit in Amsterdam, the Netherlands they have people taching STPA to students in software testing.
The automobile industry is adopting STPA rapidly to manage the huge complexity of interconnected systems with millions of lines of code.