
Security Through Resilience

By Andrew Kirsch | April 7, 2020

As a Senior Software Engineer on the Application Security team at Rally, I have a good sense of the small practices that make applications secure. I spend a great deal of time evaluating the security of systems to determine where effort is best spent making them stronger. But when I looked to the risk measurement space for answers about where to deploy resources, what I found felt like doing statistics on averages, or making up numbers so you can feel like you are making data-driven decisions.

I had come across the book The Field Guide to Understanding ‘Human Error’, and on reading it I started to see parallels between the way human factors engineering (the discipline of designing built things to account for the people who use them) looks at safety and the security metrics I was thinking about. Both deal with a large degree of uncertainty stemming from the difficulty of measuring the problems. In both fields, the absence of a problem does not mean a human was taking steps to ensure that outcome; conversely, just because something went wrong does not mean a human did something wrong. Safe and secure are both states you strive to be in relative to your environment, not a set of actions you take. Individuals can take actions that help achieve those states.

Safety and security are not just the outer shell of initial prevention or defense in depth; they also encompass being able to respond once you have new information. In software, that can mean slack in the schedule to ship a security fix; in safety, it could mean everyone feeling empowered to stop work to deal with a potentially unsafe situation. More interestingly, it means investments in CI/CD can also be seen as security investments, since they let you ship a patch faster and with more confidence.
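To make that concrete, below is a minimal sketch of the kind of automated gate a CI/CD pipeline might run before promoting a patch. Everything in it is hypothetical (the staging host, the endpoint paths, the script itself); the point is that having a check like this already in place is part of what lets an urgent security fix ship the same day with confidence.

    #!/usr/bin/env python3
    """A minimal post-deploy smoke check of the sort a CI/CD pipeline
    might run before promoting a security patch. The host and paths
    are invented for illustration."""
    import sys
    import urllib.request

    SERVICE_URL = "https://staging.example.com"  # hypothetical staging host

    SMOKE_PATHS = [
        "/healthz",        # the process is up and serving
        "/api/v1/status",  # core dependencies are wired correctly
    ]

    def is_healthy(path: str) -> bool:
        try:
            with urllib.request.urlopen(SERVICE_URL + path, timeout=5) as resp:
                return resp.status == 200
        except OSError:  # URLError, HTTPError, and timeouts all land here
            return False

    def main() -> int:
        failures = [p for p in SMOKE_PATHS if not is_healthy(p)]
        for path in failures:
            print(f"smoke check failed: {path}", file=sys.stderr)
        # A non-zero exit stops the pipeline before the patch ships.
        return 1 if failures else 0

    if __name__ == "__main__":
        sys.exit(main())

A gate like this is not a security control by itself; it is the confidence mechanism that makes shipping the actual security fix fast.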

The concepts of safety and security both look a lot like the concept of resilience. All three revolve around the idea that the absence of a bad outcome does not mean you were in control of it. Resilience is hard to define because it is so abstract. The definition I like is “the ability of a system or team or individual to recognize, adapt to, and absorb disruptions that fall outside the design or preparation base, and to sustain or even improve its functioning.” The key phrase in that definition is ‘outside the design’. If you expected a problem, you would specifically account for it. Running multiple instances of an application because any one of them can crash is not resilience; that is anticipating a particular kind of problem and preparing for it. As mentioned above, CI/CD can be part of resilience, since it gives you the capacity to make other changes, for whatever reason, faster and more reliably.

Another definition of resilience is proactive activities aimed at preparing to be unprepared. This puts more agency into achieving resilience than the first definition does. That agency means a technical system cannot be resilient in isolation from the people who are responsible for it. Some aspects of a system contribute to resilience regardless of context; quality documentation or a good automated test suite improves resilience a little, since either helps a new person make a change. But people are still the key to creating and maintaining resilience. For example, the documentation on a project should be reviewed thoroughly long before an application starts to have trouble.

The big difference among all of these concepts is that resilience is about the unknown unknowns, whereas most safety and security activities focus on the known knowns and known unknowns. The ‘unknown unknowns’ quote, while often the subject of jokes, captures a real distinction: between roughly knowing what could go wrong and the sheer breadth of what could actually happen. And the ability to adapt to something that could not be foreseen still helps you respond to things that could have been. Just reacting in the moment is never the plan you want; having a better response ready always helps.

The most important takeaway from this train of thought is that many good engineering practices tie back into security through their effect on resilience. Automated testing, decoupling, and having a system free of ‘normal’ errors all empower resilience. None of them is sufficient on its own, but all of them are probably prerequisites. Often, the number one missing piece for resilience is engineering time, so organizations should consistently invest in the capabilities that build it. Strong engineering organizations have the capacity and knowledge to deal with both anticipated and unanticipated issues as they arise.
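To make one of those practices concrete, here is a small, deliberately artificial sketch of decoupling; the names in it are invented for this example. Because the request handler depends only on an interface, a verifier that turns out to be flawed in some unforeseen way can be patched or swapped without touching its callers.

    from typing import Protocol

    class TokenVerifier(Protocol):
        """The interface the handler depends on; implementations can vary."""
        def verify(self, token: str) -> bool: ...

    class AllowListVerifier:
        """One hypothetical implementation: check a fixed allow-list."""
        def __init__(self, allowed: set[str]) -> None:
            self._allowed = allowed

        def verify(self, token: str) -> bool:
            return token in self._allowed

    def handle_request(token: str, verifier: TokenVerifier) -> str:
        # The handler knows only the TokenVerifier interface, so a flawed
        # or compromised implementation can be replaced without changing
        # any of the code that calls it.
        return "ok" if verifier.verify(token) else "denied"

    if __name__ == "__main__":
        verifier = AllowListVerifier({"s3cret-token"})
        print(handle_request("s3cret-token", verifier))  # ok
        print(handle_request("bogus", verifier))         # denied

None of this anticipates any particular failure; the looseness of the coupling is what leaves room to adapt when an unanticipated one arrives.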

From here, go forth and learn more about resilience to see if these ideas resonate. I found this talk by John Allspaw a great introduction to the topic, since it looks at resilience through the lens of software. Past that, the STELLA Report covers several interesting cases and tries to draw out the commonalities among them, and this paper on complex systems failure is also enlightening.

Andrew Kirsch
