
[Image credit: BrianAJackson]
As CrowdStrike’s CEO prepares to testify before Congress, one thing is glaringly evident – this unprecedented outage was preventable, and we should collectively learn from it.
The cause of the most significant IT outage in history was a cascade of failures in both testing and deployment. The technical bugs in the testing and the client-side interpreter code are one area for improvement; the process failures that propagated the bad update so widely and quickly are another. Both functional areas need to be addressed to ensure we don’t have to endure an outage of this magnitude again.
“Never waste a good crisis” is a mantra to live by in the world of IT and cybersecurity.
The Code and Testing Bugs:
CrowdStrike’s preliminary after-action report details that the initial cause of this incident was a bug in its content validation testing. That testing bug allowed a bad channel update to be released to Windows hosts running the CrowdStrike Falcon sensor.
When the sensor interpreted the bad channel file, it performed an out-of-bounds memory read, crashing each system to the familiar and dreaded Blue Screen of Death (BSOD) and leaving it stuck in a boot loop.
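To make the failure mode concrete, here is a toy sketch in Python, emphatically not CrowdStrike’s code or file format: an interpreter that assumes every content record carries a fixed number of fields. The field count and comma-separated layout are invented for the illustration. In memory-safe Python the malformed record raises a catchable exception; in kernel-mode C the same missing bounds check becomes an out-of-bounds memory read, and an unhandled fault in the kernel takes the whole machine down.

```python
# Toy illustration only; not CrowdStrike's code or file format.
EXPECTED_FIELDS = 21  # hypothetical field count the interpreter assumes

def interpret_channel_record(record_line: str) -> str:
    """Parse one content record and return its last expected field."""
    fields = record_line.split(",")
    # No bounds check: blindly assumes the update supplies EXPECTED_FIELDS values.
    return fields[EXPECTED_FIELDS - 1]

good_update = ",".join(f"f{i}" for i in range(EXPECTED_FIELDS))
bad_update = ",".join(f"f{i}" for i in range(EXPECTED_FIELDS - 1))  # one field short

print(interpret_channel_record(good_update))  # works as intended
try:
    interpret_channel_record(bad_update)      # reads past the end of the record
except IndexError as exc:
    # Python raises a recoverable exception; in kernel-mode C this is an
    # out-of-bounds read, and an unhandled kernel fault crashes the system.
    print(f"Interpreter fault: {exc}")
```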
The Rollout Process Bugs:
Bugs will happen to everyone, and testing failures can happen to any organization, but we’ve known how to help mitigate these types of issues for years. The congressional hearings will be an excellent opportunity to discover why CrowdStrike didn’t use these mitigating solutions, including staged rollouts and complete deployment control for customers.
If CrowdStrike had deployed the channel update to small test populations across different geographies and phased the rollout gradually, it could have stopped the bad file from spreading. If customers had been given the same control over these content updates that they already have over code updates, they could have prevented the automatic push from debilitating their systems all at once.
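As an illustration of what a phased rollout gate can look like, here is a minimal Python sketch. The ring names, fleet fractions, and crash-rate threshold are assumptions made up for the example, and the telemetry function is a placeholder; this is not a description of CrowdStrike’s or any other vendor’s actual pipeline.

```python
# Minimal staged-rollout sketch: push an update to progressively larger
# rings and halt automatically if crash telemetry exceeds a threshold.

ROLLOUT_RINGS = [               # hypothetical rings and fleet fractions
    ("internal hosts", 0.001),
    ("canary customers", 0.01),
    ("early adopters", 0.10),
    ("general availability", 1.0),
]
CRASH_RATE_THRESHOLD = 0.001    # halt if more than 0.1% of a ring crashes

def ring_crash_rate(ring_name: str) -> float:
    """Placeholder for real post-deployment crash telemetry."""
    return 0.0  # a real pipeline would query fleet health here

def staged_rollout(update_id: str) -> bool:
    for ring_name, fleet_fraction in ROLLOUT_RINGS:
        print(f"Deploying {update_id} to {ring_name} ({fleet_fraction:.1%} of fleet)")
        crash_rate = ring_crash_rate(ring_name)
        if crash_rate > CRASH_RATE_THRESHOLD:
            print(f"Halting rollout: crash rate {crash_rate:.2%} in {ring_name}")
            return False  # stop before the next, larger ring
    print(f"{update_id} fully deployed")
    return True

staged_rollout("channel-update-example")
```

Even a gate this simple would have confined a crashing update to a small fraction of the fleet before it ever reached general availability.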
Other vendors in the EDR space not only follow these rollout best practices but also avoid interpreting volatile content updates in the kernel, which makes rapid deployment safer in general.
The Fix and The Tricks:
Each affected system had to be recovered manually, a process complicated by the need for a BitLocker recovery key if the disk was encrypted. Microsoft released a recovery tool to help, while threat actors capitalized on the event, luring new victims with phishing and malware.
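For reference, the widely reported manual workaround boiled down to booting each machine into Safe Mode or the Windows Recovery Environment and deleting the faulty channel file. The sketch below merely automates that file match in Python; the directory and filename pattern reflect public remediation guidance, but this is an illustrative sketch, not an official recovery tool, and in practice the deletion was performed from a recovery command prompt.

```python
# Illustrative sketch of the published manual workaround: remove the faulty
# channel file so the sensor stops loading it at boot. Not an official tool.
from pathlib import Path

DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")  # reported location
BAD_CHANNEL_GLOB = "C-00000291*.sys"                           # reported pattern

def remove_bad_channel_files(driver_dir: Path = DRIVER_DIR) -> list[Path]:
    """Delete channel files matching the published pattern and report them."""
    removed = []
    for channel_file in driver_dir.glob(BAD_CHANNEL_GLOB):
        channel_file.unlink()
        removed.append(channel_file)
    return removed

if __name__ == "__main__":
    for path in remove_bad_channel_files():
        print(f"Removed {path}")
```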
The Repercussions:
It’s hard to overstate how much damage this faulty update did in the roughly 90 minutes it was available. A bad configuration file pushed to fewer than 1% of Windows machines extensively disrupted industries including airlines, healthcare, retail, and banking.

[Image credit: Katie Moussouris]
The financial repercussions of the CrowdStrike outage are staggering. Insurer Parametrix estimated that U.S. Fortune 500 companies, excluding Microsoft, will face $5.4 billion in financial losses. The global financial losses could be even more alarming, with Parametrix’s CEO suggesting they could reach nearly $15 billion.
The Lessons:
This wasn’t the first time CrowdStrike had made this type of mistake. In April, the company caused an outage for Linux customers, which had far less disastrous consequences. And CrowdStrike’s CEO was involved in a similar Windows boot-loop crash caused by McAfee software in 2010, when he was that company’s CTO.
In short, they should have known better. Now, hopefully, with the scale of this event, everyone does.
As Congress examines this outage more closely and determines how the government may be able to help prevent this from happening again at such a massive scale, all vendors in the security space that deploy rapid content updates should take the following steps:
Test all updates extensively, even those that do not directly contain code, including with dynamic analysis
Move as much execution out of kernel space as possible to minimize risk
Deploy rapid response updates in stages and have a way to roll back changes
Give customers control over anything that pushes changes to their systems, not just code updates but also configuration changes like this one (see the policy sketch after this list)
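As one illustration of that last point, a customer-side policy check might look like the following sketch. The policy fields and the N-1 version-pinning model are assumptions invented for the example, not any vendor’s actual API.

```python
# Hypothetical customer-side update policy: treat rapid content updates
# like code updates, with opt-in, version pinning, and maintenance windows.
from dataclasses import dataclass

@dataclass
class UpdatePolicy:
    allow_auto_content_updates: bool  # customer opt-in for vendor pushes
    version_offset: int               # 0 = latest, 1 = stay one version behind (N-1)
    maintenance_window_open: bool     # only apply inside an approved window

def should_apply(policy: UpdatePolicy, versions_behind_latest: int) -> bool:
    """Apply a vendor-pushed content update only if customer policy allows it."""
    if not policy.allow_auto_content_updates:
        return False
    if versions_behind_latest < policy.version_offset:
        return False  # update is newer than the customer's pinned offset
    return policy.maintenance_window_open

conservative = UpdatePolicy(allow_auto_content_updates=True,
                            version_offset=1,
                            maintenance_window_open=True)
print(should_apply(conservative, versions_behind_latest=0))  # False: too new
print(should_apply(conservative, versions_behind_latest=1))  # True: N-1 allowed
```

A policy like this would let a customer keep a freshly pushed content update off production machines until it had aged past the pinned version offset.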
While updates will still cause minor issues, if vendors commit to implementing these measures, we shouldn’t have to re-learn this very expensive and troubling lesson yet again.
----
About Luta Security:
Bugs are a symptom of underlying process failures in security. Luta Security helps organizations learn from their bugs, driving both technical improvements and process maturity. We take vulnerability disclosure and bug bounty programs to the next level, maximizing what you learn from each bug and getting the underlying issues fixed systemically.
Contact us today to learn how we can help.