Prepare for the next IT disaster now

Technology / opinion

Security vendor Crowdstrike releases preliminary report on what caused last week's IT disaster; it's a reminder that our IT is and will remain fallible

25th Jul 24, 3:11pm by Juha Saarinen

Meta AI's rendition of a broken hammer

The post-mortem for the massive "Crowdstruck" IT meltdown last Friday is out. Even though it's a preliminary post-incident report, and there might be more to come, Crowdstrike has published heaps of detail on what caused eight million to nine million Windows computers, physical and virtual, to crash and not restart properly.

Understanding the dense technical detail of the report isn't easy, but the security vendor said it was "problematic content" in an update that caused the crashes.

Developer Nic Wise suggested thinking of it as being a person with access to a bunch of deposit boxes in a bank. Each box has something in it that you need to do your job, and there are many of them, each with their own unique number.

Some of the boxes are yours, some are not, and all are guarded by an aggressive, armed bank manager who will shoot you if you open the wrong ones. In a computer, the bank manager is the low-level part of the operating system, the code that controls the hardware, provides access to files and folders on storage, handles networking and more.

This is called a kernel in computerese. As you can imagine, if the kernel malfunctions and crashes, your computer goes down with it. You do not want that to happen.

Next, you have a clipboard listing a sequence of boxes that you're meant to open to find the stuff your coworkers have put inside: data and instructions.

In the case of Crowdstrike, the clipboard says to open box 345. This is your box, so the aggressive bank manager doesn't shoot you when you open it. Even though your coworkers never put anything inside that box, you open it and do whatever it says on the piece of paper in it. These could be instructions from 12 months ago, or from someone else's list, and they're no longer correct or valid.

That's the error Crowdstrike said took place thanks to the problematic content in the update: an out-of-bounds memory read error. You access data you're entitled to see, but it's likely to be complete rubbish. For that reason, OOB (out of bounds) reads are something that operating systems try to prevent, by terminating misbehaving applications.
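To make the deposit box analogy concrete, here is a minimal Python sketch. It is not Crowdstrike's code and not how the Windows kernel works internally; the names `read_box`, `filled_count` and `OutOfBoundsRead` are made up for illustration. It only shows the difference between opening a box you own without checking whether anything valid was put in it, and checking first.

```python
# Hypothetical model of the analogy: memory is a row of numbered boxes.
class OutOfBoundsRead(Exception):
    """Raised when code asks for a box that was never filled."""


def read_box(boxes, filled_count, index):
    """Return the contents of box `index`, but only if it was actually filled.

    `boxes` stands in for a block of memory the program owns;
    `filled_count` is how many boxes at the start hold valid data.
    """
    if not 0 <= index < filled_count:
        raise OutOfBoundsRead(f"box {index} holds no valid data")
    return boxes[index]


# Ten boxes belong to us, but coworkers only filled the first three.
boxes = ["scan files", "check network", "log event"] + ["stale junk"] * 7
filled_count = 3

# Unchecked read: nothing stops us, but what we get back is rubbish.
unchecked = boxes[5]

# Checked read: the same request is refused before the rubbish is used.
try:
    read_box(boxes, filled_count, 5)
    checked = "read succeeded"
except OutOfBoundsRead:
    checked = "read refused"
```

In this toy version the unchecked read quietly returns `"stale junk"`, while the checked read is refused. The point of the sketch is simply that the check has to happen before the contents are acted on.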

Normally that's fine, but as Nic said, if it's the bank manager, that is the operating system kernel, reading the bad data, it could lead to a situation where the kernel itself is terminated.

When that happens, there's the Windows Blue Screen of Death (BSoD). Nic and I both remember how Windows used to just freeze with a BSoD, with data loss as the result.

Although Crowdstrike is blaming nobody but itself for the bug, its sensor threat detection software runs with very high privileges on computers. This gives it unfettered access to all parts of the computers it protects, which means it should be able to find malicious code hiding on them.

However, it's also vital for Crowdstrike to make sure the frequent updates to detect threats it sends out are always correct. If not, something bad can happen to the bank manager (the kernel) and we'll have worldwide chaos with BSoD screens to stare at.

Apple decided a few years ago it was just too unsafe to have third-party developers poke at things, deep in the operating system. It moved away from the so-called kernel extensions or KEXTs that had been part of macOS for ages, and restricted access to sensitive system areas.

Doing that without breaking some existing software, often popular things like Dropbox and Microsoft OneDrive cloud storage, can be difficult as developers need time to figure out how the new design works. But it makes crashes due to buggy low-level code from third-party developers much less likely. It also improves system security as malicious code is harder to hide deep in the operating system.

Before anyone chimes in, no, this is not fail-safe or a perfect solution to crashing computers and security woes, but I'm trying to remember the last time the Apple devices I use froze or conked out, and I can't.

That's even with pre-release software that I run to check out new features coming up.

Kernel mode is *the* problem. In 2024 changing software from third parties via a private update channel is about the highest risk setup and should not be a generally available capability. And if it is it should not be used in critical systems. https://t.co/m10r5dLl8X
— Steven Sinofsky (@stevesi) July 19, 2024

Sinofsky has some idea what he's talking about, being a Microsoft Windows veteran. Microsoft is no doubt aware of the problem, and it has devised mitigations to stop buggy code from taking down the Windows kernel. The "Crowdstruck" incident should be the excuse Microsoft needs to embark on a deeper re-engineering of the operating system, no matter that it'll break some software from third party developers.

The longer story short here is: it's risky to put all your eggs in the same basket. By now, after countless IT-related fiascos over the years, it should be clear to everyone that writing bug-free code isn't possible, and neither is engineering flawless hardware.

If you go through the Crowdstrike report, it's clear that the security vendor tests the code it ships thoroughly. Even then, it only took one mistake that mysteriously enough wasn't caught early on (it should've been - see below), and a very extensive and expensive disaster struck.

Btw the report mentions testing over 20 times, but for those who didn’t catch it - none of the channel tests involve actually putting the updates on a CrowdStrike system.

One of the follow up actions listed is to test updates on dev systems.

Zero real testing in reality.
— Kevin Beaumont (@GossiTheDog) July 24, 2024

It's possible to mitigate many such errors, the ones that have been observed in the past and that developers and engineers have had time to think about, but it's complicated stuff to say the least, for humans in particular. Nevertheless, there will always be something coming along that nobody thought about, often a combination of factors that isn't obvious.

Operating critical infrastructure on diverse systems may seem at first a way to mitigate the above, but the cost implications are unpalatable and the complexity of it brings risks by itself. Also, you can't introduce diverse, redundant infrastructure everywhere in a world with interoperating systems, many of which any given organisation doesn't control.

There will be lots of "expert" commentary weighing in on how to fix this, as always, but it's a bit like Heidegger's Hammer: IT works great until it doesn't, and then the tool locks us out of its world of usefulness, which today is enormous.

Ironically enough, IT having become more reliable over the years has lulled us into a false sense of security. Nobody expects things to break. They do though, so we shouldn't be surprised at the consequences being serious, but be ready instead.

14 Comments

by mark_a | 25th Jul 24, 3:20pm

Keeping EU do-gooders out would be a good start.

Seen the arguments about too much concentration - feels it's the latter - too many cooks in the kernel kitchen...

by Stephen06 | 25th Jul 24, 6:15pm

...The longer story short here is: it's a risky to put all your eggs in the same basket...

And that basket is called the cloud - the biggest con pulled by the IT sector ever.

Stick to your own server people, no one can interfere then!

by Kohukohunui | 25th Jul 24, 7:14pm

Stick to your own server people, no one can interfere then!

Sure... you mean with an airgap right?

Otherwise it's a meaningless difference, because it was the same result if you installed Crowdstrike on your "own" server. In fact I'd say this was a huge proportion of the servers impacted, because there's much less reason for this sort of junk agent on ephemeral+immutable VMs spun up in a cloud.

by Roger the dodger | 26th Jul 24, 7:13am

Yeah nah.

by nktokyo | 28th Jul 24, 10:35am

Disagree. The cloud means I can have email, CRM, phone etc without needing a HW room, patch panel, software admin etc. It's a massive cost savings. If there's an outage every now and then I will wear it.

by Kohukohunui | 25th Jul 24, 9:24pm

So much written on this topic without understanding the heart of the issue.

Two easy lessons.

1. Don't run a rootkit from a 3rd party vendor on your essential services with realtime updates out of your control, especially not from a vendor that sells tick-box compliance to executives over expensive dinners. This might be acceptable on your company laptops with sticky keys out in the field (they won't take down your banking transactions or airlines), but not on servers running essential shared services. This type of external change is exactly what these servers need to be hardened against; instead you've created a huge vector for instability and attack.

2. Don't put middle/upper managers in charge of your tech who will be seduced by snake-oil "security" companies selling tick-box compliance software under the guise of "security", which create the illusion you can solve or outsource security with a tool.

If you do (2), then (1) takes care of itself. As evidence, note that most of the organisations impacted were traditional, non-tech, compliance-heavy organisations (transport orgs like ports/airlines, banks, local govt, insurance etc).

Meanwhile, you could happily stream youtube/netflix and shop on Amazon/Apple/Uber Eats: put bluntly, these companies don't have idiots installing 3rd party crap on their servers, and aren't suckers for anti-virus FUD.

by Averageman | 26th Jul 24, 12:29pm

Wow. If only Windows was not so weak. Should it be able to ignore something looking for a memory reference that is not there, or should it just blue screen as a default response? You may as well claim everyone should just run Linux and be done with it.

by Kohukohunui | 26th Jul 24, 6:06pm

Same type of issue could have impacted their Linux agent too, and has in fact done so in the past. See https://access.redhat.com/solutions/7068083.

by pacifica | 25th Jul 24, 8:05pm

If you knew the CEO you would never have signed up for Crowdstrike, as they have a track record that includes no actual knowledge of basic testing, QA or deployment procedures, leading to massive failure. That report proves it quite well (e.g. the content validator reliance; seriously wtf!?): even cowboys and small one-man IT companies have better policies for deploying to their own home networks.

Which raises an interesting question: what income range do you have to be in to face no consequences and just fail up to other C-suite roles? I know we did it in NZ for Stephen Town and many others in our government depts and companies, but it seems clear there is a big divide. Below a certain level, minor issues cause job losses and the need to transfer out of an industry, yet for those above a certain pay rate, absolutely catastrophic failures lead to business as usual or golden handshakes into even higher-paid roles.

by Kate | 25th Jul 24, 9:08pm

I think the big "be prepared" lesson is - always keep cash on hand.

by SimonRo | 26th Jul 24, 4:38pm

From what I read the bizarrely missing piece in CrowdStrike's approach would be staged rollout of their patches. Deliver the patch to 1000 customers for 48 hours first and wait for the all clear before you let your mistakes spread out to the world. Many others do this.

by Kohukohunui | 26th Jul 24, 6:04pm

Sure, but if you're running a bank or airline, you can't just cross your fingers and hope your vendor is doing this. When they don't, you're responsible, not them.

by pacifica | 27th Jul 24, 7:26am

You could say the same about them all using Microsoft Windows with automatic updates enabled... this is a big no in any company working with critical information & services, and yet Windows makes it very difficult to manage updates without an update failure hitting systems en masse, and even more difficult to process rollbacks without substantial backups and complete reinstalls.

If their solution is to reinstall and pray there are good backups, well, sadly most companies are not fully prepared to do that on any given work day, and backup restoration is often completely untested.

This last update corrupted certificates and had many other update corruptions that affected a multitude of systems & functions. Yet it was a forced, automatic restart without any chance to prevent the update occurring during work hours (it literally just shut down the device & systems). Hence you really need to break the out-of-the-box functionality to prevent updates that are forced and automatic, that come without adequate testing, and whose scheduling settings are not actually followed (Windows often just ignores many of them, so you have to go in and disable things so that updates fail to start, to prevent them occurring during critical times).

by pacifica | 27th Jul 24, 7:15am

Even for companies or individuals managing a small number of devices and customers, staged deployment is standard practice, so common that it is mind-blowing Crowdstrike did not know how to do it even on a regional basis.