In July 2024, a few days after what many called the largest IT outage in history, I wrote a short post about it. I want to come back to that post, because the full account that has emerged since has changed how I read what happened.

At the time, the post read like an appreciation note. I called CrowdStrike’s response “swift and transparent,” and spent most of it on a point I still partly believe: that we only notice infrastructure when it breaks, and the people keeping it running rarely get the credit. What I did not have then was the root-cause analysis, the congressional testimony, the lawsuits, or the year of platform changes that followed. With those in hand, the story is less about resilient systems holding the line and more about a plain failure of release discipline. That is the more useful thing to write about, and it is the part my first take got wrong.

What actually broke

CrowdStrike sells security software called Falcon. Its working part is the sensor — not a piece of hardware, despite the name, but a software agent that installs on each machine and runs deep inside Windows, in the kernel: the privileged core of the operating system, where code can watch everything, but where a crash takes down the whole machine rather than a single program. To keep up with new attack techniques without shipping a whole new sensor each time, Falcon pulls down small configuration files — CrowdStrike calls them Rapid Response Content — that tell the sensor already on the machine what to look for. In February 2024 the company added a capability of this kind aimed at a particular Windows mechanism (named pipes, one of the ways programs talk to each other), governed by a configuration called Channel File 291. Its first content shipped in March and behaved as expected through the spring.

On the morning of 19 July 2024, at 04:09 UTC, a new Rapid Response update to Channel File 291 went out to Windows hosts. Per CrowdStrike’s own root-cause analysis, the sensor on those machines expected twenty input fields, while this update referenced a twenty-first. When the sensor’s interpreter went to read that twenty-first value, it was reading past the end of an array that only held twenty — an out-of-bounds memory read, a read of memory that was never there to be read. In kernel code there is no quiet recovery from that: the read crashed the sensor, and the sensor took Windows down with it, into the blue screen of death — the full-stop crash screen — on repeat.

The twenty-first field had never been used in production before this update, so the case that would have exposed it had never run. CrowdStrike’s content checker, the Validator, did not catch the mismatch; a separate runtime bounds check that might have contained it was absent. Both gaps were documented and fixed afterwards. The company reverted the bad file at 05:27 UTC, about seventy-eight minutes later, but reversion does not un-crash a machine that already pulled the file — recovery meant touching devices, often by hand, often booting into safe mode to delete a single file.

The blast radius is the part worth sitting with. The update reached roughly 8.5 million Windows devices — well under one percent of all Windows machines, but concentrated in exactly the places that cannot absorb it: airlines grounded flights, hospitals postponed procedures, banks and broadcasters and payment systems stalled. A configuration file with a counting error became a global event because of where the code that read it was allowed to sit.

A timeline

The dates matter, because the public understanding of this incident arrived in stages: the crash first, the cause weeks later, and the accountability and platform changes across the year that followed.

  1. February 2024
    CrowdStrike ships the new sensor capability behind Channel File 291.
  2. 5 March 2024
    The first Rapid Response Content for it goes to production after a stress test; later spring updates behave normally.
  3. 19 July 2024, 04:09 UTC
    The faulty update ships; the twenty-versus-twenty-one mismatch triggers an out-of-bounds read and mass crashes.
  4. 19 July 2024, 05:27 UTC
    CrowdStrike reverts the file, about seventy-eight minutes later. Roughly 8.5 million devices are affected.
  5. 22 July 2024
    CrowdStrike introduces automated techniques to speed up remediation.
  6. 24 July 2024
    A preliminary review is published, in which the company acknowledges it had over-relied on past success.
  7. 29 July 2024
    About 99% of Windows sensors are back online.
  8. 6 August 2024
    CrowdStrike publishes the full twelve-page root-cause analysis.
  9. 24 September 2024
    CrowdStrike testifies before a US House Homeland Security subcommittee.
  10. 25 October 2024
    Delta Air Lines sues CrowdStrike over roughly $500 million in losses; CrowdStrike countersues.
  11. May 2025
    A Georgia judge lets most of Delta’s suit proceed, including gross-negligence claims.
  12. 2025
    Microsoft begins moving third-party security software out of the Windows kernel.

Where my first read was too generous

“Swift and transparent” was not wrong, exactly. CrowdStrike did revert quickly, and it did publish an unusually detailed analysis. But framing the response as the story let the company off the hook for the cause, and the cause was not bad luck. It was a content update that reached millions of production machines at once, with no staged rollout to catch it on a few first, a validation step that tested content but missed this class of error, and no safe fallback once the machines were down.

CrowdStrike said as much itself. Testifying before Congress that September, the company’s representative told the subcommittee that its checks had tested content files individually but not together, and that the new process would test updates internally before sending them to anyone. He also offered a line that has stuck with me:

Trust takes years to make and seconds to break.
The written testimony is on the record, and it reads less like an accident report than a description of process that had been trusted because it had always worked before.

The legal read sharpened the point. When Delta sued, its argument was not that software has bugs — everyone’s does — but that this was a breakdown of basic practice: no testing, no staging, no rollback. In May 2025 a judge found that argument credible enough to let most of the case, including gross negligence, go forward. You can disagree with Delta about damages and still notice that a court looked at the same facts I had waved past and saw a discipline problem, not an act of God.

The year after

Two things changed that make the incident more than a bad day. The first is accountability that outlived the news cycle: the congressional hearing, the still-running Delta litigation, and an industry that now reaches for this outage as the standard example of concentration risk — one vendor, one file, a meaningful slice of the world’s computers.

The second is structural, and it is the most interesting outcome. The reason a sensor crash became a Windows crash is that the sensor lived in the kernel, where a fault has nowhere to fall but the whole machine. Following the outage, Microsoft started building a way for security vendors to run outside the kernel, in the same user space as ordinary applications, where a crash stays contained. It also began requiring partners to ship updates gradually, through deployment rings, with monitoring — the staged rollout whose absence caused this. The platform is being reshaped around the lesson.

What it actually teaches

Strip away the scale and the failure is ordinary, which is exactly why it is worth keeping. Ship a change to everything at once and a small mistake becomes a large one. Test the parts but never the whole, and the gap between them is where the outage lives. Trust a process because it has never failed, and you have stopped checking whether it still works. None of this is specific to kernel drivers or to security software; it is true of anything that updates itself on other people’s machines, and the cost stays invisible right up until the moment it is enormous.

So this is the correction to my July 2024 post. The instinct in it was not wrong — reliable systems really are the work of people you never hear about, and that is worth saying. But the appreciation belongs to the discipline that prevents these events, not to the speed of the apology after one. A year of record turned an appreciation note into something more precise, and I would rather be precise.


Primary sources

Everything above traces back to documents you can read yourself. The dates and figures are from these: