From BSOD to EDR: A Simple Explanation of the Recent IT Meltdown

By now, almost everyone has heard about CrowdStrike and the global IT outage caused by a buggy update to one of their software components. This incident, which affected businesses and organizations worldwide, has been a hot topic in tech news and beyond. But for those who aren’t tech-savvy, understanding what exactly happened can be challenging.

Let’s break down this complex event, explaining the key players and repeatedly used acronyms in simple terms to help you understand the widespread impact without getting lost in technical jargon.

Who is CrowdStrike?

    CrowdStrike Holdings, Inc. is an American cybersecurity technology company based in Austin, Texas. The company provides endpoint security software for Microsoft Windows and other platforms, and many industries globally—from banking to retail to healthcare—use its software to protect against breaches and hackers.

    • What is EDR?

    Think of Endpoint Detection and Response (EDR) as a vigilant security guard for your computer, constantly watching for suspicious activity. Falcon is the EDR product from CrowdStrike that organizations install on their computers to protect them from cyberattacks and malware.

    • What is BSOD?

    The Blue Screen of Death, or BSOD, is a critical error screen on Windows PCs that halts all operations and displays an error message. This occurs when the system encounters a severe issue, often leading to an unexpected restart and potential data loss.

    • What Caused the Outage?

    Imagine installing a new lock on your front door, only to find the lock is faulty and now the door won’t open at all. That’s similar to what happened with the Falcon update. Because new ransomware and attack patterns emerge daily, EDR products require frequent updates, and this outage was caused by a defect in one such content update to the Falcon software for Windows hosts.

    In simple terms, the development team at CrowdStrike created a new update for Falcon, presumably tested it (though not robustly), and pushed it to user machines. Because it contained faulty code, it caused the machines to crash. But the question remains: why were users’ computers unable to recover on subsequent restarts, and why was the BSOD persistent?

    This is because Falcon operates at the kernel level, the core of the operating system, and its driver loads early in the boot process. Since the faulty content was read again on every start, each restart crashed the machine before Windows could finish loading, leaving it stuck in a crash loop.
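The crash loop can be sketched conceptually. The snippet below is a hypothetical simulation, not CrowdStrike's actual driver code: it models a driver that parses every content file at boot, so a malformed file crashes the machine on every restart until that file is removed.

```python
from pathlib import Path
import tempfile

def load_content_file(path: Path) -> None:
    """Stand-in for the kernel driver parsing a content file at boot."""
    data = path.read_bytes()
    if data.strip() == b"":  # a malformed file triggers a fatal error (the BSOD)
        raise RuntimeError("BSOD: driver crashed while parsing " + path.name)

def boot(content_dir: Path) -> str:
    """Simulated boot: the driver loads every content file before the OS is up."""
    for f in sorted(content_dir.glob("C-*.sys")):
        load_content_file(f)  # a crash here aborts the entire boot
    return "booted"

content_dir = Path(tempfile.mkdtemp())
# Hypothetical file name, modeled on the pattern from the published workaround.
(content_dir / "C-00000291-00000000-00000032.sys").write_bytes(b"")  # faulty update

# Rebooting does not help: the same faulty file is parsed on every start.
for attempt in range(3):
    try:
        boot(content_dir)
    except RuntimeError as e:
        print(f"restart {attempt + 1}: {e}")

# Only removing the file breaks the loop.
for f in content_dir.glob("C-00000291*.sys"):
    f.unlink()
print(boot(content_dir))  # → booted
```

The point of the sketch is the ordering: because the driver runs before the operating system finishes starting, the system never reaches a state where the bad file could be replaced automatically.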

    • Is it a Cyber Attack?

    CrowdStrike stated that the incident was not caused by a cyberattack; it appears instead to stem from weak change and release management practices. The update evidently did not go through a robust change management process, which would include testing in development and pre-production environments. Typically, such updates are also released to a small subset of users before a wider rollout.

    • What is the Solution?

    The fix involves booting the system into Safe Mode or the Windows Recovery Environment, navigating to the CrowdStrike driver directory (%WINDIR%\System32\drivers\CrowdStrike), and deleting the file matching “C-00000291*.sys.”

    While this may sound simple, it’s difficult at scale, especially for large organizations. Imagine a company with thousands of machines, many used by remote workers: accessing each machine to delete the culprit file becomes a logistical nightmare. This fix is intended for IT professionals and experts, not regular users.
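On a single machine, the file-matching step of the workaround can be sketched like this. The directory and the `C-00000291*.sys` pattern come from CrowdStrike's published guidance; the helper function and the demonstration directory below are hypothetical, shown against a throwaway folder rather than a live Windows system.

```python
from pathlib import Path
import tempfile

def remove_faulty_channel_files(driver_dir: Path) -> list[str]:
    """Delete files matching the pattern named in the remediation guidance
    (C-00000291*.sys) and return the names removed. On a real host,
    driver_dir would be %WINDIR%\\System32\\drivers\\CrowdStrike, reached
    from Safe Mode or the Windows Recovery Environment."""
    removed = []
    for f in driver_dir.glob("C-00000291*.sys"):
        f.unlink()
        removed.append(f.name)
    return removed

# Demonstration against a stand-in directory, not a real Windows install.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "C-00000291-00000000-00000032.sys").touch()  # the faulty file
(demo_dir / "C-00000290-00000000-00000001.sys").touch()  # unrelated, must survive

print(remove_faulty_channel_files(demo_dir))  # → ['C-00000291-00000000-00000032.sys']
print([f.name for f in demo_dir.glob("*.sys")])  # the other file remains
```

Note the wildcard: it targets only the one faulty channel file while leaving every other CrowdStrike content file in place, which is why the guidance names such a specific pattern.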

    • Potential Security Implications

    The U.S. Cybersecurity and Infrastructure Security Agency (CISA) has warned that threat actors are exploiting the outage to conduct phishing campaigns and other malicious activity, and urges organizations to ensure they have robust cybersecurity measures in place to protect their users, assets, and data.

    Closing Thoughts

    This incident serves as a stark reminder of our increasing dependence on technology and the far-reaching consequences of even minor errors in critical systems. It underscores the need for rigorous testing protocols, robust backup systems, and comprehensive disaster recovery plans. Organizations should reassess and revise their existing plans based on insights from CrowdStrike and Microsoft regarding the root cause, sequence of events, and their commitments to prevent future occurrences. Additionally, companies may want to review their contracts with software vendors to clarify liability in light of such incidents.

    For the second and third lines of defense, change and software release management processes require appropriate oversight and regular, thorough evaluation. All changes to the production environment should undergo rigorous testing, with results documented and retained. This comprehensive approach will help build more resilient systems and mitigate the impact of future incidents on businesses and individuals alike.
