CrowdStrike Chaos: A Wake-Up Call for Software Quality

Posted by: Alper MERMER
Post Date: 12 August 2024

I’m sure that by now all of you are aware of the recent CrowdStrike incident on Microsoft Windows devices, which brought a significant part of our software world to a standstill. The incident exposed weaknesses in a critical update mechanism and underscored the importance of robust software quality processes.

Drawing on CrowdStrike’s Preliminary Post Incident Review, let’s briefly look at what happened and then reflect on how we can apply these learnings to our own practices. By examining the causes and consequences of this event, we can identify strategies to fortify our systems against similar disruptions.

What Happened?
Preliminary Post Incident Review: Configuration Update Impacting Falcon Sensor and Windows OS

Incident Overview

CrowdStrike released a content configuration update for the Windows sensor on July 19, 2024, aimed at gathering telemetry on new threat techniques. The update instead caused a system crash (Blue Screen of Death, BSOD) on Windows hosts running sensor version 7.11 and above that were online and received the update between 04:09 and 05:27 UTC. The issue was resolved by reverting the update at 05:27 UTC.

Key Details

Date & Time: July 19, 2024, 04:09 – 05:27 UTC
Affected Systems: Windows hosts running sensor version 7.11+ (Mac and Linux hosts were unaffected)
Resolution: Update reverted at 05:27 UTC

Root Cause

The problem stemmed from a Rapid Response Content update, a type of dynamic update that allows quick adaptation to emerging threats. An undetected error in this update led to the system crash.

Update Delivery Mechanisms

Sensor Content: Long-term capabilities delivered with sensor releases, including on-sensor AI and machine learning models. These updates undergo extensive testing.
Rapid Response Content: Behavioral pattern-matching updates delivered dynamically. These updates include Template Instances, which configure the sensor to detect specific behaviors.

Testing and Deployment Process

Sensor Content: Includes thorough automated and manual testing, staged rollout, and customer-controlled deployment.
Rapid Response Content: Deployed through Channel Files, interpreted by the sensor’s Content Interpreter and Detection Engine. While newly released Template Types are stress tested, an error in the Content Validator allowed problematic content to pass undetected.

Specifics of the Incident

Trigger: Deployment of two new IPC (InterProcessCommunication) Template Instances on July 19, 2024. A bug in the Content Validator allowed one of these instances with problematic content data to pass validation.
Effect: Problematic content caused an out-of-bounds memory read, leading to an unhandled exception and subsequent system crash.
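
The failure mode described above can be illustrated with a small sketch. It is not CrowdStrike’s code: `TemplateInstance`, `EXPECTED_FIELDS`, `validate_instance`, and `read_field` are hypothetical names, and the field count is made up. It shows two independent safeguards: a validator that rejects an instance whose field count does not match what the interpreter expects, and an interpreter-side read that checks bounds itself rather than trusting the validator.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical: number of input fields the template type declares.
EXPECTED_FIELDS = 5


@dataclass
class TemplateInstance:
    name: str
    fields: List[str]


def validate_instance(instance: TemplateInstance) -> List[str]:
    """Return a list of validation errors (empty means the instance passes)."""
    errors = []
    if len(instance.fields) != EXPECTED_FIELDS:
        errors.append(
            f"{instance.name}: expected {EXPECTED_FIELDS} fields, "
            f"got {len(instance.fields)}"
        )
    return errors


def read_field(instance: TemplateInstance, index: int) -> Optional[str]:
    """Bounds-checked read: returns None instead of reading past the end."""
    if 0 <= index < len(instance.fields):
        return instance.fields[index]
    return None


good = TemplateInstance("ipc-a", ["a", "b", "c", "d", "e"])
bad = TemplateInstance("ipc-b", ["a", "b", "c", "d"])  # one field short

print(validate_instance(good))  # no errors
print(validate_instance(bad))   # one error describing the mismatch
print(read_field(bad, 4))       # None rather than an out-of-bounds read
```

Python would raise an `IndexError` on an unchecked read, whereas native code running in the kernel may read invalid memory and crash the host. That difference is why the second, interpreter-side check matters even when a validator exists.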

Preventative Measures

To prevent similar incidents, CrowdStrike will:

Enhance Testing: Introduce more comprehensive testing methods, including stress testing, fuzzing, and fault injection.
Improve Content Validator: Add checks to prevent problematic content from being deployed.
Strengthen Error Handling: Enhance the Content Interpreter’s ability to handle exceptions.
Deployment Strategy: Implement staggered deployments for Rapid Response Content, starting with canary deployments and collecting feedback before broader rollout.
Customer Control: Provide more control over update deployments and detailed release notes.
Third-Party Reviews: Conduct independent security code reviews and quality process evaluations.
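
The staggered deployment idea in the list above can be sketched as a loop over rollout rings that stops as soon as one ring looks unhealthy. This is an illustration under stated assumptions, not CrowdStrike’s process: the ring names, fleet fractions, `MAX_CRASH_RATE` threshold, and the `crash_rate_for` callback (a stand-in for real telemetry gathered after a soak period) are all hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical rollout rings: (name, cumulative fraction of the fleet reached).
RINGS: List[Tuple[str, float]] = [
    ("canary", 0.01),
    ("early", 0.10),
    ("broad", 0.50),
    ("all", 1.00),
]

# Assumed threshold: halt if more than 0.1% of updated hosts report a crash.
MAX_CRASH_RATE = 0.001


def staged_rollout(crash_rate_for: Callable[[str], float]) -> Tuple[str, List[str]]:
    """Advance ring by ring; stop and report a rollback on the first unhealthy ring.

    Returns ("completed" or "rolled_back", rings that received the update).
    """
    deployed: List[str] = []
    for ring, _fraction in RINGS:
        deployed.append(ring)
        if crash_rate_for(ring) > MAX_CRASH_RATE:
            return "rolled_back", deployed
    return "completed", deployed


healthy = staged_rollout(lambda ring: 0.0)
faulty = staged_rollout(lambda ring: 0.9)
print(healthy)
print(faulty)
```

With a faulty update, only the canary ring is exposed before the rollout halts, which is the whole point of the strategy: the blast radius is bounded by the size of the first ring.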

Next Steps

CrowdStrike will release a full Root Cause Analysis after completing the investigation, detailing the incident and the steps taken to mitigate future risks.

Lessons for the Industry
Now, let’s reflect on how to guard against this kind of problem. The CrowdStrike incident is a reminder to strengthen our own software quality processes so that issues are caught early and resolved quickly. Comprehensive automated testing, complemented by human insight, helps identify problems before they reach users. A resilient delivery infrastructure limits the damage when something does slip through, and external evaluations provide unbiased assessments and fresh perspectives. The sections below look at each of these in turn.

Create More Robust Continuous Delivery Pipelines, Including Heavy Alerting on Automated Test Failures

Continuous Delivery (CD) pipelines are the backbone of modern software deployment, allowing for frequent, reliable updates. Investing in robust CD pipelines is critical for maintaining software quality and ensuring seamless user experiences. By incorporating extensive alerting mechanisms for automated test failures, companies can catch and address issues early in the development cycle, reducing the risk of bugs reaching production.

Early Detection and Resolution: Robust CD pipelines with alerting systems ensure that any test failure is immediately flagged, enabling developers to address issues promptly. This reduces the time between defect introduction and resolution, thereby maintaining high code quality.
Automated Rollbacks: Advanced pipelines can automatically roll back changes that cause test failures, preventing problematic code from affecting end-users. This ensures system stability and reliability.
Increased Developer Productivity: With automated alerts, developers can focus on coding rather than manually checking for errors. This leads to higher productivity and more time spent on innovation rather than maintenance.
Improved Collaboration: Integrating alerts into communication tools fosters collaboration among teams. When a failure is detected, the relevant team members are notified immediately, enabling a quick, coordinated response.
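
The points above can be sketched as a release gate: every check runs, each failure triggers an alert, and the previous version stays live if anything fails. This is a minimal sketch, not a real CI system’s API. `run_pipeline`, `PipelineResult`, and the `notify` callback are hypothetical, and keeping the current version active is only the simplest form of rollback.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PipelineResult:
    released: bool
    alerts: List[str] = field(default_factory=list)
    active_version: str = ""


def run_pipeline(
    candidate: str,
    current: str,
    checks: Dict[str, Callable[[], bool]],
    notify: Callable[[str], None],
) -> PipelineResult:
    """Run every check; alert on each failure and keep `current` live if any fail."""
    alerts: List[str] = []
    for name, check in checks.items():
        try:
            passed = check()
        except Exception as exc:  # a crashing check counts as a failure
            passed = False
            name = f"{name} ({exc!r})"
        if not passed:
            message = f"[{candidate}] check failed: {name}"
            alerts.append(message)
            notify(message)
    if alerts:
        return PipelineResult(False, alerts, current)
    return PipelineResult(True, alerts, candidate)


sent: List[str] = []  # stand-in for a chat or paging integration
result = run_pipeline(
    candidate="v2",
    current="v1",
    checks={"unit": lambda: True, "e2e": lambda: False},
    notify=sent.append,
)
print(result.released, result.active_version, sent)
```

In practice `notify` would post to whatever channel the team already watches, so that a failed check is seen by the people who can act on it.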

Invest More in End-to-End Automated Testing

End-to-end (E2E) automated testing is essential for verifying that software works as intended across the entire application flow. Investing in this area ensures comprehensive coverage and higher confidence in the software’s performance and functionality.

Holistic Coverage: E2E testing simulates real user scenarios, ensuring that all components of an application work together seamlessly. This type of testing can uncover integration issues that unit or functional tests might miss.
Scalability: Automated E2E tests can be run frequently and at scale, making it easier to catch regressions and new issues in large and complex applications. This scalability is crucial for supporting continuous integration and delivery practices.
Cost Efficiency: While initial setup for E2E automated testing may be high, it saves costs in the long run by reducing the need for extensive manual testing and quickly identifying issues that could become costly bugs if left unchecked.
Enhanced User Experience: By ensuring that the software behaves correctly across different use cases, E2E testing directly contributes to a better user experience. Satisfied users are more likely to remain loyal and recommend the software to others.
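
To make the idea concrete, here is a small sketch of a test that drives a whole flow through its public entry point and then checks every component the flow touches. The `Shop`, `Inventory`, and `Payments` classes are hypothetical in-memory stand-ins chosen to keep the example runnable. A real E2E suite would exercise the deployed application through its UI or API instead.

```python
import unittest


class Inventory:
    def __init__(self, stock):
        self.stock = dict(stock)

    def reserve(self, item, qty):
        if self.stock.get(item, 0) < qty:
            raise ValueError(f"insufficient stock for {item}")
        self.stock[item] -= qty


class Payments:
    def __init__(self):
        self.charged = []

    def charge(self, user, amount):
        self.charged.append((user, amount))


class Shop:
    """Entry point a user would hit; wires the components together."""

    PRICES = {"book": 12}

    def __init__(self, inventory, payments):
        self.inventory = inventory
        self.payments = payments

    def checkout(self, user, item, qty):
        self.inventory.reserve(item, qty)
        total = self.PRICES[item] * qty
        self.payments.charge(user, total)
        return total


class CheckoutFlowTest(unittest.TestCase):
    def setUp(self):
        self.inventory = Inventory({"book": 2})
        self.payments = Payments()
        self.shop = Shop(self.inventory, self.payments)

    def test_successful_purchase_updates_every_component(self):
        total = self.shop.checkout("ada", "book", 2)
        self.assertEqual(total, 24)
        self.assertEqual(self.inventory.stock["book"], 0)
        self.assertEqual(self.payments.charged, [("ada", 24)])

    def test_out_of_stock_purchase_charges_nothing(self):
        with self.assertRaises(ValueError):
            self.shop.checkout("ada", "book", 3)
        self.assertEqual(self.payments.charged, [])
```

Saved as a module, the tests can be run with `python -m unittest`. The second test is the kind of cross-component check that unit tests on `Inventory` or `Payments` alone would not express: a failure in one component must not leave a side effect in another.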

Invest in Testing on Real Devices, at Least a Representative Subset of Devices

Testing on real devices is crucial for understanding how software behaves in real-world conditions. Emulated environments can miss subtle differences that become apparent only on actual hardware.

Real-World Accuracy: Real device testing provides the most accurate representation of how an application will perform in users’ hands, capturing device-specific issues that emulators might miss.
Diverse Coverage: Testing on a representative subset of devices ensures that the software works across different hardware configurations, operating systems, and screen sizes, offering a more inclusive user experience.
Performance Metrics: Real devices provide accurate performance metrics, which are crucial for identifying issues related to speed, responsiveness, and resource usage that can significantly impact user satisfaction.
Reliability and Trust: Demonstrating a commitment to thorough testing on real devices builds trust with users and stakeholders, showing that the company prioritizes delivering high-quality software.
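
One way to choose a representative subset, sketched below, is a greedy set cover: keep picking the device that adds the most traits not yet covered until every trait is exercised at least once. The device names and traits in `DEVICES` are invented for illustration, and greedy selection is just one reasonable heuristic. It does not account for market share or for combinations of traits.

```python
from typing import Dict, List, Set

# Hypothetical device lab inventory: device name -> traits it exercises.
DEVICES: Dict[str, Set[str]] = {
    "laptop-a": {"windows-11", "x64", "8gb"},
    "laptop-b": {"windows-10", "x64", "16gb"},
    "tablet-c": {"windows-11", "arm64", "8gb"},
    "desktop-d": {"windows-10", "x64", "8gb"},
}


def pick_representative(devices: Dict[str, Set[str]]) -> List[str]:
    """Greedy set cover: repeatedly take the device adding the most untested traits."""
    uncovered: Set[str] = set().union(*devices.values())
    chosen: List[str] = []
    while uncovered:
        # sorted() makes ties deterministic (alphabetical order wins).
        best = max(sorted(devices), key=lambda d: len(devices[d] & uncovered))
        chosen.append(best)
        uncovered -= devices[best]
    return chosen


subset = pick_representative(DEVICES)
print(subset)
```

Here three of the four devices are enough to cover every operating system version, architecture, and memory size in the inventory at least once.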

Incorporate Human Testers for Intuitive Contributions

While automation is critical, human testers bring intuition, creativity, and a user-centric perspective that machines cannot replicate. Including humans in the testing process is essential for uncovering nuanced issues and enhancing software quality.

Intuitive Insights: Human testers can identify usability issues, ambiguous user flows, and other intuitive problems that automated tests might overlook. Their feedback is crucial for refining the user experience.
Exploratory Testing: Humans excel at exploratory testing, where they interact with the software in unscripted ways, often uncovering edge cases and unexpected behavior that scripted tests might miss.
Contextual Understanding: Human testers understand the context in which the software will be used, allowing them to test scenarios that reflect real user environments and behaviors.
Adaptive Learning: Unlike automated tests, human testers can adapt their testing strategies based on new information, making them invaluable for identifying and addressing emerging issues dynamically.

Get Third-Party, Independent Help to Ensure Different Sets of Eyes Are Looking at the Problem

Engaging third-party experts for independent reviews brings fresh perspectives and additional expertise, enhancing the overall quality and security of the software.

Unbiased Evaluation: Third-party reviewers provide an unbiased assessment of the software, identifying issues that internal teams might overlook due to familiarity or cognitive biases.
Diverse Expertise: External experts bring diverse skills and knowledge, often including specialized testing techniques and tools that can uncover hidden issues and vulnerabilities.
Enhanced Credibility: Third-party validation adds credibility to the software’s quality and security claims, which can be particularly important for regulatory compliance and building customer trust.
Continuous Improvement: Independent reviews can offer new insights and recommendations for improving internal processes, leading to ongoing enhancements in software development and testing practices.

By investing in these areas, you can significantly enhance your software quality processes, leading to more reliable, secure, and user-friendly applications. This, in turn, helps prevent catastrophic failures like this one and drives customer satisfaction, loyalty, and ultimately business success.
