Alper MERMER, Author at Testinium
https://testinium.com/author/alper-mermer/

CrowdStrike Chaos: A Wake-Up Call for Software Quality
https://testinium.com/blog/crowdstrike-chaos-a-wake-up-call-for-software-quality/ (12 August 2024)

I’m sure by this time, all of you are at least aware of the recent CrowdStrike issue on Microsoft Windows devices, which brought a significant part of our software world to a standstill. This incident not only highlighted vulnerabilities within critical update mechanisms but also underscored the importance of robust software quality processes. 

Drawing on the detailed report found here, let's briefly dive into what happened and then reflect on how we can use these lessons to strengthen our own practices. By examining the causes and consequences of this event, we can explore key strategies to fortify our systems against similar disruptions in the future.

What Happened?
Preliminary Post-Incident Review: Configuration Update Impacting Falcon Sensor and Windows OS

Incident Overview

CrowdStrike released a content configuration update for the Windows sensor on July 19, 2024, aimed at gathering telemetry on new threat techniques. This update, however, caused a system crash (Blue Screen of Death, or BSOD) on Windows hosts running sensor version 7.11 and above that were online during the affected timeframe. The issue was resolved by reverting the update shortly afterwards.

Key Details

  • Date & Time: July 19, 2024, 04:09 – 05:27 UTC

  • Affected Systems: Windows hosts running sensor version 7.11+ (Mac and Linux hosts were unaffected)

  • Resolution: Update reverted at 05:27 UTC

Root Cause

The problem stemmed from a Rapid Response Content update, a type of dynamic update that allows quick adaptation to emerging threats. An undetected error in this update led to the system crash.

Update Delivery Mechanisms

  • Sensor Content: Long-term capabilities delivered with sensor releases, including on-sensor AI and machine learning models. These updates undergo extensive testing.

  • Rapid Response Content: Behavioral pattern-matching updates delivered dynamically. These updates include Template Instances, which configure the sensor to detect specific behaviors.

Testing and Deployment Process

  1. Sensor Content: Includes thorough automated and manual testing, staged rollout, and customer-controlled deployment.

  2. Rapid Response Content: Deployed through Channel Files, interpreted by the sensor’s Content Interpreter and Detection Engine. While newly released Template Types are stress tested, an error in the Content Validator allowed problematic content to pass undetected.

Specifics of the Incident

  • Trigger: Deployment of two new IPC (Inter-Process Communication) Template Instances on July 19, 2024. A bug in the Content Validator allowed one of these instances, containing problematic content data, to pass validation.

  • Effect: Problematic content caused an out-of-bounds memory read, leading to an unhandled exception and subsequent system crash.

Preventative Measures 

To prevent similar incidents, CrowdStrike will:

  1. Enhance Testing: Introduce more comprehensive testing methods, including stress testing, fuzzing, and fault injection.

  2. Improve Content Validator: Add checks to prevent problematic content from being deployed.

  3. Strengthen Error Handling: Enhance the Content Interpreter's ability to handle exceptions gracefully (a conceptual sketch of this kind of defensive handling follows this list).

  4. Deployment Strategy: Implement staggered deployments for Rapid Response Content, starting with canary deployments and collecting feedback before broader rollout.

  5. Customer Control: Provide more control over update deployments and detailed release notes.

  6. Third-Party Reviews: Conduct independent security code reviews and quality process evaluations.
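To make points 2 and 3 a little more concrete, here is a minimal, purely conceptual sketch of bounds-checked content validation and graceful exception handling. The real Content Validator and Content Interpreter are proprietary kernel-mode components (and certainly not written in Python); every name and number below, from TemplateInstance to the expected field count, is hypothetical.

```python
# Conceptual sketch only: illustrates the kind of bounds checking and
# exception handling described in preventative measures 2 and 3.
# All names and numbers (TemplateInstance, EXPECTED_FIELD_COUNT, ...) are
# hypothetical; the real components are proprietary kernel-mode code.

from dataclasses import dataclass

EXPECTED_FIELD_COUNT = 21  # assumed contract between validator and interpreter


@dataclass
class TemplateInstance:
    fields: list[str]


def validate(instance: TemplateInstance) -> bool:
    """Reject content whose shape the interpreter cannot safely consume."""
    return len(instance.fields) == EXPECTED_FIELD_COUNT


def interpret(instance: TemplateInstance, field_index: int) -> str | None:
    """Read a field defensively instead of assuming the index is in bounds."""
    try:
        if field_index >= len(instance.fields):  # explicit bounds check
            raise IndexError(f"field {field_index} is out of bounds")
        return instance.fields[field_index]
    except (IndexError, ValueError) as exc:
        # Graceful degradation: log and skip the bad content rather than
        # letting an unhandled exception take the whole host down.
        print(f"rejecting template instance: {exc}")
        return None


if __name__ == "__main__":
    bad = TemplateInstance(fields=["..."] * 20)   # one field short
    assert not validate(bad)                      # the validator catches it first
    assert interpret(bad, 20) is None             # the interpreter survives anyway
```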

Next Steps

CrowdStrike will release a full Root Cause Analysis after completing the investigation, detailing the incident and the steps taken to mitigate future risks.

Lessons for the Industry
Now, let's reflect on some possible responses to these kinds of problems. In light of the CrowdStrike incident, it is crucial to consider and implement strategies that will prevent similar disruptions in our own systems. That means strengthening our software quality processes so issues are caught early and resolved quickly, combining comprehensive automated testing with human insight to identify problems before they reach users, and maintaining a robust, resilient infrastructure that contains the impact when something does slip through. Seeking external evaluations can also provide unbiased assessments and fresh perspectives, further improving our software's security and reliability. Taken together, these measures significantly improve software quality and guard against future incidents.

Create More Robust Continuous Delivery Pipelines, Including Heavy Alerting on Automated Test Failures

Continuous Delivery (CD) pipelines are the backbone of modern software deployment, allowing for frequent, reliable updates. Investing in robust CD pipelines is critical for maintaining software quality and ensuring seamless user experiences. By incorporating extensive alerting mechanisms for automated test failures, companies can catch and address issues early in the development cycle, reducing the risk of bugs reaching production.

  1. Early Detection and Resolution: Robust CD pipelines with alerting systems ensure that any test failure is immediately flagged, enabling developers to address issues promptly. This reduces the time between defect introduction and resolution, thereby maintaining high code quality.

  2. Automated Rollbacks: Advanced pipelines can automatically roll back changes that cause test failures, preventing problematic code from reaching end users and keeping the system stable and reliable (see the sketch after this list).

  3. Increased Developer Productivity: With automated alerts, developers can focus on coding rather than manually checking for errors. This leads to higher productivity and more time spent on innovation rather than maintenance.

  4. Improved Collaboration: Integrating alerts into communication tools fosters collaboration: when a failure is detected, the relevant team members are notified immediately, enabling a quick, coordinated response.
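As a rough illustration of points 1 and 2, here is a minimal sketch of a pipeline step that runs the test suite, alerts the team on failure, and triggers a rollback. The webhook endpoint, rollback script, and test command are placeholders; in practice this logic usually lives in your CI system's own configuration (GitHub Actions, Jenkins, GitLab CI, and so on).

```python
# Minimal sketch of a pipeline step that alerts on test failure and triggers
# a rollback. The webhook URL, rollback command, and test command are
# placeholders; adapt them to your own CI setup.

import subprocess
import sys

import requests  # assumed to be available in the pipeline image

ALERT_WEBHOOK = "https://hooks.example.com/ci-alerts"     # hypothetical endpoint
ROLLBACK_CMD = ["./deploy.sh", "--rollback", "previous"]  # hypothetical script


def run_tests() -> bool:
    """Run the automated suite; a non-zero exit code means failure."""
    result = subprocess.run(["pytest", "-q", "tests/"])
    return result.returncode == 0


def alert(message: str) -> None:
    """Push the failure into a chat channel so the team sees it immediately."""
    requests.post(ALERT_WEBHOOK, json={"text": message}, timeout=10)


def main() -> int:
    if run_tests():
        return 0
    alert("Automated tests failed on the latest build - rolling back.")
    subprocess.run(ROLLBACK_CMD, check=False)  # automated rollback (point 2)
    return 1


if __name__ == "__main__":
    sys.exit(main())
```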

Invest More in End-to-End Automated Testing

End-to-end (E2E) automated testing is essential for verifying that software works as intended across the entire application flow. Investing in this area ensures comprehensive coverage and higher confidence in the software’s performance and functionality.

  1. Holistic Coverage: E2E testing simulates real user scenarios, ensuring that all components of an application work together seamlessly. This type of testing can uncover integration issues that unit or functional tests might miss.

  2. Scalability: Automated E2E tests can be run frequently and at scale, making it easier to catch regressions and new issues in large and complex applications. This scalability is crucial for supporting continuous integration and delivery practices.

  3. Cost Efficiency: While the initial setup cost of E2E automated testing can be high, it saves money in the long run by reducing the need for extensive manual testing and by quickly identifying issues that could become costly bugs if left unchecked.

  4. Enhanced User Experience: By ensuring that the software behaves correctly across different use cases, E2E testing directly contributes to a better user experience. Satisfied users are more likely to remain loyal and recommend the software to others.
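As an example of what such a test can look like, here is a minimal E2E sketch using Playwright's Python API. The application URL, selectors, and flow are invented for illustration; a real suite would cover many user journeys and run on every build.

```python
# Minimal end-to-end sketch using Playwright's Python API
# (pip install playwright && playwright install).
# The URL and selectors below are hypothetical.

from playwright.sync_api import sync_playwright


def test_login_and_checkout_flow() -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Step through the flow the way a real user would.
        page.goto("https://shop.example.com/login")   # hypothetical application
        page.fill("#email", "qa-user@example.com")
        page.fill("#password", "not-a-real-password")
        page.click("button[type=submit]")

        page.goto("https://shop.example.com/cart")
        page.click("text=Checkout")

        # Assert on the outcome the user actually cares about.
        assert page.locator(".order-confirmation").is_visible()

        browser.close()
```

Note that the assertion targets what the user sees (an order confirmation), not internal implementation details, which is what keeps E2E tests meaningful even as the code underneath is refactored.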

Invest in Testing on Real Devices, or at Least a Representative Subset of Devices

Testing on real devices is crucial for understanding how software behaves in real-world conditions. Emulated environments can miss subtle differences that only become apparent on actual hardware.

  1. Real-World Accuracy: Real device testing provides the most accurate representation of how an application will perform in users’ hands, capturing device-specific issues that emulators might miss.

  2. Diverse Coverage: Testing on a representative subset of devices ensures that the software works across different hardware configurations, operating systems, and screen sizes, offering a more inclusive user experience.

  3. Performance Metrics: Real devices provide accurate performance metrics, which are crucial for identifying issues with speed, responsiveness, and resource usage that can significantly impact user satisfaction.

  4. Reliability and Trust: Demonstrating a commitment to thorough testing on real devices builds trust with users and stakeholders, showing that the company prioritizes delivering high-quality software.
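One practical way to get this coverage is to run the same functional check across a small, representative device matrix exposed through a remote Selenium Grid or device cloud. The sketch below assumes such an endpoint and a hypothetical device list; real mobile testing would typically go through Appium or a provider's SDK, so treat this purely as an outline of the pattern.

```python
# Sketch of running the same check across a small, representative device
# matrix on a device cloud exposed as a remote Selenium Grid. The hub URL,
# device names, and capabilities are placeholders for whatever provider or
# in-house device farm you actually use.

import pytest
from selenium import webdriver

HUB_URL = "https://devices.example.com/wd/hub"  # hypothetical grid endpoint

DEVICE_MATRIX = [
    {"deviceName": "Pixel 7", "platformName": "Android", "platformVersion": "14"},
    {"deviceName": "Galaxy S21", "platformName": "Android", "platformVersion": "13"},
    {"deviceName": "iPhone 14", "platformName": "iOS", "platformVersion": "17"},
]


@pytest.mark.parametrize("device", DEVICE_MATRIX, ids=lambda d: d["deviceName"])
def test_homepage_loads_on_real_device(device: dict) -> None:
    options = webdriver.ChromeOptions()
    for key, value in device.items():
        options.set_capability(key, value)  # pass the device selection to the grid

    driver = webdriver.Remote(command_executor=HUB_URL, options=options)
    try:
        driver.get("https://www.example.com")
        assert "Example" in driver.title  # the same functional check on every device
    finally:
        driver.quit()
```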

Incorporate Human Testers for Intuitive Contributions

While automation is critical, human testers bring intuition, creativity, and a user-centric perspective that machines cannot replicate. Including humans in the testing process is essential for uncovering nuanced issues and enhancing software quality.

  1. Intuitive Insights: Human testers can identify usability issues, ambiguous user flows, and other intuitive problems that automated tests might overlook. Their feedback is crucial for refining the user experience.

  2. Exploratory Testing: Humans excel at exploratory testing, where they interact with the software in unscripted ways, often uncovering edge cases and unexpected behavior that scripted tests might miss.

  3. Contextual Understanding: Human testers understand the context in which the software will be used, allowing them to test scenarios that reflect real user environments and behaviors.

  4. Adaptive Learning: Unlike automated tests, human testers can adapt their testing strategies based on new information, making them invaluable for identifying and addressing emerging issues dynamically.

Get Third-Party, Independent Help to Ensure Different Sets of Eyes Are Looking at the Problem

Engaging third-party experts for independent reviews brings fresh perspectives and additional expertise, enhancing the overall quality and security of the software.

  1. Unbiased Evaluation: Third-party reviewers provide an unbiased assessment of the software, identifying issues that internal teams might overlook due to familiarity or cognitive biases.

  2. Diverse Expertise: External experts bring diverse skills and knowledge, often including specialized testing techniques and tools that can uncover hidden issues and vulnerabilities.

  3. Enhanced Credibility: Third-party validation adds credibility to the software’s quality and security claims, which can be particularly important for regulatory compliance and building customer trust.

  4. Continuous Improvement: Independent reviews can offer new insights and recommendations for improving internal processes, leading to ongoing enhancements in software development and testing practices.

By investing in these areas, you can significantly enhance your software quality processes, leading to more reliable, secure, and user-friendly applications. This, in turn, helps prevent catastrophic failures like this one and drives customer satisfaction, loyalty, and ultimately business success.

When Greed Trumps Quality: Part-2, Post Office
https://testinium.com/blog/when-greed-trumps-quality-part-2-post-office/ (21 May 2024)

Hello again, everyone. In my previous article, we explored how compromising on quality can lead to serious problems, using Boeing as the example. In this follow-up piece, we will again look at a critical incident. This time the example is particularly striking because it involves not a single event but a quality failure spanning nearly twenty years. After two decades of development, testing, deployment, and real-world use, the Post Office, one of the UK's most respected institutions, had to partially acknowledge its mistakes. Consequently, it faced legal battles and had to pay significant compensation to the subpostmasters who suffered from these errors.

For those unaware of the incident, here's a brief summary: Fujitsu agreed with the Post Office to implement a software system called Horizon. This software, used across Post Office branches for daily operations, unfortunately contained many quality flaws that led to accounting errors. These errors were either overlooked or not properly resolved when detected. What happened to the subpostmasters using the software? They faced unjust prosecutions and demands for repayment; some lost their jobs, their savings, or even their lives, causing a significant uproar across the country.

Let’s examine what happened in this situation chronologically together.

  • 1999: The Horizon IT system begins rollout in UK Post Office branches.

  • 2000: Alan Bates reports issues with the Horizon system.

  • 2003: Bates’ contract is terminated after disputing liability for account shortfalls in Llandudno, North Wales.

  • 2004: Lee Castleton faces a £25,000 shortfall in Bridlington, East Yorkshire, and is bankrupt after losing a legal battle with the Post Office.

  • 2009: “Computer Weekly” exposes the subpostmasters’ fight for justice; the Justice for Subpostmasters Alliance forms.

  • 2010: Seema Misra, a pregnant subpostmaster in West Byfleet, Surrey, is jailed over a £74,000 theft accusation.

  • 2015: Post Office chief executive Paula Vennells tells a parliamentary business committee there have been no wrongful convictions; the Post Office stops prosecuting subpostmasters.

  • 2017: 555 subpostmasters initiate legal action against the Post Office.

  • 2019: The High Court finds Horizon software flawed, contributing to account shortfalls. The Post Office agrees to a £58 million settlement with the 555 subpostmasters; Vennells is awarded a CBE.

  • 2020: The Post Office does not contest 44 subpostmasters’ appeals.

  • 2021: An inquiry into Horizon’s failings begins; 39 crown court convictions are overturned.

  • 2023: The government offers £600,000 in compensation to each wrongly convicted subpostmaster.

The Post Office scandal, primarily involving the flawed Horizon IT system, showcases significant errors in software development, management, and oversight that led to wrongful convictions and severe personal consequences for many subpostmasters. Here are the key mistakes related to software quality that facilitated this scandal:

  1. Inadequate Testing and Quality Assurance: The Horizon system was rolled out without thorough testing to ensure its reliability and accuracy in processing transactions. This lack of rigorous testing led to undetected bugs and errors that caused discrepancies in accounting data.

  2. Ignoring User Reports and Feedback: Subpostmasters began reporting issues soon after the system’s deployment. However, these reports were largely dismissed by the Post Office and Fujitsu, the system developer. The failure to address and investigate these reports allowed the software problems to persist and escalate.

  3. Deficient Error Handling and System Monitoring: The system lacked robust mechanisms for detecting and correcting errors, and there was inadequate monitoring to identify and address issues promptly, which is crucial in a system handling financial transactions (a minimal sketch of this kind of reconciliation check follows this list).

  4. Lack of Transparency and Accountability: The Post Office did not provide clear and accessible channels for users to report problems, nor was there transparency in how reported issues were handled. This opacity prevented stakeholders from understanding the scope and impact of the problems.

  5. Insufficient User Training and Support: Subpostmasters received inadequate training on the new system. This lack of support, coupled with the system’s complexity, made it difficult for users to identify whether issues were due to user errors or system flaws.

  6. Resistance to External Scrutiny: The Post Office resisted external scrutiny and maintained a defensive stance even when presented with evidence of the system’s failures. This resistance to external review and the initial refusal to halt prosecutions contributed to the perpetuation of the issue.
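As a purely illustrative aside on point 3 (and emphatically not a description of how Horizon actually worked), here is the kind of simple automated reconciliation check that flags a discrepancy for investigation instead of silently leaving the branch operator to carry the loss. All data shapes and thresholds below are invented.

```python
# Illustrative sketch only: an automated reconciliation check that raises a
# discrepancy for investigation rather than attributing it to the operator.
# Thresholds, data shapes, and branch identifiers are invented for the example.

from decimal import Decimal

DISCREPANCY_THRESHOLD = Decimal("1.00")  # flag anything over £1 (assumed policy)


def reconcile(declared_cash: Decimal, transactions: list[Decimal]) -> Decimal:
    """Compare the declared balance with the sum of recorded transactions."""
    expected = sum(transactions, Decimal("0"))
    return declared_cash - expected


def check_branch(branch_id: str, declared_cash: Decimal,
                 transactions: list[Decimal]) -> None:
    discrepancy = reconcile(declared_cash, transactions)
    if abs(discrepancy) > DISCREPANCY_THRESHOLD:
        # Open an incident and audit the system and its data,
        # rather than assuming the subpostmaster is at fault.
        print(f"[ALERT] branch {branch_id}: discrepancy of £{discrepancy}, "
              f"investigate the transaction log before any recovery action")


if __name__ == "__main__":
    check_branch("BR-0421", Decimal("1024.50"),
                 [Decimal("500.00"), Decimal("500.00")])
```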

Sounds incredible, doesn’t it? As you read from top to bottom, it becomes apparent not only how software errors can be overlooked, but also how, when these errors are discovered or the system fails, managers and accountable individuals might engage in various behaviors to protect their own interests.

At this point, it becomes clear that quality is something much greater than just testing. It is crucial for an institution, a project, a team, indeed every product, to be easily traceable, manageable, and most importantly, accountable.

The fundamental goal in all our software processes is to produce not just what is required, but software that generates real value and benefit. From this perspective, seeing quality merely as a test, a product output, or the result of a single phase is a significant error. Raising quality, and ensuring that not only the final product but the whole process reaches the highest standard, is something professionals like us, indeed the whole team, should aspire to. From the beginning of the process, through ideation, analysis and requirements definition, development, and later monitoring in the real environment where the product is used, we affect not just our product and the profit we earn, but also the lives of the countless people our products touch. With this responsibility in mind, I wish you many successful endeavors as you invest wisely in quality and quality processes, benefiting not only yourselves but also your customers and humanity at large.

For further reading on the Post Office scandal and the timeline of events, see here and here.

When Greed Trumps Quality: Part-1, Boeing
https://testinium.com/blog/when-greed-trumps-quality-part-1-boeing/ (13 May 2024)

Hello everyone! Today, let’s delve into the pivotal Boeing incident: a cautionary tale where the repercussions of neglecting quality echo loud and clear. This will be a brilliant testament to the quote: “Quality isn’t just a checkbox; it’s the foundation of trust and reputation.”

Boeing has dominated recent discussions, serving as a stark reminder of the dire consequences when quality takes a back seat.

I contemplated starting this article by exploring the shifts in the human psyche over the past century. The rise of social media has fuelled a relentless pursuit of popularity and wealth; people increasingly measure their worth by how much they are liked and how much wealth they accumulate. These societal dynamics even subtly affect our software quality processes. However, I'll save those deeper topics for another day. Instead, let's focus on the Boeing saga, an eye-opening narrative that underscores the importance of prioritising quality.

Firstly, for those who are not familiar with the Boeing incident, let me briefly summarise it. Boeing is one of the two companies that dominate commercial aircraft production. The company, particularly with its 737 MAX model, has experienced significant quality issues. These problems, in turn, brought to light quality issues with other models and cast a shadow over the company's reputation. As Boeing grapples with the fallout, let's take a brief journey through the timeline of these tumultuous events.

  • October 29, 2018: Lion Air Flight 610 crashes after takeoff from Jakarta, killing 189.

  • January 30, 2019: Boeing announces record earnings, surpassing $100 billion.

  • March 10, 2019: Ethiopian Airlines Flight 302 crashes after takeoff from Addis Ababa, killing 157.

  • March 11-15, 2019: China and subsequently other countries, including the US, ground the 737 Max.

  • April 4, 2019: Boeing admits the MCAS contributed to crashes.

  • July 24, 2019: Boeing reports a $3.7 billion quarterly loss.

  • December 20-23, 2019: Boeing’s Starliner fails to reach the ISS; CEO Dennis Muilenburg is fired.

  • January 2020: Boeing halts 737 Max production; internal communications reveal safety doubts.

  • March 4, 2020: United and JetBlue cut flights as COVID-19 impacts air travel.

  • May 27, 2020: Boeing announces layoffs of 7,000 workers.

  • August 28, 2020: FAA briefly grounds eight 787 Dreamliners over manufacturing concerns.

  • November 18, 2020: FAA ends the 20-month grounding of the 737 Max.

  • 2021-2024: Ongoing issues include tequila bottles found in Air Force One jets, manufacturing shortcuts, and additional FAA audits and fines.

So, what happened? The causes of these events, and the mistakes or missteps that led to them, are open to much interpretation. There are many quality shortcuts I could mention, but to summarise, let's discuss some of the main causes that both the company itself and outside investigators have focused on.

  1. MCAS Design: The Manoeuvring Characteristics Augmentation System (MCAS) was implicated in two fatal crashes due to its reliance on a single sensor and lack of adequate pilot training and disclosure.

  2. Internal Communications: Leaked internal communications revealed that employees had doubts about the 737 Max’s safety, describing inadequate oversight in its design and supervision.

  3. Self-Regulation and Conflict of Interest: The FAA’s decision to delegate airplane certification to Boeing employees represents a significant conflict of interest, undermining quality control. This self-regulation allowed Boeing to potentially prioritise speed over thorough safety checks, compromising the integrity of their quality assurance processes.

  4. Manufacturing Shortcuts: Reports surfaced of non-standard manufacturing processes and other shortcuts that affected quality and safety, leading to additional FAA audits and investigations. Many of these manufacturing issues stemmed from inadequate quality control measures with subcontractors, emphasising speed and delivery over quality and safety. 

  5. Regulatory Oversight: The FAA’s delayed response in grounding the 737 Max raised questions about Boeing’s influence over the certification and oversight process.

How ironic for a company associated with the slogan 'If it ain't Boeing, I ain't going'! Now, let's look together at how this negative example relates to our actual topic: software quality.

When we look at the list, we can see topics that are not unfamiliar in the software industry either: deploying features that are not ready because of 'we must launch immediately, we must deliver immediately' pressure; not providing adequate training on those features; treating quality specialists as unnecessary and excluding them from the process; and even leaning on slogans common in agile software development, such as 'everyone should be a developer', to imply there is no need for a dedicated tester. Self-regulation, and the belief that the team needs no external oversight or oversight from another role, should sound very familiar too. We could talk about these for hours and elaborate at length, but the message I want to convey is probably clear by now. If we see ourselves as unrivalled, and we believe that no mistake we make will harm us or cost us any reputation among our users, then, as Gerald Weinberg also says in his books, we don't need to test. You might choose not to test; you might not care about quality. But like any real-world company that does carry such concerns, you can prevent these kinds of disasters by addressing quality from start to finish with real experts and by involving quality specialists at every point of software development.

Just because there has never been a fire in your house doesn't mean you don't need a smoke detector; what you stand to lose when a fire does occur is immense. Like these companies, taking shortcuts and prioritising money, and especially greed, will harm your prestige and your company's standing, and will certainly be detrimental to you and your company in the medium and long term. So let's put these quality processes in place together, so that you, your employees, and your customers can wholeheartedly say, 'Yes, this product is ours and we are proud of it'. Sending you all my love and respect. See you in my next post.

“Quality isn’t just a checkbox; it’s the foundation of trust and reputation.”

For a more lighthearted take on things, please watch this.

And for a great, detailed review of the overall situation, see here and here.
