Justifying spend on quality engineering and software testing is a constant struggle for technology-first organizations. The risks are significant, but the value of testing is often misunderstood, and quality work is routinely prioritized below producing more code or features. As leaders and veterans in the software quality world, we see this firsthand every day, and we know that not everything can or should be tested. We also know that quality assurance is not infallible, and defects will escape. So, what is an IT organization to do? We have long sought the right balance between risk and spend when it comes to quality. It is why QA Consultants has invested so much in building processes and helping our customers focus on engineering higher quality software.
On July 19, 2024, the massive CrowdStrike defect caused a system crash that affected over 8 million computers running Windows operating systems and their software (TechRadar, 2024), resulting in major complications throughout healthcare, banking, and travel systems, among other organizations, at an estimated financial impact of $10 billion (ISG, 2024). Investigations are continuing, but it is understood that the incident occurred due to a defect in an update specific to integrations. Further investigation is needed to confirm whether system integration testing was adequately completed before the release was automatically delivered to customers worldwide (ISG, 2024). It is highly possible that the testing that was done matched what had previously been determined to be the right balance between risk and cost. In fact, that balance always works well…until it doesn’t (as in this case). Hindsight is always 20/20, and the many after-action reports on this incident may point to known and unknown decisions, incompetence, lack of clarity, and more. Questions of culpability and accountability will loom large, as there are many cooks in this “kitchen.”
What can we learn from this situation? The loss incurred in a catastrophic incident far outweighs the time, effort, and money spent on quality assurance. But this becomes clear only after the incident has taken place. Prior to the incident, the lack of spend on this particular part of QA might be heralded as fiscal responsibility and efficient software development practice. It takes an incident to highlight the impact of those trade-offs. As part of ALTEN Technology, QA Consultants plays a significant role in helping customer organizations reduce the impact of integration issues, including the kind of risk versus reward prioritization decisions at play in the CrowdStrike situation.
Here are five ways organizations can better position themselves for the future, plus a bonus tip:
1. Implement Comprehensive Risk-Based Testing Processes:
Prioritize testing efforts based on the potential impact of specific features or changes. This approach involves mapping the critical areas of the system that would be most affected by updates and focusing more intensive testing on these high-risk areas. Risk-based testing helps ensure that the most crucial parts of the system are thoroughly validated, reducing the likelihood of critical failures.
- Example: For critical security updates, conduct an in-depth risk assessment to identify key components that interact with the update. Perform extensive tests on these components, including regression tests, stress tests, and integration tests, to ensure their stability and security. Less critical areas can receive lighter testing to optimize resources and time.
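The prioritization described above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the component names, the 1 to 5 likelihood and impact scales, and the threshold of 15 are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    likelihood: int  # 1 (rarely fails) to 5 (fails often); hypothetical scale
    impact: int      # 1 (minor) to 5 (outage-level); hypothetical scale

    @property
    def risk(self) -> int:
        # Simple risk score: likelihood multiplied by impact.
        return self.likelihood * self.impact


def plan_tests(components, high_risk_threshold=15):
    """Return (name, risk, suites) tuples, highest risk first."""
    plan = []
    for c in sorted(components, key=lambda c: c.risk, reverse=True):
        if c.risk >= high_risk_threshold:
            # High-risk areas get the intensive suites named in the example.
            suites = ["regression", "stress", "integration"]
        else:
            # Lower-risk areas get lighter testing to save time and resources.
            suites = ["smoke"]
        plan.append((c.name, c.risk, suites))
    return plan


# Hypothetical components touched by an update.
components = [
    Component("kernel_driver", likelihood=3, impact=5),
    Component("settings_ui", likelihood=2, impact=2),
    Component("update_channel", likelihood=4, impact=5),
]
plan = plan_tests(components)
for name, risk, suites in plan:
    print(f"{name}: risk={risk}, suites={suites}")
```

In practice the scores would come from a risk assessment workshop or defect history rather than hard-coded numbers.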
2. Adopt a Staggered Deployment Strategy Based on Impact:
Roll out updates in phases, starting with regions or customers with lower impact and gradually progressing to larger, more critical customers only after ensuring stability.
- Example: Deploy updates first to internal systems or smaller customers with less critical operations. Monitor the performance and stability of these initial deployments closely. Once the update is confirmed to be stable, proceed to roll out updates to larger customers or those whose business operations are more critical and would cause a greater impact if issues arise. This approach helps minimize potential disruptions and ensures that any problems are identified and resolved before affecting major customers.
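As a rough sketch of the phased rollout described above, the loop below deploys to one group at a time and stops at the first group that fails its health check. The ring names and the `deploy` and `is_healthy` callbacks are hypothetical stand-ins for whatever deployment and monitoring tooling an organization actually uses.

```python
def staged_rollout(rings, deploy, is_healthy):
    """Deploy ring by ring; stop at the first unhealthy ring.

    Returns (completed_rings, halted_ring), where halted_ring is None
    if every ring was deployed and passed its health check.
    """
    completed = []
    for ring in rings:
        deploy(ring)
        if not is_healthy(ring):
            # Later, more critical rings never receive the update.
            return completed, ring
        completed.append(ring)
    return completed, None


# Lowest-impact groups first, most critical customers last.
rings = ["internal", "small_customers", "large_customers", "critical_customers"]
deployed = []
completed, halted = staged_rollout(
    rings,
    deploy=deployed.append,
    is_healthy=lambda ring: ring != "small_customers",  # simulated failure
)
print("completed:", completed, "halted at:", halted)
```

In this simulated run the problem surfaces in the second ring, so the large and critical customer rings are never touched.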
3. Enhance Monitoring and Feedback Mechanisms with Automated Rollback and Update Pausing:
Implement robust monitoring and logging systems to provide real-time feedback on updates and integrate automated rollback procedures. Additionally, ensure that updates can be paused in other regions until further investigation is completed if issues are detected.
- Example: Use tools that provide real-time alerts and dashboards to track the health of systems receiving updates. If anomalies or issues are detected, the automated rollback system should immediately revert the affected systems to the previous stable version. Simultaneously, pause the rollout of updates to other regions to prevent further impact until a thorough investigation is conducted and the issue is resolved. This approach ensures quick identification and mitigation of issues and keeps the update distribution process under control.
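The rollback and pause behavior described above might look like the following sketch. The `RolloutController` class, the region names, and the 1% crash-rate threshold are assumptions made for illustration; a real system would take its health signals from monitoring and alerting tools.

```python
from dataclasses import dataclass, field


@dataclass
class RolloutController:
    crash_rate_threshold: float = 0.01  # hypothetical: more than 1% of hosts crashing
    paused: bool = False
    # region -> (current_version, previous_version)
    versions: dict = field(default_factory=dict)

    def deploy(self, region, new_version):
        """Deploy unless the rollout is paused; returns True if deployed."""
        if self.paused:
            return False
        current, _ = self.versions.get(region, (None, None))
        self.versions[region] = (new_version, current)
        return True

    def report_health(self, region, crashed_hosts, total_hosts):
        """Roll back the region and pause all rollouts if crashes exceed the threshold."""
        if total_hosts and crashed_hosts / total_hosts > self.crash_rate_threshold:
            _, previous = self.versions[region]
            self.versions[region] = (previous, None)  # revert to last stable version
            self.paused = True  # hold other regions until investigated


ctl = RolloutController()
ctl.deploy("us-east", "1.0")
ctl.deploy("eu-west", "1.0")
ctl.deploy("us-east", "1.1")            # new update reaches the first region
ctl.report_health("us-east", 50, 1000)  # 5% crash rate triggers rollback
eu_deployed = ctl.deploy("eu-west", "1.1")
print(ctl.versions, "paused:", ctl.paused, "eu deployed:", eu_deployed)
```

Lifting the pause is deliberately left as a manual step here, since the text calls for an investigation before the rollout resumes.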
4. Strengthen Cross-Functional Collaboration with System Integration Focus:
In a heavily integrated environment, ensure that testing strategies include broad system integration testing, not just local validations. Collaboration between development, QA, and operations teams (DevOps practices) should be emphasized to understand and address the dependencies and interactions between different applications and the operating system. This approach helps identify potential issues that could arise from the integration points and ensure the stability of the entire system.
- Example: Conduct end-to-end integration testing that includes all components interacting with the update, such as middleware, databases, and third-party services. Regular cross-functional reviews and joint testing sessions should be held to align on integration impacts and validation criteria. This comprehensive testing strategy helps detect and resolve issues that could affect the broader system, ensuring seamless operation and reducing the risk of widespread failures.
5. Increase Customer Control and Enhance QA Diligence:
Provide customers with greater control over when updates are applied, allowing them to schedule updates during times of lower business impact. Additionally, encourage customers to implement their own QA strategies to validate changes within their IT landscape before updates are promoted to their production environment.
- Example: Develop features that allow customers to define specific dates and times for updates, ensuring minimal disruption to their operations. This can be particularly useful for businesses that experience peak operational periods. Additionally, customers should invest in their own QA strategy and perform their own testing and validation of updates in staging environments. This extra layer of diligence helps identify any potential issues within the context of their unique IT setup before the updates are deployed to production.
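One way to picture customer-defined update scheduling is a simple maintenance-window check, sketched below. The function name, the window times, and the weekday convention (Python's `weekday()`, where Monday is 0) are assumptions for this example and do not describe any particular vendor's feature.

```python
from datetime import datetime, time, timedelta


def in_maintenance_window(now, start, end, allowed_weekdays):
    """Return True if `now` falls inside the customer-defined window.

    Windows that cross midnight (start > end) are supported; the weekday
    check then applies to the day on which the window started.
    """
    t = now.time()
    if start <= end:
        return now.weekday() in allowed_weekdays and start <= t < end
    if t >= start:  # evening portion of a window that crosses midnight
        return now.weekday() in allowed_weekdays
    if t < end:     # early-morning portion, which belongs to the previous day
        return (now - timedelta(days=1)).weekday() in allowed_weekdays
    return False


# Hypothetical customer policy: Saturday nights, 23:00 to 03:00.
window = dict(start=time(23, 0), end=time(3, 0), allowed_weekdays={5})

sat_night = in_maintenance_window(datetime(2024, 7, 20, 23, 30), **window)
sun_early = in_maintenance_window(datetime(2024, 7, 21, 2, 0), **window)
sat_noon = in_maintenance_window(datetime(2024, 7, 20, 12, 0), **window)
print(sat_night, sun_early, sat_noon)
```

An update agent could call a check like this before applying a pending update and defer it otherwise.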
6. BONUS Tip: Always have a “production beta”:
Whether it is termed beta (waterfall) or canary (DevOps), always have a small production group that receives deployments for analysis before rolling out to the larger, global group. In fact, this is one measure that CrowdStrike has since recognized would have acted as an air brake to prevent the worldwide disaster from occurring.
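A small production group can be chosen in many ways. The sketch below uses one common approach, hashing a host identifier into a stable bucket so that the same hosts stay in the canary group for a given release. The host naming scheme, the salt value, and the 1% size are hypothetical.

```python
import hashlib


def in_canary(host_id: str, percent: float, salt: str = "release-1") -> bool:
    """Deterministically place roughly `percent`% of hosts in the canary group."""
    digest = hashlib.sha256(f"{salt}:{host_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # stable bucket in 0..9999
    return bucket < percent * 100


# Hypothetical fleet of 10,000 hosts with a 1% canary group.
hosts = [f"host-{i}" for i in range(10_000)]
canary = [h for h in hosts if in_canary(h, percent=1)]
print(f"{len(canary)} of {len(hosts)} hosts receive the update first")
```

Changing the salt per release rotates which hosts act as canaries, so the same customers are not always first in line.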
Proper preventative testing, when seamlessly integrated into the delivery process, can save significant time and money and prevent business losses. Proactive, focused testing is necessary to keep incidents like the CrowdStrike outage at bay and keep your business running smoothly. Quality assurance experts like QA Consultants play an essential role in successful integration testing from the start. Don’t wait until a global outage causes flight delays, missed healthcare appointments, or significant financial repercussions.