Analysis: It’s Not Just CrowdStrike, It’s Microsoft’s Architecture

The recent global IT outage highlights a deeper issue within Microsoft’s architecture: Windows still lets third-party software run as kernel-level drivers, where a single fault can crash the whole machine. Apple, by contrast, has deprecated third-party kernel extensions on macOS and steers vendors towards user-space system extensions, because kernel-level code is prone to causing instability and security vulnerabilities. Microsoft’s legacy approach contributes to systemic risk.

The Need for Change

  • Microsoft’s Architecture: The use of kernel-level drivers in Windows is a legacy decision that now poses significant risks. Modernising this aspect of the operating system could reduce such incidents.
  • CrowdStrike’s Process: The immediate fault lies in CrowdStrike’s update, so the company must reassess its update and deployment processes to prevent such widespread disruptions.

The Importance of Staged Rollouts

One practice CrowdStrike needs to adopt is the staged rollout. Updates are deployed gradually, starting with a small percentage of users (e.g., 1%) and expanding to larger groups (e.g., 10%) only once the earlier phases have proved stable. Here’s why it’s essential:

  • Risk Mitigation: By rolling out updates to a small subset of users first, companies can identify and fix issues before they affect the entire user base.
  • Feedback Loop: Early adopters provide valuable feedback that can be used to make necessary adjustments, ensuring a more stable and reliable update for everyone.
  • Minimising Impact: Should an issue arise, only a small percentage of users are affected, reducing the overall impact and allowing for quicker resolution.
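
As a rough illustration of the idea, here is a minimal Python sketch of a staged rollout gate. It is not CrowdStrike’s actual process: the stage percentages, the 0.1% failure threshold, and the names STAGES, bucket, is_eligible and next_stage are all assumptions made for the example. Each host is hashed into a stable bucket from 0 to 99, so a host that receives the update at the 1% stage stays included at every later stage.

```python
import hashlib

# Hypothetical cumulative rollout stages, as percentages of all hosts.
STAGES = [1, 10, 50, 100]


def bucket(host_id: str) -> int:
    """Map a host ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_eligible(host_id: str, stage_percent: int) -> bool:
    """A host receives the update once its bucket is below the stage percentage."""
    return bucket(host_id) < stage_percent


def next_stage(current_percent: int, failures: int, deployed: int,
               max_failure_rate: float = 0.001) -> int:
    """Return the next stage percentage, or 0 to halt the rollout.

    The failure threshold is an assumed value, not a recommendation.
    """
    if deployed == 0:
        return current_percent  # no data yet, so stay at the current stage
    if failures / deployed > max_failure_rate:
        return 0  # too many failures: stop before more hosts are affected
    later = [s for s in STAGES if s > current_percent]
    return later[0] if later else current_percent
```

In use, the deployment service would call is_eligible for each host at the current stage and call next_stage once enough health reports have come in. Returning 0 halts the rollout, so a bad update never reaches the remaining hosts.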

Implementing a Fallback Configuration

CrowdStrike should also consider implementing a robust fallback configuration mechanism to further safeguard against failures:

  • Starting Indicator: Store a starting indicator with the configuration version whenever the system attempts to boot with a new update.
  • Good Boot Indicator: Store a good boot indicator upon successful boot completion.
  • Fallback Logic: If multiple starting indicators are detected without corresponding good boot indicators, the system should automatically revert to the last known good configuration during the next startup.

This approach ensures that even if an update causes boot failures, the system can quickly recover by reverting to a stable configuration, minimising downtime and disruption.
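
The indicator scheme above can be sketched in a few lines of Python. This is an illustrative model only, not how any real sensor or driver stores its state: the BootGuard class, the JSON state file, and the threshold of two failed attempts are assumptions made for the example. A real implementation would also need to flush the starting indicator to disk before loading the new configuration.

```python
import json
from pathlib import Path

# Assumed threshold: fall back after this many boots without a good-boot indicator.
MAX_FAILED_BOOTS = 2


class BootGuard:
    """Tracks starting and good-boot indicators in a small JSON state file."""

    def __init__(self, state_path: Path):
        self.state_path = state_path
        self.state = {"last_good": None, "pending": None, "attempts": 0}
        if state_path.exists():
            self.state = json.loads(state_path.read_text())

    def _save(self) -> None:
        self.state_path.write_text(json.dumps(self.state))

    def select_config(self, new_version: str) -> str:
        """Called early in boot: record a starting indicator and pick a version."""
        if self.state["pending"] != new_version:
            # First attempt with this version: reset the attempt counter.
            self.state["pending"] = new_version
            self.state["attempts"] = 0
        if self.state["attempts"] >= MAX_FAILED_BOOTS and self.state["last_good"]:
            # Repeated starts without a good boot: revert to the last known good version.
            return self.state["last_good"]
        self.state["attempts"] += 1  # the starting indicator
        self._save()
        return new_version

    def mark_good(self, version: str) -> None:
        """Called once boot has completed: record the good-boot indicator."""
        self.state["last_good"] = version
        self.state["pending"] = None
        self.state["attempts"] = 0
        self._save()
```

With this sketch, a version that crashes twice before mark_good is reached is skipped on the third boot, and the system starts with the last version that booted successfully.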

Conclusion

Both Microsoft and CrowdStrike have lessons to learn from this incident. Microsoft needs to rethink its reliance on kernel-level drivers, and CrowdStrike must enhance its update processes by adopting staged rollouts and implementing a fallback configuration mechanism to safeguard against future issues.