Chaos Engineering & Resilience Testing

<default to action
When testing system resilience or injecting failures:
1. DEFINE steady state (normal metrics: error rate, latency, throughput)
2. HYPOTHESIZE that the system continues in steady state during failure
3. INJECT real-world failures (network, instance, disk, CPU)
4. OBSERVE and measure deviation from steady state
5. FIX weaknesses discovered, document runbooks, repeat

Quick Chaos Steps:
- Start small: Dev → Staging → 1% prod → gradual rollout
- Define clear rollback triggers (error rate above 5%)
- Measure blast radius, never exceed planned scope
- Docum…
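
The steady-state comparison and rollback trigger described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `Metrics` class, the `rollback_reasons` function, and the latency and throughput factors are all hypothetical. Only the 5% error-rate figure comes from the text, and treating it as an upper bound is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    """One observation of the system (field names are illustrative)."""
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float   # 99th percentile latency in milliseconds
    throughput_rps: float   # requests per second


# Only the 5% error-rate figure comes from the text; the rest are assumed.
ROLLBACK_ERROR_RATE = 0.05
MAX_LATENCY_FACTOR = 2.0     # assumed: roll back if p99 latency doubles
MIN_THROUGHPUT_FACTOR = 0.5  # assumed: roll back if throughput halves


def rollback_reasons(baseline: Metrics, observed: Metrics) -> list[str]:
    """Return the triggers that fired; an empty list means keep going."""
    reasons = []
    if observed.error_rate > ROLLBACK_ERROR_RATE:
        reasons.append("error_rate")
    if observed.p99_latency_ms > baseline.p99_latency_ms * MAX_LATENCY_FACTOR:
        reasons.append("latency")
    if observed.throughput_rps < baseline.throughput_rps * MIN_THROUGHPUT_FACTOR:
        reasons.append("throughput")
    return reasons


# Steady state measured before the experiment, then two samples during it.
baseline = Metrics(error_rate=0.01, p99_latency_ms=200.0, throughput_rps=1000.0)
healthy = Metrics(error_rate=0.02, p99_latency_ms=250.0, throughput_rps=950.0)
degraded = Metrics(error_rate=0.08, p99_latency_ms=450.0, throughput_rps=900.0)

print(rollback_reasons(baseline, healthy))   # []
print(rollback_reasons(baseline, degraded))  # ['error_rate', 'latency']
```

Returning the list of fired triggers, rather than a bare boolean, keeps a record of why an experiment was aborted, which can feed the runbook written in step 5.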