Understanding FailSafe
Master the fundamental concepts behind chaos engineering with FailSafe.
Experiment Lifecycle
Every FailSafe experiment progresses through four distinct phases. Understanding these phases is crucial for designing effective resilience tests.
In the baseline phase, the system operates normally while FailSafe collects performance metrics. These metrics establish the reference point against which behavior during fault injection is compared. This phase typically lasts 10-30 seconds.
- Response times are measured
- Error rates are recorded
- Resource utilization is tracked
In the injection phase, FailSafe actively injects faults into the system. In adaptive mode, intensity increases gradually based on system response. Metrics collection continues throughout.
- Intensity starts at the configured step value
- Adaptive mode adjusts intensity based on thresholds
- A manual stop is available at any time
In the recovery phase, all faults are removed and the system is monitored to confirm that it returns to normal operation. This phase validates the system's ability to recover gracefully.
- Fault injection stops immediately
- Metrics are compared against the baseline
- Recovery time is measured
In the final phase, the experiment has finished. FailSafe compiles the results, including resilience scores, failure points, and recommendations for improvement.
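The four phases form a simple linear state machine. The sketch below is illustrative only: the `Phase` names, `NEXT_PHASE`, and `advance` are hypothetical labels chosen for this example and are not part of FailSafe's API.

```python
from enum import Enum


class Phase(Enum):
    # Hypothetical labels for the four lifecycle phases described above.
    BASELINE = "baseline"
    INJECTION = "injection"
    RECOVERY = "recovery"
    COMPLETE = "complete"


# Each phase has exactly one successor; COMPLETE is terminal.
NEXT_PHASE = {
    Phase.BASELINE: Phase.INJECTION,
    Phase.INJECTION: Phase.RECOVERY,
    Phase.RECOVERY: Phase.COMPLETE,
}


def advance(phase: Phase) -> Phase:
    """Return the phase that follows `phase`, or raise if it is terminal."""
    if phase not in NEXT_PHASE:
        raise ValueError(f"{phase.name} is terminal; the experiment has finished")
    return NEXT_PHASE[phase]
```

Modeling the lifecycle this way makes it explicit that an experiment never skips recovery: a manual stop or a threshold breach during injection still moves to `RECOVERY` before `COMPLETE`.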
Fault Types
FailSafe supports fault types on two platforms: backend and frontend. Each fault type simulates a specific failure scenario.
| Fault Type | Platform | Description |
|---|---|---|
| cpu_stress | Backend | Consumes CPU cycles to simulate high load |
| memory_stress | Backend | Allocates memory to simulate memory pressure |
| kill | Backend | Terminates container processes |
| network_delay | Backend | Adds latency to network packets |
| packet_loss | Backend | Drops a percentage of network packets |
| latency | Frontend | Delays API responses |
| error | Frontend | Returns error responses |
| network | Frontend | Simulates network failures |
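The table can be mirrored as a small lookup, which is handy for validating an experiment definition before running it. This is a sketch under assumptions: `FAULT_PLATFORMS` and `faults_for` are hypothetical names for this example, and only the fault type and platform values come from the table above.

```python
from typing import List

# Fault type -> platform, copied from the table above.
FAULT_PLATFORMS = {
    "cpu_stress": "backend",
    "memory_stress": "backend",
    "kill": "backend",
    "network_delay": "backend",
    "packet_loss": "backend",
    "latency": "frontend",
    "error": "frontend",
    "network": "frontend",
}


def faults_for(platform: str) -> List[str]:
    """Return the fault types available on `platform`, sorted by name."""
    return sorted(
        name for name, target in FAULT_PLATFORMS.items() if target == platform.lower()
    )
```

For example, `faults_for("frontend")` returns `["error", "latency", "network"]`.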
Intensity Model
Intensity controls the severity of fault injection on a scale of 0-100. The interpretation varies by fault type.
Step Intensity
How much intensity increases per interval in adaptive mode. Default: 10
Max Intensity
Upper limit for intensity. Injection stops when this limit is reached. Default: 100
Current Intensity
Real-time intensity level during injection phase.
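To see how step and max intensity interact, the sketch below lists the levels an uninterrupted adaptive run would visit. `intensity_steps` is a hypothetical helper, and clamping the last step to the maximum (rather than overshooting it) is an assumption, not documented FailSafe behavior.

```python
from typing import Iterator


def intensity_steps(step: int = 10, max_intensity: int = 100) -> Iterator[int]:
    """Yield successive intensity levels, starting at `step` and ending at `max_intensity`."""
    if step <= 0:
        raise ValueError("step must be positive")
    current = step
    while current < max_intensity:
        yield current
        current += step
    # Assumption: the final level is clamped to the configured maximum.
    yield max_intensity
```

With the defaults, `list(intensity_steps())` gives `[10, 20, 30, 40, 50, 60, 70, 80, 90, 100]`.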
CPU/Memory Stress
Percentage of resources to consume (e.g., 50 = 50% CPU)
Latency
Milliseconds of delay (e.g., 100 = 100ms added latency)
Packet Loss/Errors
Percentage of affected requests (e.g., 30 = 30% error rate)
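The per-fault interpretation can be sketched as a single function. `describe_intensity` is a hypothetical helper. Treating `network_delay` like `latency` (milliseconds) is an assumption based on its description, and `kill` and `network` are rejected because the text above does not say how intensity applies to them.

```python
def describe_intensity(fault_type: str, intensity: int) -> str:
    """Translate a 0-100 intensity value into its meaning for `fault_type`."""
    if not 0 <= intensity <= 100:
        raise ValueError("intensity must be between 0 and 100")
    if fault_type in ("cpu_stress", "memory_stress"):
        resource = "CPU" if fault_type == "cpu_stress" else "memory"
        return f"consume {intensity}% of {resource}"
    if fault_type in ("latency", "network_delay"):
        # Assumption: network_delay uses the same millisecond scale as latency.
        return f"add {intensity} ms of delay"
    if fault_type in ("packet_loss", "error"):
        return f"affect {intensity}% of requests"
    raise ValueError(f"no documented intensity interpretation for {fault_type!r}")
```

For example, `describe_intensity("cpu_stress", 50)` returns `"consume 50% of CPU"`.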
Adaptive Testing
Adaptive mode automatically adjusts fault intensity based on system response to find the intensity at which your system begins to degrade.
When enabled, FailSafe monitors key metrics during injection and applies the following rules:
- Start Low: Begin at the configured step intensity
- Monitor: Track response times, error rates, and throughput
- Increase: If metrics stay within thresholds, increase intensity by step value
- Hold: If metrics approach thresholds, maintain current intensity
- Stop: If metrics exceed thresholds or max intensity is reached, begin recovery
Default Thresholds
- Response time: >2x baseline triggers hold
- Error rate: >5% triggers hold, >20% triggers stop
- Throughput: <50% baseline triggers stop
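The adaptive rules and default thresholds can be combined into one decision function. This is a minimal sketch, not FailSafe's implementation: `Metrics`, `decide`, and the field names are hypothetical, error rate is expressed as a fraction, and checking stop conditions before hold conditions is an assumption about precedence.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Metrics:
    response_time_ms: float
    error_rate: float      # fraction of failed requests, 0.0-1.0
    throughput_rps: float  # requests per second


def decide(baseline: Metrics, current: Metrics, intensity: int,
           step: int = 10, max_intensity: int = 100) -> Tuple[str, int]:
    """Return (action, next_intensity) using the default thresholds above."""
    # Stop: error rate above 20% or throughput below 50% of baseline.
    if current.error_rate > 0.20 or current.throughput_rps < 0.5 * baseline.throughput_rps:
        return "stop", intensity
    # Stop: the configured maximum intensity has been reached.
    if intensity >= max_intensity:
        return "stop", intensity
    # Hold: response time above 2x baseline or error rate above 5%.
    if current.response_time_ms > 2 * baseline.response_time_ms or current.error_rate > 0.05:
        return "hold", intensity
    # Otherwise increase by one step, never exceeding the maximum (assumed clamp).
    return "increase", min(intensity + step, max_intensity)
```

A caller would invoke `decide` once per interval during injection and begin recovery as soon as it returns `"stop"`.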