
Understanding FailSafe

Master the fundamental concepts behind chaos engineering with FailSafe.

Experiment Lifecycle

Every FailSafe experiment progresses through four distinct phases. Understanding these phases is crucial for designing effective resilience tests.

Baseline Collection (status: baseline)

The system operates normally while FailSafe collects performance metrics. This establishes a reference point for comparing behavior during fault injection. Duration is typically 10-30 seconds.

  • Response times are measured
  • Error rates are recorded
  • Resource utilization is tracked
Fault Injection (status: injecting)

FailSafe injects faults into the system. In adaptive mode, intensity increases step by step for as long as metrics stay within thresholds. Metrics collection continues throughout this phase.

  • Intensity starts at the configured step value
  • Adaptive mode adjusts intensity based on thresholds
  • A manual stop is available at any time
Recovery Period (status: recovering)

All faults are removed and the system is monitored to ensure it returns to normal operation. This phase validates the system's ability to recover gracefully.

  • Fault injection stops immediately
  • Metrics are compared against the baseline
  • Recovery time is measured
Completion (status: completed)

The experiment has finished. FailSafe compiles the results, including resilience scores, failure points, and recommendations for improvement.
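The four phases form a strictly linear sequence, which can be sketched as a small state machine. This is an illustration only: the `Phase` enum, `NEXT_PHASE` table, and `advance` and `lifecycle` functions are hypothetical names, not part of FailSafe's API. Only the four status labels come from the descriptions above.

```python
from enum import Enum


class Phase(Enum):
    """The four lifecycle phases, keyed by their status labels."""
    BASELINE = "baseline"
    INJECTING = "injecting"
    RECOVERING = "recovering"
    COMPLETED = "completed"


# Each phase has exactly one successor; "completed" is terminal.
NEXT_PHASE = {
    Phase.BASELINE: Phase.INJECTING,
    Phase.INJECTING: Phase.RECOVERING,
    Phase.RECOVERING: Phase.COMPLETED,
}


def advance(phase: Phase) -> Phase:
    """Return the phase that follows `phase`, or raise if it is terminal."""
    if phase not in NEXT_PHASE:
        raise ValueError(f"{phase.value} is a terminal phase")
    return NEXT_PHASE[phase]


def lifecycle() -> list[str]:
    """Walk the lifecycle from start to finish and return the status labels."""
    phase = Phase.BASELINE
    labels = [phase.value]
    while phase in NEXT_PHASE:
        phase = advance(phase)
        labels.append(phase.value)
    return labels
```

Calling `lifecycle()` returns `["baseline", "injecting", "recovering", "completed"]`, and `advance(Phase.COMPLETED)` raises a `ValueError` because no phase follows completion.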

Fault Types

FailSafe supports various fault types across different platforms. Each fault type simulates specific failure scenarios.

Fault Type      Platform   Description
cpu_stress      Backend    Consumes CPU cycles to simulate high load
memory_stress   Backend    Allocates memory to simulate memory pressure
kill            Backend    Terminates container processes
network_delay   Backend    Adds latency to network packets
packet_loss     Backend    Drops a percentage of network packets
latency         Frontend   Delays API responses
error           Frontend   Returns error responses
network         Frontend   Simulates network failures

Intensity Model

Intensity controls the severity of fault injection on a scale of 0-100. The interpretation varies by fault type.

Configuration Parameters

Step Intensity

How much intensity increases per interval in adaptive mode. Default: 10

Max Intensity

Upper limit for intensity. Injection stops when reached. Default: 100

Current Intensity

The live intensity level during the injection phase.
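To see how step intensity and max intensity interact, the following sketch lists the levels an adaptive run would visit if metrics never crossed a threshold. The function name `intensity_schedule` is hypothetical, and clamping the final level to the maximum when the step does not divide it evenly is an assumption, not documented FailSafe behavior.

```python
def intensity_schedule(step: int = 10, max_intensity: int = 100) -> list[int]:
    """List the intensity levels visited when every interval increases.

    Starts at `step` (the documented starting point) and adds `step`
    per interval. Assumption: the last level is clamped to
    `max_intensity` rather than overshooting it.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    if not 0 < max_intensity <= 100:
        raise ValueError("max_intensity must be in the range 1-100")

    levels = []
    current = step
    while current < max_intensity:
        levels.append(current)
        current += step
    levels.append(max_intensity)  # final level, clamped to the upper limit
    return levels
```

With the defaults, `intensity_schedule()` returns `[10, 20, 30, 40, 50, 60, 70, 80, 90, 100]`. With a step of 30 it returns `[30, 60, 90, 100]`.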

Intensity Meanings

CPU/Memory Stress

Percentage of resources to consume (e.g., 50 = 50% CPU)

Latency

Milliseconds of delay (e.g., 100 = 100ms added latency)

Packet Loss/Errors

Percentage of affected requests (e.g., 30 = 30% error rate)
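The per-fault interpretations above can be summarized as a lookup from fault type to a human-readable meaning. This is a sketch under stated assumptions: `INTENSITY_MEANING` and `describe_intensity` are hypothetical names, and only the fault types whose interpretation is described above are mapped. How intensity applies to `kill`, `network`, and `network_delay` is not stated here, so those raise an error rather than guess.

```python
# Templates for the fault types whose intensity meaning is documented.
INTENSITY_MEANING = {
    "cpu_stress": "consume {n}% of CPU",
    "memory_stress": "consume {n}% of memory",
    "latency": "add {n} ms of delay",
    "packet_loss": "drop {n}% of packets",
    "error": "fail {n}% of requests",
}


def describe_intensity(fault_type: str, intensity: int) -> str:
    """Translate a 0-100 intensity into its meaning for one fault type."""
    if not 0 <= intensity <= 100:
        raise ValueError("intensity must be in the range 0-100")
    if fault_type not in INTENSITY_MEANING:
        # Assumption: unmapped types are rejected instead of guessed at.
        raise KeyError(f"no documented intensity meaning for {fault_type!r}")
    return INTENSITY_MEANING[fault_type].format(n=intensity)
```

For example, `describe_intensity("cpu_stress", 50)` returns `"consume 50% of CPU"` and `describe_intensity("latency", 100)` returns `"add 100 ms of delay"`.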

Adaptive Testing

Adaptive mode automatically adjusts fault intensity based on system response, locating the intensity level at which your system begins to degrade.

How Adaptive Mode Works

When enabled, FailSafe monitors key metrics during injection and adjusts intensity according to the following rules:

  1. Start Low: Begin at the configured step intensity
  2. Monitor: Track response times, error rates, and throughput
  3. Increase: If metrics stay within thresholds, increase intensity by step value
  4. Hold: If metrics cross a hold threshold, maintain current intensity
  5. Stop: If metrics cross a stop threshold or max intensity is reached, begin recovery
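The five steps above can be sketched as a control loop. This is a minimal illustration, not FailSafe's implementation: `run_adaptive`, the `decide` callback, and the `max_intervals` safety cap are hypothetical. The sketch also assumes that the maximum intensity is applied for one interval before recovery begins.

```python
from typing import Callable, Literal

Decision = Literal["increase", "hold", "stop"]


def run_adaptive(
    decide: Callable[[int], Decision],
    step: int = 10,
    max_intensity: int = 100,
    max_intervals: int = 50,
) -> list[int]:
    """Return the intensity applied at each interval until recovery begins.

    `decide` stands in for metric monitoring: given the current
    intensity, it reports whether to increase, hold, or stop.
    """
    history = []
    intensity = min(step, max_intensity)  # 1. Start low
    for _ in range(max_intervals):  # hypothetical cap so a long hold ends
        history.append(intensity)
        decision = decide(intensity)  # 2. Monitor
        if decision == "stop":  # 5. Stop: a stop threshold was crossed
            break
        if intensity >= max_intensity:  # 5. Stop: max intensity reached
            break
        if decision == "increase":  # 3. Increase by the step value
            intensity = min(intensity + step, max_intensity)
        # 4. Hold: leave intensity unchanged for the next interval
    return history
```

For a system that degrades at intensity 40, `run_adaptive(lambda i: "increase" if i < 40 else "stop")` returns `[10, 20, 30, 40]`, so the last entry marks the level at which recovery began.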

Default Thresholds

  • Response time: >2x baseline triggers hold
  • Error rate: >5% triggers hold, >20% triggers stop
  • Throughput: <50% baseline triggers stop
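As a sketch of how these default thresholds combine into a single decision, the function below checks stop conditions before hold conditions. The `Metrics` class, its field names, and `evaluate_thresholds` are hypothetical. The sketch assumes error rate is an absolute fraction of requests (0.05 for 5%) and that stop takes precedence when both a hold and a stop threshold are crossed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    response_time_ms: float
    error_rate: float      # fraction of failed requests, 0.0-1.0
    throughput_rps: float  # requests per second


def evaluate_thresholds(baseline: Metrics, current: Metrics) -> str:
    """Return "stop", "hold", or "increase" using the default thresholds."""
    # Stop conditions are checked first (assumed precedence).
    if current.error_rate > 0.20:
        return "stop"
    if current.throughput_rps < 0.5 * baseline.throughput_rps:
        return "stop"
    # Hold conditions.
    if current.response_time_ms > 2 * baseline.response_time_ms:
        return "hold"
    if current.error_rate > 0.05:
        return "hold"
    # All metrics are within thresholds.
    return "increase"
```

Given a baseline of `Metrics(100, 0.0, 1000)`, a reading of `Metrics(250, 0.01, 900)` yields `"hold"` because response time exceeds twice the baseline, while `Metrics(150, 0.01, 400)` yields `"stop"` because throughput has fallen below half the baseline.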