Core Concepts

Understanding FailSafe

Master the fundamental concepts behind chaos engineering with FailSafe.

Experiment Lifecycle

Every FailSafe experiment progresses through four distinct phases. Understanding these phases is crucial for designing effective resilience tests.

baseline

Baseline Collection

The system operates normally while FailSafe collects performance metrics. This establishes a reference point for comparing behavior during fault injection. Duration is typically 10-30 seconds.

Response times are measured
Error rates are recorded
Resource utilization is tracked

injecting

Fault Injection

Faults are actively being injected into the system. In adaptive mode, intensity increases gradually based on system response. Metrics continue to be collected.

Intensity starts at the configured step value
Adaptive mode adjusts intensity based on thresholds
Manual stop is available at any time

recovering

Recovery Period

All faults are removed and the system is monitored to ensure it returns to normal operation. This phase validates the system's ability to recover gracefully.

Fault injection stops immediately
Metrics are compared against the baseline
Recovery time is measured

completed

Completion

The experiment has finished. Results are compiled, including resilience scores, failure points, and recommendations for improvement.
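The four phases above form a simple linear state machine. The following is a minimal sketch of that ordering, not FailSafe's actual implementation: the phase values come from the status labels above, while the names Phase, NEXT_PHASE, advance, and lifecycle are hypothetical and used only for illustration.

```python
from enum import Enum


class Phase(Enum):
    # Values mirror the status labels used in this guide.
    BASELINE = "baseline"
    INJECTING = "injecting"
    RECOVERING = "recovering"
    COMPLETED = "completed"


# Each phase has exactly one successor; "completed" is terminal.
NEXT_PHASE = {
    Phase.BASELINE: Phase.INJECTING,
    Phase.INJECTING: Phase.RECOVERING,
    Phase.RECOVERING: Phase.COMPLETED,
}


def advance(phase: Phase) -> Phase:
    """Return the phase that follows `phase`, or raise if it is terminal."""
    if phase not in NEXT_PHASE:
        raise ValueError(f"{phase.value} is the final phase")
    return NEXT_PHASE[phase]


def lifecycle() -> list[str]:
    """Walk from baseline to completion and return the phase names in order."""
    phase = Phase.BASELINE
    seen = [phase.value]
    while phase in NEXT_PHASE:
        phase = advance(phase)
        seen.append(phase.value)
    return seen
```

Calling lifecycle() returns ["baseline", "injecting", "recovering", "completed"]. Modeling the order as a lookup table keeps illegal transitions, such as jumping from baseline straight to recovering, out of reach.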

Fault Types

FailSafe supports various fault types across different platforms. Each fault type simulates specific failure scenarios.

Fault Type	Platform	Description
cpu_stress	Backend	Consumes CPU cycles to simulate high load
memory_stress	Backend	Allocates memory to simulate memory pressure
kill	Backend	Terminates container processes
network_delay	Backend	Adds latency to network packets
packet_loss	Backend	Drops a percentage of network packets
latency	Frontend	Delays API responses
error	Frontend	Returns error responses
network	Frontend	Simulates network failures
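The table above can be treated as a small lookup structure, for example to check that a fault type is valid for a given platform before an experiment starts. This is an illustrative sketch only: the fault names and platforms are taken from the table, while FAULT_PLATFORMS, faults_for, and is_supported are hypothetical names, not part of any FailSafe API.

```python
# Fault type -> platform, copied from the table above.
FAULT_PLATFORMS = {
    "cpu_stress": "backend",
    "memory_stress": "backend",
    "kill": "backend",
    "network_delay": "backend",
    "packet_loss": "backend",
    "latency": "frontend",
    "error": "frontend",
    "network": "frontend",
}


def faults_for(platform: str) -> list[str]:
    """List the fault types available on `platform`, in table order."""
    wanted = platform.lower()
    return [name for name, plat in FAULT_PLATFORMS.items() if plat == wanted]


def is_supported(fault_type: str, platform: str) -> bool:
    """Return True if `fault_type` exists and targets `platform`."""
    return FAULT_PLATFORMS.get(fault_type) == platform.lower()
```

For example, faults_for("Frontend") returns ["latency", "error", "network"], and is_supported("kill", "frontend") returns False because kill is a backend fault.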

Intensity Model

Intensity controls the severity of fault injection on a scale of 0-100. The interpretation varies by fault type.

Configuration Parameters

Step Intensity

How much intensity increases per interval in adaptive mode. Default: 10

Max Intensity

Upper limit for intensity. Injection stops and recovery begins once this level is reached. Default: 100

Current Intensity

Real-time intensity level during injection phase.
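To see how the step and max parameters interact, the sketch below lists the intensity levels an experiment would pass through if every increase were allowed. The function name intensity_schedule is hypothetical, and clamping the last level to the maximum when the step does not divide it evenly is an assumption, since this guide does not specify that case.

```python
def intensity_schedule(step: int = 10, max_intensity: int = 100) -> list[int]:
    """Return the intensity levels visited when intensity rises every interval.

    Starts at `step`, increases by `step`, and ends at `max_intensity`.
    Assumption: the final level is clamped to `max_intensity`.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    if not 0 < max_intensity <= 100:
        raise ValueError("max_intensity must be in the range 1-100")

    levels = []
    current = step
    while current < max_intensity:
        levels.append(current)
        current += step
    levels.append(max_intensity)  # clamp the last level to the upper limit
    return levels
```

With the defaults, intensity_schedule() returns [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]. With a step of 30, it returns [30, 60, 90, 100] under the clamping assumption.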

Intensity Meanings

CPU/Memory Stress

Percentage of resources to consume (e.g., 50 = 50% CPU)

Latency

Milliseconds of delay (e.g., 100 = 100ms added latency)

Packet Loss/Errors

Percentage of affected packets or requests (e.g., 30 = 30% error rate)
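The mapping from an intensity value to its concrete effect can be sketched as a small function. This is an illustration under stated assumptions: describe_intensity is a hypothetical helper, and it covers only the fault types whose meaning is listed above. Fault types without a documented interpretation here (kill, network_delay, network) are rejected rather than guessed.

```python
def describe_intensity(fault_type: str, intensity: int) -> str:
    """Translate an intensity value into a human-readable effect."""
    if not 0 <= intensity <= 100:
        raise ValueError("intensity must be in the range 0-100")

    if fault_type == "cpu_stress":
        return f"consume {intensity}% of CPU"
    if fault_type == "memory_stress":
        return f"consume {intensity}% of memory"
    if fault_type == "latency":
        return f"add {intensity} ms of delay"
    if fault_type == "packet_loss":
        return f"drop {intensity}% of packets"
    if fault_type == "error":
        return f"fail {intensity}% of requests"

    # No interpretation is documented for the remaining fault types.
    raise ValueError(f"no documented intensity meaning for {fault_type!r}")
```

For example, describe_intensity("cpu_stress", 50) returns "consume 50% of CPU" and describe_intensity("latency", 100) returns "add 100 ms of delay".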

Adaptive Testing

Adaptive mode automatically adjusts fault intensity based on system response, locating the intensity level, to within one step, at which your system begins to degrade.

How Adaptive Mode Works

When enabled, FailSafe monitors key metrics during injection and applies the following rules at each interval:

Start Low: Begin at the configured step intensity
Monitor: Track response times, error rates, and throughput
Increase: If metrics stay within thresholds, increase intensity by step value
Hold: If a metric crosses a hold threshold, maintain current intensity
Stop: If a metric crosses a stop threshold or max intensity is reached, begin recovery
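The steps above can be sketched as a control loop. This is a simplified illustration, not FailSafe's implementation: run_adaptive and its decide callback are hypothetical names, the callback stands in for the metric checks, and max_intervals is an assumed safety limit, because this guide does not say how long a hold may last.

```python
from typing import Callable


def run_adaptive(
    decide: Callable[[int], str],
    step: int = 10,
    max_intensity: int = 100,
    max_intervals: int = 50,  # assumption: caps the loop during long holds
) -> list[int]:
    """Return the intensity applied at each interval until injection stops.

    `decide(intensity)` must return "increase", "hold", or "stop".
    """
    intensity = min(step, max_intensity)  # Start Low
    history = []
    for _ in range(max_intervals):
        history.append(intensity)
        if intensity >= max_intensity:
            break  # max intensity reached: begin recovery
        action = decide(intensity)  # Monitor
        if action == "stop":
            break  # thresholds exceeded: begin recovery
        if action == "increase":
            intensity = min(intensity + step, max_intensity)
        elif action != "hold":
            raise ValueError(f"unknown action: {action!r}")
    return history
```

As a usage example, a system that tolerates every level below 40 could be simulated with run_adaptive(lambda i: "increase" if i < 40 else "stop"), which returns [10, 20, 30, 40]. The last entry is the intensity at which the stop condition fired.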

Default Thresholds

Response time: >2x baseline triggers hold
Error rate: >5% triggers hold, >20% triggers stop
Throughput: <50% baseline triggers stop
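The default thresholds can be expressed as a single decision function. The sketch below assumes that a stop condition takes precedence over a hold condition when both apply, which this guide does not state explicitly. The Metrics class, its field names, and decide are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    response_time_ms: float
    error_rate: float      # fraction of failed requests, 0.0-1.0
    throughput_rps: float  # requests per second


def decide(baseline: Metrics, current: Metrics) -> str:
    """Apply the default thresholds and return "increase", "hold", or "stop"."""
    # Stop conditions are checked first (assumed precedence).
    if current.error_rate > 0.20:
        return "stop"
    if current.throughput_rps < 0.5 * baseline.throughput_rps:
        return "stop"

    # Hold conditions.
    if current.response_time_ms > 2 * baseline.response_time_ms:
        return "hold"
    if current.error_rate > 0.05:
        return "hold"

    return "increase"
```

For example, against a baseline of Metrics(100, 0.01, 200), a reading of Metrics(250, 0.02, 180) yields "hold" because response time is above twice the baseline, while Metrics(120, 0.02, 90) yields "stop" because throughput has fallen below half the baseline.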