The Explosion
We called an external API to get extra details about vehicles. The API could handle 100 requests per second. We normally made about 30.
Then the API went down for 2 minutes.
When it came back, we didn't make 30 requests per second. We made 3,000. The API went down again. And again. And again.
This is called a retry storm. And it can turn a 2-minute outage into an all-night nightmare.
What Is a Retry?
First, let's understand the basics.
A retry is when something fails, so you try again.
SIMPLE RETRY:
You: "Hey API, give me vehicle details"
API: "Sorry, I'm busy right now"
You: "Okay, I'll ask again"
You: "Hey API, give me vehicle details"
API: "Here you go!"This seems harmless. And usually it is.
The problem starts when EVERYONE retries at the same time.
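In code, the naive pattern usually looks something like this sketch (the endpoint URL and the requests-based client here are illustrative, not our actual code):
import requests

def get_vehicle_details_naive(vehicle_id, max_attempts=3):
    """Naive retry: if the call fails, try again immediately."""
    for attempt in range(max_attempts):
        try:
            # Hypothetical endpoint, shown only for illustration
            response = requests.get(
                f"https://api.example.com/vehicles/{vehicle_id}", timeout=5
            )
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            continue  # No waiting before the next attempt -- that's the dangerous part
    raise RuntimeError(f"All {max_attempts} attempts failed for vehicle {vehicle_id}")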
How a Retry Storm Happens
Let me show you the math that destroyed our weekend.
Normal operation:
Every second:
- 30 new requests come in
- 30 requests get processed
- Balance: 0 waiting
Life is good.
When the API goes down:
Second 0:
- 30 new requests come in
- API down → 30 requests fail
- All 30 retry immediately
Second 1:
- 30 NEW requests come in
- 30 RETRIES from second 0
- Total: 60 requests
- API still down → 60 fail
- All 60 retry immediately
Second 2:
- 30 NEW requests
- 60 RETRIES from second 1
- Total: 90 requests
- All 90 fail and retry
...keep going...
Second 60 (1 minute):
- 30 NEW requests
- 1,800 RETRIES waiting
- Total: 1,830 requests
Second 120 (2 minutes):
- API comes back!
- 3,600 requests hit it AT ONCE
- API can handle 100 per second
- API crashes again 💥
Visual:
NORMAL:
Requests: ──────────────────────── (30/sec, flat line)
DURING OUTAGE (with naive retries):
Requests:                                     /
                                          /     (exponential
                                      /          growth!)
                                  /
                          /
          _______/ (start)
WHEN API COMES BACK:
💥 BOOM!
Requests: ████████████████████████████████████
          3,600 requests at once
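If you want to sanity-check the arithmetic, a tiny simulation (assuming, as in the breakdown above, that every failed request simply retries a second later and nothing is processed while the API is down) reproduces the pile-up:
def simulate_retry_storm(outage_seconds=120, new_per_second=30):
    """Count how many requests are waiting when the API comes back."""
    pending = 0
    for _ in range(outage_seconds):
        pending += new_per_second  # New traffic keeps arriving every second
        # The API is down, so nothing drains; everything just retries
    return pending

print(simulate_retry_storm())  # 3600 -- the spike that hits the recovered API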
Why "Try Again Immediately" Is Dangerous
Think of it like a traffic jam:
NORMAL TRAFFIC:
🚗 🚗 🚗 → → → 🚦 → → → (flowing nicely)
ACCIDENT HAPPENS (API down):
🚗 🚗 🚗 ⚠️ (cars pile up)
🚗 🚗 🚗 🚗 🚗 🚗 (more coming)
🚗 🚗 🚗 🚗 🚗 🚗 🚗 🚗 🚗 🚗 🚗 (pile grows)
ROAD REOPENS:
🚗🚗🚗🚗🚗🚗🚗🚗🚗🚗🚗🚗🚗🚗🚗🚗 → 🚦
All cars try to go at once!
Second accident happens.
When everyone retries immediately, you create a thundering herd that can keep the service down indefinitely.
Solution 1 - Exponential Backoff
The idea: don't retry immediately. Wait. And wait longer each time you fail. This is a well-documented pattern; see AWS's guide on Exponential Backoff and Jitter.
IMMEDIATE RETRY (Bad):
Attempt 1: Failed → Retry NOW
Attempt 2: Failed → Retry NOW
Attempt 3: Failed → Retry NOW
Attempt 4: Failed → Retry NOW
All attempts bunched together.
EXPONENTIAL BACKOFF (Good):
Attempt 1: Failed → Wait 1 second
Attempt 2: Failed → Wait 2 seconds
Attempt 3: Failed → Wait 4 seconds
Attempt 4: Failed → Wait 8 seconds
Attempt 5: Failed → Wait 16 seconds
Attempts spread out over time.
Visual:
IMMEDIATE RETRY:
Time: 0 1 2 3 4 5
      ▼ ▼ ▼ ▼ ▼ ▼
ALL ATTEMPTS BUNCHED TOGETHER
EXPONENTIAL BACKOFF:
Time: 0         1         3         7
      ▼         ▼         ▼         ▼
  Attempt1  Attempt2  Attempt3  Attempt4
           (1s wait)  (2s wait)  (4s wait)
The code (using Celery's built-in retry backoff):
from celery import shared_task

@shared_task(
    autoretry_for=(Exception,),  # Retry automatically when the task raises
    max_retries=5,
    retry_backoff=True,          # Enable exponential backoff (1s, 2s, 4s, ...)
    retry_backoff_max=600,       # Cap the delay at 10 minutes
)
def enrich_vehicle(vehicle_id):
    data = external_api.get(vehicle_id)
    save(vehicle_id, data)
Solution 2 - Jitter (Randomness)
Exponential backoff is good, but there's still a problem:
100 REQUESTS FAIL AT THE SAME TIME:
With pure exponential backoff:
All 100 wait 1 second → All 100 retry together
All 100 wait 2 seconds → All 100 retry together
All 100 wait 4 seconds → All 100 retry together
They're SYNCHRONIZED. Still a thundering herd.
Solution: Add randomness (jitter)
With jitter:
Request 1: Wait 1.0 + random(0, 0.5) = 1.3 seconds
Request 2: Wait 1.0 + random(0, 0.5) = 1.1 seconds
Request 3: Wait 1.0 + random(0, 0.5) = 1.4 seconds
...
Now they're SPREAD OUT.
Visual:
WITHOUT JITTER:
Time: 0        1        3
      │        │        │
      ▼        ▼        ▼
    ████     ████     ████
    (all)    (all)    (all)
WITH JITTER:
Time: 0  0.5  1  1.5  2  2.5  3  3.5  4
      │   │   │   │   │   │   │   │   │
      ▼   ▼   ▼   ▼   ▼   ▼   ▼   ▼   ▼
      █   █   █   █   █   █   █   █   █
(spread out randomly)
The code:
import random

def retry_with_backoff(attempt):
    base_delay = 2 ** attempt                     # 1, 2, 4, 8, 16...
    jitter = random.uniform(0, base_delay * 0.5)  # Add 0-50% randomness
    total_delay = base_delay + jitter
    max_delay = 300                               # Cap at 5 minutes
    return min(total_delay, max_delay)
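If you're not inside Celery, the same helper can drive a hand-rolled retry loop. A minimal sketch (the call_api callable is a placeholder for your real client call):
import time

def call_with_retries(call_api, max_attempts=5):
    """Retry a callable, sleeping with exponential backoff + jitter between attempts."""
    for attempt in range(max_attempts):
        try:
            return call_api()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts, surface the error
            time.sleep(retry_with_backoff(attempt))  # Wait before trying again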
Solution 3 - Circuit Breakers
Even with backoff and jitter, you're still trying when you know the service is down.
A circuit breaker says: "If things are failing, stop trying for a while." This pattern was popularized by Michael Nygard and is explained well in Martin Fowler's article on Circuit Breakers.
Think of it like an electrical circuit breaker in your house:
ELECTRICAL CIRCUIT BREAKER:
Normal: Power flows → Appliances work
Overload: Too much power → Breaker TRIPS → Power cut off
Recovery: Wait → Reset breaker → Try again
This prevents your house from burning down.
SOFTWARE CIRCUIT BREAKER:
Normal: Requests flow → API responds
Failures: Too many failures → Circuit OPENS → Stop sending requests
Recovery: Wait → Try one request → If works, resume
This prevents cascading failures.
The three states:
┌───────────────────────────────────────────────────────────┐
│                                                           │
│   CLOSED (Normal)                                         │
│   ──────────────                                          │
│   • Requests flow normally                                │
│   • Counting failures                                     │
│   • If failures > threshold → Go to OPEN                  │
│                                                           │
└───────────────────────────────────────────────────────────┘
                            │
                            │ Too many failures
                            ▼
┌───────────────────────────────────────────────────────────┐
│                                                           │
│   OPEN (Protecting)                                       │
│   ────────────────                                        │
│   • ALL requests rejected immediately                     │
│   • Not even trying                                       │
│   • After timeout → Go to HALF-OPEN                       │
│                                                           │
└───────────────────────────────────────────────────────────┘
                            │
                            │ Timeout passed
                            ▼
┌───────────────────────────────────────────────────────────┐
│                                                           │
│   HALF-OPEN (Testing)                                     │
│   ──────────────────                                      │
│   • Allow a FEW test requests through                     │
│   • If they succeed → Go to CLOSED                        │
│   • If they fail → Go back to OPEN                        │
│                                                           │
└───────────────────────────────────────────────────────────┘
Analogy:
CIRCUIT BREAKER = A SMART EMPLOYEE
Without circuit breaker:
Boss: "Call the supplier"
Employee: *calls* "They're not answering"
Boss: "Try again"
Employee: *calls* "Still not answering"
Boss: "Try again"
Employee: *calls 100 times* "Still nothing"
(Wasting time, annoying supplier)
With circuit breaker:
Boss: "Call the supplier"
Employee: *calls* "They're not answering"
Employee: *calls* "They're not answering"
Employee: "They've failed 5 times. I'm not calling for 10 minutes."
Boss: "Good idea. Try again later."
(Saving time, giving supplier space to recover)
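In code, a bare-bones version of the pattern might look like the sketch below. This is my own minimal implementation, not a library API; in production you'd likely reach for a battle-tested package instead.
import time

class CircuitBreaker:
    """Minimal circuit breaker: trips OPEN after too many failures,
    then allows a test request (HALF-OPEN) once the cooldown has passed."""

    def __init__(self, failure_threshold=5, reset_timeout=120):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means CLOSED

    def allow_request(self):
        if self.opened_at is None:
            return True  # CLOSED: requests flow normally
        if time.time() - self.opened_at >= self.reset_timeout:
            return True  # HALF-OPEN: let a test request through
        return False     # OPEN: reject immediately, don't even try

    def record_success(self):
        self.failures = 0
        self.opened_at = None  # Test request worked -> back to CLOSED

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.time()  # Trip the breaker -> OPEN

Call allow_request() before every external call, and record_success() or record_failure() afterwards; the breaker handles the state transitions.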
Solution 4 - Rate Limiting Yourself
Sometimes you need to limit yourself before the external API limits you.
EXTERNAL API LIMIT: 100 requests/second
WITHOUT SELF-LIMITING:
You: "Here's 3,000 requests!"
API: "BLOCKED. You're banned for an hour."
WITH SELF-LIMITING:
You: "I'll only send 80 requests/second"
You: "That leaves headroom for retries"
API: "Thanks for being considerate"Think of it like a speed limit:
Road speed limit: 100 mph
Safe driving speed: 80 mph (leaves room for error)
API rate limit: 100 req/sec
Safe request rate: 80 req/sec (leaves room for retries)
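A simple way to enforce that self-imposed limit on the client side is a token bucket. Here's a rough sketch (single-process and not thread-safe; the 80/sec numbers just mirror the example above):
import time

class TokenBucket:
    """Client-side rate limiter: tokens refill at `rate` per second,
    each request spends one token, and callers wait when the bucket is empty."""

    def __init__(self, rate=80, capacity=80):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def acquire(self):
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)  # Wait until a token is available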
Putting It All Together
Here's how I handle external API calls now:
BEFORE MAKING A REQUEST:
Step 1: Check circuit breaker
"Is the service known to be down?"
→ If yes, don't even try. Wait.
Step 2: Check rate limit
"Am I sending too many requests?"
→ If yes, wait a bit.
Step 3: Make the request
"Actually call the API"
Step 4a: If success
→ Record success
→ Circuit breaker stays healthy
Step 4b: If failure
→ Record failure
→ Calculate backoff with jitter
→ Schedule retry for later
Visual flow:
Request comes in
        │
        ▼
┌───────────────┐
│ Circuit open? │───► YES ──► Wait 60s, retry later
└───────────────┘
        │ NO
        ▼
┌───────────────┐
│ Rate limit    │───► YES ──► Wait 1s, retry
│ exceeded?     │
└───────────────┘
        │ NO
        ▼
┌───────────────┐
│ Make request  │
└───────────────┘
        │
        ├──► SUCCESS ──► Done! Record success.
        │
        └──► FAILURE ──► Record failure
                             │
                             ▼
                   Calculate backoff:
                     base = 2^attempts
                     jitter = random(0, base/2)
                     delay = base + jitter
                             │
                             ▼
                   Schedule retry for later
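Sketched in code, using the CircuitBreaker and TokenBucket classes and the retry_with_backoff helper from earlier (schedule_retry is a hypothetical stand-in for however you re-queue work, e.g. a Celery countdown):
breaker = CircuitBreaker(failure_threshold=10, reset_timeout=120)
limiter = TokenBucket(rate=80, capacity=80)

def enrich_vehicle_safely(vehicle_id, attempt=0):
    # Step 1: is the service known to be down?
    if not breaker.allow_request():
        schedule_retry(vehicle_id, delay=60, attempt=attempt)  # Don't even try yet
        return
    # Step 2: stay under our self-imposed rate limit
    limiter.acquire()
    try:
        # Step 3: actually call the API
        data = external_api.get(vehicle_id)
    except Exception:
        # Step 4b: record the failure, back off with jitter, retry later
        breaker.record_failure()
        schedule_retry(vehicle_id, delay=retry_with_backoff(attempt), attempt=attempt + 1)
        return
    # Step 4a: record the success and save the result
    breaker.record_success()
    save(vehicle_id, data)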
The Results
Before (naive retries):
API goes down for 2 minutes
→ 3,600 retries pile up
→ API comes back
→ 3,600 requests hit at once
→ API goes down again
→ Cycle repeats for hours
After (smart retries):
API goes down for 2 minutes
→ Circuit breaker opens after 10 failures
→ No more requests sent
→ After 2 minutes, circuit half-opens
→ 3 test requests succeed
→ Circuit closes
→ Normal operation resumes
→ Total extra load: minimal
Key Lessons
Lesson 1: Distributed Systems Fail Together
When an API goes down, it's not one request that fails. It's all of them. And when all of them retry at once, you've created a problem worse than the original failure.
Lesson 2: Be Patient
Immediate retries feel logical ("it just failed, try again!") but they're dangerous at scale. Waiting is actually the smart thing to do.
Lesson 3: Add Randomness
Computers are deterministic. That's usually good. But when 1,000 computers all retry at exactly the same time, determinism creates thundering herds. Randomness spreads things out.
Lesson 4: Know When to Give Up
The circuit breaker pattern sounds like "giving up." But it's actually "being smart." If something is broken, hammering it with requests makes things worse, not better.
Quick Reference
Exponential backoff:
delay = 2 ** attempt_number  # 1, 2, 4, 8, 16...
Add jitter:
delay = base_delay + random.uniform(0, base_delay * 0.5)
Circuit breaker states:
CLOSED → Normal operation
OPEN → Stop all requests
HALF-OPEN → Test with a few requests
When to use each:
✅ Always use exponential backoff
✅ Always add jitter
✅ Use circuit breakers for external services
✅ Self-rate-limit below the API's limit
Summary
THE PROBLEM:
One 2-minute outage turned into hours of chaos
Retries piled up and overwhelmed the recovered API
WHY IT HAPPENED:
Naive retries: "Failed? Try again immediately!"
3,600 requests all retrying at once = thundering herd
THE FIX:
1. Exponential backoff (wait longer each time)
2. Jitter (spread out retries randomly)
3. Circuit breakers (stop trying when it's broken)
4. Rate limiting (don't exceed API limits)
THE RESULT:
2-minute outage → 2-minute recovery
Not 2-minute outage → hours of cascading failures
Retry storms happen because we optimize for the happy path. The fixes aren't complex; they're just easy to forget until you're debugging a 3am outage.
Related Reading
- Queue Sizing and Backpressure - Understanding how queues fill up and how to prevent overflow
- Celery Memory Leaks - Another common cause of 3am crashes in Python workers
- Batch to Event-Driven - How event-driven architectures handle failures differently
