The Explosion
We called an external API to get extra details about vehicles. The API could handle 100 requests per second. We normally made about 30.
Then the API went down for 2 minutes.
When it came back, we didn't make 30 requests per second. We made 3,000. The API went down again. And again. And again.
This is called a retry storm. And it can turn a 2-minute outage into an all-night nightmare.
What Is a Retry?
First, let's understand the basics.
A retry is when something fails, so you try again.
SIMPLE RETRY:
You: "Hey API, give me vehicle details"
API: "Sorry, I'm busy right now"
You: "Okay, I'll ask again"
You: "Hey API, give me vehicle details"
API: "Here you go!"This seems harmless. And usually it is.
The problem starts when EVERYONE retries at the same time.
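In code, the naive pattern usually looks something like this sketch (the endpoint URL and the requests-based client here are illustrative, not our actual code):
import requests

def get_vehicle_details_naive(vehicle_id, max_attempts=3):
    """Naive retry: if the call fails, try again immediately."""
    for attempt in range(max_attempts):
        try:
            # Hypothetical endpoint, shown only for illustration
            response = requests.get(
                f"https://api.example.com/vehicles/{vehicle_id}", timeout=5
            )
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            continue  # No waiting before the next attempt -- that's the dangerous part
    raise RuntimeError(f"All {max_attempts} attempts failed for vehicle {vehicle_id}")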
How a Retry Storm Happens
Let me show you the math that destroyed our weekend.
Normal operation:
Every second:
- 30 new requests come in
- 30 requests get processed
- Balance: 0 waiting
Life is good.
When the API goes down:
Second 0:
- 30 new requests come in
- API down → 30 requests fail
- All 30 retry immediately
Second 1:
- 30 NEW requests come in
- 30 RETRIES from second 0
- Total: 60 requests
- API still down → 60 fail
- All 60 retry immediately
Second 2:
- 30 NEW requests
- 60 RETRIES from second 1
- Total: 90 requests
- All 90 fail and retry
...keep going...
Second 60 (1 minute):
- 30 NEW requests
- 1,800 RETRIES waiting
- Total: 1,830 requests
Second 120 (2 minutes):
- API comes back!
- 3,600 requests hit it AT ONCE
- API can handle 100 per second
- API crashes again 💥
Visual:
NORMAL:
Requests: ──────────────────────── (30/sec, flat line)
DURING OUTAGE (with naive retries):
Requests:                                     /
                                          /     (exponential
                                      /          growth!)
                                  /
                          /
          _______/ (start)
WHEN API COMES BACK:
💥 BOOM!
Requests: ████████████████████████████████████
          3,600 requests at once
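If you want to sanity-check the arithmetic, a tiny simulation (assuming, as in the breakdown above, that every failed request simply retries a second later and nothing is processed while the API is down) reproduces the pile-up:
def simulate_retry_storm(outage_seconds=120, new_per_second=30):
    """Count how many requests are waiting when the API comes back."""
    pending = 0
    for _ in range(outage_seconds):
        pending += new_per_second  # New traffic keeps arriving every second
        # The API is down, so nothing drains; everything just retries
    return pending

print(simulate_retry_storm())  # 3600 -- the spike that hits the recovered API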
Why "Try Again Immediately" Is Dangerous
Think of it like a traffic jam:
NORMAL TRAFFIC:
🚗 🚗 🚗 → → → 🚦 → → → (flowing nicely)
ACCIDENT HAPPENS (API down):
🚗 🚗 🚗 ⚠️ (cars pile up)
🚗 🚗 🚗 🚗 🚗 🚗 (more coming)
🚗 🚗 🚗 🚗 🚗 🚗 🚗 🚗 🚗 🚗 🚗 (pile grows)
ROAD REOPENS:
🚗🚗🚗🚗🚗🚗🚗🚗🚗🚗🚗🚗🚗🚗🚗🚗 → 🚦
All cars try to go at once!
Second accident happens.
When everyone retries immediately, you create a thundering herd that can keep the service down indefinitely.
Solution 1 - Exponential Backoff
The idea: don't retry immediately. Wait. And wait longer each time you fail. This is a well-documented pattern; see AWS's guide on Exponential Backoff and Jitter.
IMMEDIATE RETRY (Bad):
Attempt 1: Failed → Retry NOW
Attempt 2: Failed → Retry NOW
Attempt 3: Failed → Retry NOW
Attempt 4: Failed → Retry NOW
All attempts bunched together.
EXPONENTIAL BACKOFF (Good):
Attempt 1: Failed → Wait 1 second
Attempt 2: Failed → Wait 2 seconds
Attempt 3: Failed → Wait 4 seconds
Attempt 4: Failed → Wait 8 seconds
Attempt 5: Failed → Wait 16 seconds
Attempts spread out over time.
Visual:
IMMEDIATE RETRY:
Time: 0 1 2 3 4 5
      ▼ ▼ ▼ ▼ ▼ ▼
ALL ATTEMPTS BUNCHED TOGETHER
EXPONENTIAL BACKOFF:
Time: 0         1         3         7
      ▼         ▼         ▼         ▼
  Attempt1  Attempt2  Attempt3  Attempt4
           (1s wait)  (2s wait)  (4s wait)
The code (using Celery's built-in retry backoff):
from celery import shared_task

@shared_task(
    autoretry_for=(Exception,),  # Retry automatically when the task raises
    max_retries=5,
    retry_backoff=True,          # Enable exponential backoff (1s, 2s, 4s, ...)
    retry_backoff_max=600,       # Cap the delay at 10 minutes
)
def enrich_vehicle(vehicle_id):
    data = external_api.get(vehicle_id)
    save(vehicle_id, data)
Solution 2 - Jitter (Randomness)
Exponential backoff is good, but there's still a problem:
100 REQUESTS FAIL AT THE SAME TIME:
With pure exponential backoff:
All 100 wait 1 second → All 100 retry together
All 100 wait 2 seconds → All 100 retry together
All 100 wait 4 seconds → All 100 retry together
They're SYNCHRONIZED. Still a thundering herd.
Solution: Add randomness (jitter)
With jitter:
Request 1: Wait 1.0 + random(0, 0.5) = 1.3 seconds
Request 2: Wait 1.0 + random(0, 0.5) = 1.1 seconds
Request 3: Wait 1.0 + random(0, 0.5) = 1.4 seconds
...
Now they're SPREAD OUT.
Visual:
WITHOUT JITTER:
Time: 0        1        3
      │        │        │
      ▼        ▼        ▼
    ████     ████     ████
    (all)    (all)    (all)
WITH JITTER:
Time: 0  0.5  1  1.5  2  2.5  3  3.5  4
      │   │   │   │   │   │   │   │   │
      ▼   ▼   ▼   ▼   ▼   ▼   ▼   ▼   ▼
      █   █   █   █   █   █   █   █   █
(spread out randomly)
The code:
import random

def retry_with_backoff(attempt):
    base_delay = 2 ** attempt                     # 1, 2, 4, 8, 16...
    jitter = random.uniform(0, base_delay * 0.5)  # Add 0-50% randomness
    total_delay = base_delay + jitter
    max_delay = 300                               # Cap at 5 minutes
    return min(total_delay, max_delay)
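If you're not inside Celery, the same helper can drive a hand-rolled retry loop. A minimal sketch (the call_api callable is a placeholder for your real client call):
import time

def call_with_retries(call_api, max_attempts=5):
    """Retry a callable, sleeping with exponential backoff + jitter between attempts."""
    for attempt in range(max_attempts):
        try:
            return call_api()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts, surface the error
            time.sleep(retry_with_backoff(attempt))  # Wait before trying again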
Solution 3 - Circuit Breakers
Even with backoff and jitter, you're still trying when you know the service is down.
A circuit breaker says: "If things are failing, stop trying for a while." This pattern was popularized by Michael Nygard and is explained well in Martin Fowler's article on Circuit Breakers.
Think of it like an electrical circuit breaker in your house:
ELECTRICAL CIRCUIT BREAKER:
Normal: Power flows → Appliances work
Overload: Too much power → Breaker TRIPS → Power cut off
Recovery: Wait → Reset breaker → Try again
This prevents your house from burning down.
SOFTWARE CIRCUIT BREAKER:
Normal: Requests flow → API responds
Failures: Too many failures → Circuit OPENS → Stop sending requests
Recovery: Wait → Try one request → If works, resume
This prevents cascading failures.
The three states:
┌───────────────────────────────────────────────────────────┐
│                                                           │
│   CLOSED (Normal)                                         │
│   ──────────────                                          │
│   • Requests flow normally                                │
│   • Counting failures                                     │
│   • If failures > threshold → Go to OPEN                  │
│                                                           │
└───────────────────────────────────────────────────────────┘
                            │
                            │ Too many failures
                            ▼
┌───────────────────────────────────────────────────────────┐
│                                                           │
│   OPEN (Protecting)                                       │
│   ────────────────                                        │
│   • ALL requests rejected immediately                     │
│   • Not even trying                                       │
│   • After timeout → Go to HALF-OPEN                       │
│                                                           │
└───────────────────────────────────────────────────────────┘
                            │
                            │ Timeout passed
                            ▼
┌───────────────────────────────────────────────────────────┐
│                                                           │
│   HALF-OPEN (Testing)                                     │
│   ──────────────────                                      │
│   • Allow a FEW test requests through                     │
│   • If they succeed → Go to CLOSED                        │
│   • If they fail → Go back to OPEN                        │
│                                                           │
└───────────────────────────────────────────────────────────┘
Analogy:
CIRCUIT BREAKER = A SMART EMPLOYEE
Without circuit breaker:
Boss: "Call the supplier"
Employee: *calls* "They're not answering"
Boss: "Try again"
Employee: *calls* "Still not answering"
Boss: "Try again"
Employee: *calls 100 times* "Still nothing"
(Wasting time, annoying supplier)
With circuit breaker:
Boss: "Call the supplier"
Employee: *calls* "They're not answering"
Employee: *calls* "They're not answering"
Employee: "They've failed 5 times. I'm not calling for 10 minutes."
Boss: "Good idea. Try again later."
(Saving time, giving supplier space to recover)
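In code, a bare-bones version of the pattern might look like the sketch below. This is my own minimal implementation, not a library API; in production you'd likely reach for a battle-tested package instead.
import time

class CircuitBreaker:
    """Minimal circuit breaker: trips OPEN after too many failures,
    then allows a test request (HALF-OPEN) once the cooldown has passed."""

    def __init__(self, failure_threshold=5, reset_timeout=120):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means CLOSED

    def allow_request(self):
        if self.opened_at is None:
            return True  # CLOSED: requests flow normally
        if time.time() - self.opened_at >= self.reset_timeout:
            return True  # HALF-OPEN: let a test request through
        return False     # OPEN: reject immediately, don't even try

    def record_success(self):
        self.failures = 0
        self.opened_at = None  # Test request worked -> back to CLOSED

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.time()  # Trip the breaker -> OPEN

Call allow_request() before every external call, and record_success() or record_failure() afterwards; the breaker handles the state transitions.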
Solution 4 - Rate Limiting Yourself
Sometimes you need to limit yourself before the external API limits you.
EXTERNAL API LIMIT: 100 requests/second
WITHOUT SELF-LIMITING:
You: "Here's 3,000 requests!"
API: "BLOCKED. You're banned for an hour."
WITH SELF-LIMITING:
You: "I'll only send 80 requests/second"
You: "That leaves headroom for retries"
API: "Thanks for being considerate"Think of it like a speed limit:
Road speed limit: 100 mph
Safe driving speed: 80 mph (leaves room for error)
API rate limit: 100 req/sec
Safe request rate: 80 req/sec (leaves room for retries)
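A simple way to enforce that self-imposed limit on the client side is a token bucket. Here's a rough sketch (single-process and not thread-safe; the 80/sec numbers just mirror the example above):
import time

class TokenBucket:
    """Client-side rate limiter: tokens refill at `rate` per second,
    each request spends one token, and callers wait when the bucket is empty."""

    def __init__(self, rate=80, capacity=80):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def acquire(self):
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)  # Wait until a token is available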
Putting It All Together
Here's how I handle external API calls now:
BEFORE MAKING A REQUEST:
Step 1: Check circuit breaker
"Is the service known to be down?"
→ If yes, don't even try. Wait.
Step 2: Check rate limit
"Am I sending too many requests?"
→ If yes, wait a bit.
Step 3: Make the request
"Actually call the API"
Step 4a: If success
→ Record success
→ Circuit breaker stays healthy
Step 4b: If failure
→ Record failure
→ Calculate backoff with jitter
→ Schedule retry for later
Visual flow:
Request comes in
        │
        ▼
┌───────────────┐
│ Circuit open? │───► YES ──► Wait 60s, retry later
└───────────────┘
        │ NO
        ▼
┌───────────────┐
│ Rate limit    │───► YES ──► Wait 1s, retry
│ exceeded?     │
└───────────────┘
        │ NO
        ▼
┌───────────────┐
│ Make request  │
└───────────────┘
        │
        ├──► SUCCESS ──► Done! Record success.
        │
        └──► FAILURE ──► Record failure
                             │
                             ▼
                   Calculate backoff:
                     base = 2^attempts
                     jitter = random(0, base/2)
                     delay = base + jitter
                             │
                             ▼
                   Schedule retry for later
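Sketched in code, using the CircuitBreaker and TokenBucket classes and the retry_with_backoff helper from earlier (schedule_retry is a hypothetical stand-in for however you re-queue work, e.g. a Celery countdown):
breaker = CircuitBreaker(failure_threshold=10, reset_timeout=120)
limiter = TokenBucket(rate=80, capacity=80)

def enrich_vehicle_safely(vehicle_id, attempt=0):
    # Step 1: is the service known to be down?
    if not breaker.allow_request():
        schedule_retry(vehicle_id, delay=60, attempt=attempt)  # Don't even try yet
        return
    # Step 2: stay under our self-imposed rate limit
    limiter.acquire()
    try:
        # Step 3: actually call the API
        data = external_api.get(vehicle_id)
    except Exception:
        # Step 4b: record the failure, back off with jitter, retry later
        breaker.record_failure()
        schedule_retry(vehicle_id, delay=retry_with_backoff(attempt), attempt=attempt + 1)
        return
    # Step 4a: record the success and save the result
    breaker.record_success()
    save(vehicle_id, data)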
The Results
Before (naive retries):
API goes down for 2 minutes
→ 3,600 retries pile up
→ API comes back
→ 3,600 requests hit at once
→ API goes down again
→ Cycle repeats for hours
After (smart retries):
API goes down for 2 minutes
→ Circuit breaker opens after 10 failures
→ No more requests sent
→ After 2 minutes, circuit half-opens
→ 3 test requests succeed
→ Circuit closes
→ Normal operation resumes
→ Total extra load: minimal
Key Lessons
Lesson 1: Distributed Systems Fail Together
When an API goes down, it's not one request that fails. It's all of them. And when all of them retry at once, you've created a problem worse than the original failure.
Lesson 2: Be Patient
Immediate retries feel logical ("it just failed, try again!") but they're dangerous at scale. Waiting is actually the smart thing to do.
Lesson 3: Add Randomness
Computers are deterministic. That's usually good. But when 1,000 computers all retry at exactly the same time, determinism creates thundering herds. Randomness spreads things out.
Lesson 4: Know When to Give Up
The circuit breaker pattern sounds like "giving up." But it's actually "being smart." If something is broken, hammering it with requests makes things worse, not better.
Quick Reference
Exponential backoff:
delay = 2 ** attempt_number  # 1, 2, 4, 8, 16...
Add jitter:
delay = base_delay + random.uniform(0, base_delay * 0.5)
Circuit breaker states:
CLOSED → Normal operation
OPEN → Stop all requests
HALF-OPEN → Test with a few requests
When to use each:
✅ Always use exponential backoff
✅ Always add jitter
✅ Use circuit breakers for external services
✅ Self-rate-limit below the API's limit
Summary
THE PROBLEM:
One 2-minute outage turned into hours of chaos
Retries piled up and overwhelmed the recovered API
WHY IT HAPPENED:
Naive retries: "Failed? Try again immediately!"
3,600 requests all retrying at once = thundering herd
THE FIX:
1. Exponential backoff (wait longer each time)
2. Jitter (spread out retries randomly)
3. Circuit breakers (stop trying when it's broken)
4. Rate limiting (don't exceed API limits)
THE RESULT:
2-minute outage → 2-minute recovery
Not 2-minute outage → hours of cascading failures
Retry storms happen because we optimize for the happy path. The fixes aren't complex; they're just easy to forget until you're debugging a 3am outage.
Related Reading
- Queue Sizing and Backpressure - Understanding how queues fill up and how to prevent overflow
- Celery Memory Leaks - Another common cause of 3am crashes in Python workers
- Batch to Event-Driven - How event-driven architectures handle failures differently
