The Meltdown
Our email system handled 50,000 emails per hour. Easy.
Then marketing launched a Black Friday campaign. Without telling engineering.
Suddenly we had 500,000 emails to send. Ten times our normal load.
The queue backed up. Memory spiked. RabbitMQ started dropping connections. Workers couldn't even report they were done.
The system didn't slow down gracefully. It collapsed.
What Is a Queue?
Think of a queue like a waiting line at a coffee shop:
CUSTOMERS BARISTA
↓ ↓
☕ ☕ ☕ ☕ ☕ → [WAITING LINE] → 👨‍🍳 → ☕ Ready!
Customers arrive.
They wait in line.
Barista makes one drink at a time.
In software:
REQUESTS WORKER
↓ ↓
📧 📧 📧 📧 📧 → [QUEUE] → 🤖 → ✓ Sent!
Email requests arrive.
They wait in the queue.
Worker sends one at a time.
The queue is just a waiting room. It holds work until workers are ready.
The Problem - When Lines Get Too Long
Queues work great when messages arrive no faster than workers can process them.
Normal day:
ARRIVAL RATE: 100 emails/minute
PROCESSING: 100 emails/minute
QUEUE LENGTH: ~0 (no backup)
Life is good!
Black Friday:
ARRIVAL RATE: 1000 emails/minute
PROCESSING: 100 emails/minute
DEFICIT: 900 emails/minute pile up
After 1 hour: 54,000 emails waiting
After 2 hours: 108,000 emails waiting
After 4 hours: 216,000 emails (SYSTEM CRASHES 💥)
The queue keeps growing until something breaks.
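The arithmetic above is easy to check yourself. Here's a small sketch that assumes arrival and processing rates stay constant, which real traffic never quite does. The function names are made up for illustration.

```python
def backlog_after(minutes, arrival_per_min, processed_per_min, start=0):
    """Messages waiting after `minutes`, assuming both rates stay constant."""
    growth_per_min = arrival_per_min - processed_per_min
    # A queue cannot hold a negative number of messages.
    return max(0, start + growth_per_min * minutes)


def minutes_until_full(capacity, arrival_per_min, processed_per_min):
    """Minutes until the backlog reaches `capacity`, or None if it never does."""
    growth_per_min = arrival_per_min - processed_per_min
    if growth_per_min <= 0:
        return None  # workers keep up, so the queue never fills
    return capacity / growth_per_min


# The Black Friday numbers from above: 1000 in, 100 out, per minute.
for hours in (1, 2, 4):
    waiting = backlog_after(hours * 60, arrival_per_min=1000, processed_per_min=100)
    print(f"After {hours} hour(s): {waiting:,} emails waiting")
```

Plug in your own rates and whatever depth your broker can realistically hold to get a rough idea of how long you have before trouble starts.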
Think of it like a restaurant:
NORMAL NIGHT:
10 customers/hour arrive
10 customers/hour seated
→ No wait
BLACK FRIDAY:
100 customers/hour arrive
10 customers/hour seated
→ Line goes out the door
→ Around the block
→ People start leaving (or worse, passing out)
What Happens When a Queue Gets Too Big
When the queue fills up, bad things happen:
STAGE 1: Queue Growing
─────────────────────────────
Queue: ████░░░░░░ (40% full)
Status: "Hmm, getting busy"
STAGE 2: Queue Large
─────────────────────────────
Queue: ████████░░ (80% full)
Status: "Workers can't keep up"
Memory: Climbing
STAGE 3: Queue Critical
─────────────────────────────
Queue: ██████████ (100% full)
Status: "No room for new messages"
Memory: Almost maxed
STAGE 4: Collapse
─────────────────────────────
Queue: 💥💥💥💥💥
Status: "RabbitMQ out of memory"
"Workers can't connect"
"New messages rejected"
"Everything stops"
The First Question - What Should Happen When It's Full?
This is the key decision. When the queue is full, you have three options:
Option 1: Drop Old Messages (Oldest First)
QUEUE FULL:
[Old] [Old] [Old] [Old] [Old] → NEW MESSAGE ARRIVES
ACTION: Drop oldest message
QUEUE: [Old] [Old] [Old] [Old] [New]
DROPPED: [Old]
"Sorry, that message waited too long. It's gone."
Good for: Notifications, status updates, things that expire anyway
Bad for: Important emails, financial transactions
Option 2: Reject New Messages
QUEUE FULL:
[Old] [Old] [Old] [Old] [Old] → NEW MESSAGE ARRIVES
ACTION: Reject the new message
NEW MESSAGE: "Sorry, queue is full. Try later."
"Come back when we have room."
Good for: When you want to tell senders to slow down
Bad for: Fire-and-forget systems
Option 3: Move to Overflow (Dead Letter Queue)
QUEUE FULL:
[Old] [Old] [Old] [Old] [Old] → NEW MESSAGE ARRIVES
ACTION: Send to overflow queue for later
MAIN QUEUE: [Old] [Old] [Old] [Old] [Old]
OVERFLOW: [New]
"We'll get to you, just not right now."
Good for: When you can't lose messages but can delay them
The Better Solution - Don't Let It Get Full
The real fix isn't "what to do when full." It's "how to prevent it from getting full."
This is called backpressure. It means telling senders to slow down.
Think of it like a bouncer at a club:
WITHOUT BACKPRESSURE:
Club capacity: 100 people
People trying to enter: 1000
Result: Chaos, crushing, fire hazard
WITH BACKPRESSURE (Bouncer):
Club capacity: 100 people
People trying to enter: 1000
Bouncer: "Sorry, we're at capacity. Line forms here."
Result: Orderly line, people wait their turn
In code, backpressure means:
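class BackpressureError(Exception):
    """Raised when the queue is too deep to accept more work."""

MAX_QUEUE_DEPTH = 50000

def send_email(email_data):
    # get_queue_depth() and queue stand in for your broker client
    # Check queue depth first
    current_depth = get_queue_depth()
    if current_depth > MAX_QUEUE_DEPTH:
        # DON'T add more to the queue
        # Either:
        # 1. Return an error
        # 2. Wait until there's room
        # 3. Save for later
        raise BackpressureError("Queue full, try again later")
    # Queue has room - proceed
    queue.send(email_data)
Priority - Not All Messages Are Equal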
Here's a key insight: When overloaded, focus on what matters most.
EMAIL TYPES:
CRITICAL: Password resets, security alerts
→ Never drop. Ever.
HIGH: Order confirmations, receipts
→ Should send, but can wait
NORMAL: Marketing campaigns
→ Nice to send, okay to drop some
LOW: Newsletters, digests
→ Can wait hours, can be dropped
The strategy:
NORMAL DAY:
Critical: ██ (handled immediately)
High: ████ (handled quickly)
Normal: ████████ (bulk of traffic)
Low: ██ (background)
OVERLOADED (Black Friday):
Critical: ██ (STILL handled immediately)
High: ████ (still priority)
Normal: ████░░░░ (some dropped)
Low: ░░ (mostly dropped)
Critical emails ALWAYS go through.
Low-priority emails can wait or be dropped.
Think of it like a hospital emergency room:
NORMAL DAY:
Heart attack patient → Immediate attention
Broken arm → Wait 30 minutes
Cold symptoms → Wait 2 hours
MASS CASUALTY EVENT:
Heart attack patient → STILL immediate attention
Broken arm → Wait longer
Cold symptoms → "Please come back tomorrow"
You don't ignore heart attacks because there's a crowd.
The Visual Flow
Here's how I handle incoming email requests now:
EMAIL REQUEST ARRIVES
│
▼
┌─────────────────────────────┐
│ What's the priority? │
└─────────────────────────────┘
│
┌────┴────┬────────┬────────┐
▼ ▼ ▼ ▼
CRITICAL HIGH NORMAL LOW
│ │ │ │
▼ ▼ ▼ ▼
┌───────┐ ┌──────┐ ┌──────┐ ┌──────┐
│Queue │ │Queue │ │Queue │ │Queue │
│Limit: │ │Limit:│ │Limit:│ │Limit:│
│Unlim. │ │50000 │ │100K │ │10000 │
└───────┘ └──────┘ └──────┘ └──────┘
│ │ │ │
▼ ▼ ▼ ▼
Always Usually Maybe Often
Sent Sent Sent Dropped
Rate Limiting Yourself
Another way to prevent overload: Control how fast you add to the queue.
WITHOUT RATE LIMITING:
Marketing: "Send 500,000 emails NOW!"
System: *tries to queue 500,000 at once*
System: 💥
WITH RATE LIMITING:
Marketing: "Send 500,000 emails NOW!"
System: "I'll add 100 per second to the queue"
System: "That's 6,000 per minute"
System: "All 500,000 will be queued in ~83 minutes"
System: ✓ (Still running smoothly)
Think of it like filling a bathtub:
WITHOUT RATE LIMITING:
You: *turns faucet to maximum*
Bathtub: *overflows*
WITH RATE LIMITING:
You: *turns faucet to medium*
Bathtub: *fills slowly but doesn't overflow*
Dead Letter Queues - The Safety Net
What happens to messages that can't be delivered?
They go to a Dead Letter Queue (DLQ) - a special place for problem messages. See RabbitMQ's documentation on Dead Letter Exchanges for the implementation details.
NORMAL MESSAGE:
Request → Queue → Worker → ✓ Done!
PROBLEM MESSAGE:
Request → Queue → Worker → ✗ Failed
│
▼
Dead Letter Queue
│
▼
"We'll investigate later"
The DLQ is like a lost and found:
NORMAL PACKAGES:
Mail arrives → Delivered to address → Done!
PROBLEM PACKAGES:
Mail arrives → Wrong address!
│
▼
Lost and Found
│
▼
"Someone will figure this out"
What goes to the DLQ:
- Messages that failed too many times
- Messages that sat in queue too long (expired)
- Messages rejected because queue was full
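Here's a rough sketch of the first case in that list, a message that fails too many times. It runs in memory with plain Python lists. The function name, the `MAX_ATTEMPTS` value of 3, and the fields in each dead-letter record are assumptions for illustration; a real broker would handle expiry and full-queue rejections separately.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed retry budget, tune for your workload


def process_with_dlq(messages, handler, max_attempts=MAX_ATTEMPTS):
    """Run `handler` on each message, retrying failures up to `max_attempts`.

    Returns (delivered, dead_letters). Messages that keep failing are parked
    with the failure reason instead of being retried forever.
    """
    pending = deque((message, 1) for message in messages)
    delivered, dead_letters = [], []
    while pending:
        message, attempt = pending.popleft()
        try:
            handler(message)
            delivered.append(message)
        except Exception as exc:
            if attempt >= max_attempts:
                dead_letters.append(
                    {"message": message, "attempts": attempt, "reason": str(exc)}
                )
            else:
                # Requeue at the back so one bad message doesn't block the rest.
                pending.append((message, attempt + 1))
    return delivered, dead_letters
```

Keeping the failure reason next to the message is what makes "we'll investigate later" practical: whoever opens the DLQ can see why each message landed there.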
Monitoring - Know Before It Breaks
You need to see the queue filling up before it crashes.
QUEUE HEALTH DASHBOARD:
emails.critical: ██░░░░░░░░ (20%) ✓ Healthy
emails.high: ████░░░░░░ (40%) ✓ Healthy
emails.normal: ████████░░ (80%) ⚠️ Warning!
emails.low: ██████████ (100%) 🚨 FULL!
ALERTS:
⚠️ emails.normal at 80% capacity for 5 minutes
🚨 emails.low queue full, messages being dropped
What to alert on:
GREEN (0-60%): No action needed
YELLOW (60-80%): "Getting busy, watching it"
ORANGE (80-95%): "Add more workers or reduce load"
RED (95-100%): "Messages being dropped!"
The Key Insight
Here's what I learned from that Black Friday disaster:
Queues don't solve capacity problems. They defer them.
WITHOUT QUEUE:
1000 requests → System handles 100 → 900 immediately fail
WITH QUEUE:
1000 requests → Queue holds 900 → ...then what?
If you can't process them EVENTUALLY, they still fail.
Just later.
A queue is temporary storage, not magic. If producers consistently outpace consumers, the queue will fill up. Every time.
The real solutions are:
- More consumers (process faster)
- Backpressure (slow down producers)
- Load shedding (drop low-priority work)
- Rate limiting (spread the load over time)
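Rate limiting, the last item in that list, is the easiest to sketch on its own. The version below paces a producer to a fixed number of messages per second. `paced_publish` and its parameters are hypothetical; the `sleep` and `clock` arguments exist only so the pacing can be tested without actually waiting.

```python
import time


def paced_publish(messages, publish, per_second,
                  sleep=time.sleep, clock=time.monotonic):
    """Call `publish(message)` for each message, at most `per_second` per second."""
    if per_second <= 0:
        raise ValueError("per_second must be positive")
    interval = 1.0 / per_second
    next_allowed = clock()
    for message in messages:
        wait = next_allowed - clock()
        if wait > 0:
            sleep(wait)
        publish(message)
        # Schedule from "now" if we fell behind, so a slow publish
        # is never followed by a catch-up burst.
        next_allowed = max(next_allowed, clock()) + interval
```

A campaign job would call something like `paced_publish(recipients, queue_email, per_second=100)`, where `queue_email` is whatever function hands one message to your broker.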
Key Lessons
Lesson 1: Design for Overload First
Don't wait until Black Friday to think about capacity. Plan for 10x traffic on day one.
Lesson 2: Not All Messages Are Equal
Critical messages should never be dropped. Low-priority messages can wait or be dropped when needed.
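Here's one way that lesson might look in code: a single priority queue where each tier has its own cap and anything over the cap is shed. This is a sketch, not the system described above. The class and method names are made up, and the default limits are borrowed from the flow diagram earlier in the post.

```python
import heapq
import itertools
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0
    HIGH = 1
    NORMAL = 2
    LOW = 3


# None means "no cap". Values taken from the per-priority flow diagram.
DEFAULT_LIMITS = {
    Priority.CRITICAL: None,
    Priority.HIGH: 50_000,
    Priority.NORMAL: 100_000,
    Priority.LOW: 10_000,
}


class PriorityMailQueue:
    def __init__(self, limits=None):
        self.limits = dict(DEFAULT_LIMITS if limits is None else limits)
        self.counts = {priority: 0 for priority in Priority}
        self._heap = []
        self._order = itertools.count()  # tie-breaker: FIFO within a tier

    def offer(self, priority, email):
        """Queue the email. Returns False if it was shed because its tier is full."""
        limit = self.limits[priority]
        if limit is not None and self.counts[priority] >= limit:
            return False
        heapq.heappush(self._heap, (priority, next(self._order), email))
        self.counts[priority] += 1
        return True

    def take(self):
        """Return the most urgent queued email, or None if nothing is waiting."""
        if not self._heap:
            return None
        priority, _, email = heapq.heappop(self._heap)
        self.counts[priority] -= 1
        return email
```

Because the critical tier has no cap, a flood of newsletters can only ever fill the low tier. Password resets still get in, and they still come out first.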
Lesson 3: Backpressure Is Your Friend
Telling producers to slow down is better than crashing when they don't.
Lesson 4: Monitor Before You Crash
If you can see the queue filling up, you can react before it's too late.
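A sketch of that early warning, using the alert bands from the monitoring section. The function name is made up, and since the bands share their edges (60, 80, 95), I've assumed a boundary value belongs to the more severe band.

```python
def queue_health(depth, capacity):
    """Map a queue's fill level to GREEN, YELLOW, ORANGE, or RED."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    percent_full = 100.0 * depth / capacity
    if percent_full >= 95:
        return "RED"     # messages are being dropped, or about to be
    if percent_full >= 80:
        return "ORANGE"  # add more workers or reduce load
    if percent_full >= 60:
        return "YELLOW"  # getting busy, keep watching
    return "GREEN"       # no action needed


print(queue_health(depth=8_000, capacity=10_000))
```

Run it against each queue on a schedule and page someone on ORANGE, not RED. By RED it's already a cleanup job.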
Quick Reference
Queue strategies:
When full, what happens to new messages?
drop-head: Drop oldest (for expiring data)
reject-publish: Reject new (for backpressure)
dead-letter: Move to DLQ (for investigation)
Priority levels:
CRITICAL: Never drop, unlimited queue
HIGH: Rarely drop, generous queue
NORMAL: Can drop when overloaded
LOW: Aggressive dropping okay
Warning signs:
⚠️ Queue depth steadily increasing
⚠️ Memory usage climbing
⚠️ Workers falling behind
🚨 Messages being rejected
🚨 Dead letter queue growing
Summary
THE PROBLEM:
Marketing launched a 10x campaign without warning
Queue filled up, system crashed
WHY IT HAPPENED:
- No queue limits (grew until out of memory)
- No backpressure (kept accepting work)
- No priority (marketing emails treated same as password resets)
- No monitoring (didn't see it coming)
THE FIX:
1. Set queue size limits
2. Implement backpressure (slow down when busy)
3. Use priorities (critical emails always go through)
4. Add dead letter queues (don't lose messages)
5. Monitor queue depth (alert before full)
THE RESULT:
10x traffic → Graceful degradation
Critical emails always sent
Low-priority emails delayed (not lost)
System stays running
The question isn't "what's the right queue size?" It's "what happens when you exceed it?" Design for that, and the queue size takes care of itself.
Related Reading
- Retry Storms - When failed requests multiply and overwhelm your queues
- Batch to Event-Driven - How event-driven architecture spreads load more evenly
- Celery Memory Leaks - When your queue workers consume memory and crash
