The Complaint
"I just listed my car 10 minutes ago. Why can't anyone see it?"
This customer was right to be frustrated. Our "real-time" marketplace had a dirty secret: new listings took up to 15 minutes to appear in search results.
Not because of any technical limitation. Just because we were running a cron job every 15 minutes.
What Is "Batch Processing"?
First, let's understand what we were doing wrong.
Batch processing means: "Collect a bunch of stuff, then process it all at once."
Think of it like a bus schedule:
BATCH PROCESSING (Bus)
You arrive at the bus stop at 10:03.
The bus comes at 10:15.
You wait 12 minutes.
Another person arrives at 10:14.
The same bus picks them up.
They wait 1 minute.
EVERYONE waits for the scheduled bus.
Some wait a lot. Some wait a little.
But everyone waits.
Our system worked the same way:
OUR CRON JOB (Every 15 Minutes)
10:01 - New car uploaded → Wait 14 minutes
10:05 - New car uploaded → Wait 10 minutes
10:10 - New car uploaded → Wait 5 minutes
10:14 - New car uploaded → Wait 1 minute
10:15 - CRON RUNS → All cars processed together
What Is "Event-Driven"?
Event-driven means: "Process each thing as soon as it happens."
Think of it like Uber:
EVENT-DRIVEN (Uber)
You need a ride at 10:03.
You request an Uber at 10:03.
Car arrives at 10:06.
You wait 3 minutes.
Another person needs a ride at 10:14.
They request an Uber at 10:14.
Car arrives at 10:17.
They wait 3 minutes.
EVERYONE gets individual service.
Everyone waits the same short time.
The difference:
BATCH: "Wait for the bus" → Variable wait (1-15 minutes)
EVENT-DRIVEN: "Call an Uber" → Consistent wait (seconds)
Why We Used Batch Processing
It wasn't a bad decision at the time. Batch processing is simpler:
BATCH PROCESSING (Simple)
┌─────────────┐
│  Cron Job   │  ← One scheduled task
│  (10:15)    │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Process    │  ← One server does everything
│  All Files  │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│    Done!    │
└─────────────┘
- One thing to monitor
- One thing to debug
- One place where things can fail
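For context, the whole job was roughly one script that cron kicked off every 15 minutes. The sketch below is a simplified reconstruction, not our exact code; the bucket name, prefix, and process_listing_file helper are placeholders.
# Ran from cron: */15 * * * * python process_uploads.py
import boto3
s3 = boto3.client("s3")
def process_listing_file(key):
    ...  # parse, normalize, and index one listing (details omitted)
def main():
    # Grab everything that piled up since the last run, one file at a time
    resp = s3.list_objects_v2(Bucket="vehicle-uploads", Prefix="raw/")
    for obj in resp.get("Contents", []):
        process_listing_file(obj["Key"])
if __name__ == "__main__":
    main()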
But as we grew, problems appeared:
The Problems with Batch
Problem 1: The 15-Minute Wait
This was the obvious one. Users expected "real-time." We gave them "eventually."
USER EXPECTATION:
"I click submit, my listing appears"
REALITY:
"I click submit, my listing appears... sometime in the next 15 minutes"Problem 2: The Thundering Herd
Every 15 minutes, our server went from sleeping to screaming:
CPU USAGE OVER TIME:
12:00 ─────────── (idle)
12:05 ─────────── (idle)
12:10 ─────────── (idle)
12:15 ████████████████████ CRON STARTS (100% CPU!)
12:16 ████████████████████ (still processing)
12:17 ████████████████████ (still processing)
12:18 ─────────── (done, back to idle)
12:30 ████████████████████ CRON STARTS AGAIN
We had to buy a server big enough to handle the peak, even though it sat idle most of the time.
Problem 3: All-or-Nothing Failures
If the cron job crashed halfway through:
FILES TO PROCESS: 1000
PROCESSED: 1, 2, 3, 4, ... 456 💥 CRASH!
RESULT:
- 456 files processed
- 544 files stuck
- Everything waits for the next run
One bad file could delay everything.
Problem 4: No Parallelism
Our cron job processed files one at a time:
FILE 1: Process... done (2 seconds)
FILE 2: Process... done (2 seconds)
FILE 3: Process... done (2 seconds)
...
FILE 1000: Process... done (2 seconds)
TOTAL TIME: 2000 seconds = 33 minutes
But the cron runs every 15 minutes!
We're falling behind!
The Event-Driven Solution
Here's how we fixed it. Instead of one big job, we made many small jobs:
BEFORE (Batch):
Files → Wait → Cron Job → Process All → Done
          ^
          15 minutes of waiting
AFTER (Event-Driven):
File 1 uploaded → Lambda triggered → Processed → Done (5 seconds)
File 2 uploaded → Lambda triggered → Processed → Done (5 seconds)
File 3 uploaded → Lambda triggered → Processed → Done (5 seconds)
Each file processed immediately!
Think of it like a restaurant:
BATCH (Cafeteria):
- Wait until 12:15
- Everyone gets food at once
- Kitchen overwhelmed, then empty
EVENT-DRIVEN (Normal Restaurant):
- Order when you're ready
- Kitchen makes your food immediately
- Steady flow of orders, steady work
How It Works
Step 1: Something Happens
A scraper uploads a new file to our S3 bucket.
SCRAPER: "Here's a new file: car-listings-001.json"
→ Uploads to S3
Step 2: S3 Triggers Lambda
S3 Event Notifications automatically trigger our Lambda function:
S3: "Hey Lambda! A new file just appeared!"
{
bucket: "vehicle-uploads",
key: "raw/car-listings-001.json",
size: 25000
}
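The trigger itself is one-time bucket configuration. Here's a minimal sketch with boto3; the bucket name, Lambda ARN, and raw/ prefix are placeholders, the Lambda also needs an invoke permission for S3 (omitted), and in practice this wiring usually lives in infrastructure-as-code.
import boto3
s3 = boto3.client("s3")
# Ask S3 to invoke the Lambda whenever a new object lands under raw/
s3.put_bucket_notification_configuration(
    Bucket="vehicle-uploads",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:queue-upload",
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {"Key": {"FilterRules": [{"Name": "prefix", "Value": "raw/"}]}},
        }]
    },
)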
Step 3: Lambda Queues the Work
Lambda doesn't do heavy processing. It just validates and queues to Amazon SQS:
LAMBDA: "Let me check this file..."
- File exists? ✓
- Not too big? ✓
- Valid format? ✓
LAMBDA: "Okay, adding to processing queue."
→ Sends message to SQS
Why not process in Lambda directly?
LAMBDA LIMITS:
- 15 minute timeout
- Limited memory
- Pay per millisecond
BETTER APPROACH:
Lambda just queues (fast, cheap)
Workers do heavy lifting (scalable)
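Put together, the handler is only a few lines. This is a sketch rather than our exact code; the queue URL, the size limit, and the .json check are placeholder validation rules.
import json
import boto3
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/vehicle-processing"  # placeholder
MAX_SIZE = 50 * 1024 * 1024  # placeholder size limit
def handler(event, context):
    # One S3 notification can carry several records
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        size = record["s3"]["object"]["size"]
        # Cheap validation only; reject obviously bad files
        if size > MAX_SIZE or not key.endswith(".json"):
            continue  # real code would log and alert here
        # Hand the heavy work to the queue; Lambda stays fast and cheap
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"bucket": bucket, "key": key}),
        )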
Step 4: Workers Process the Queue
Celery workers pick up messages from the queue:
WORKER 1: "I'll take message 1" → Processing...
WORKER 2: "I'll take message 2" → Processing...
WORKER 3: "I'll take message 3" → Processing...
WORKER 4: "I'll take message 4" → Processing...
All happening at the same time!
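On the worker side, a Celery app pointed at the same queue does the heavy lifting. A minimal sketch, with a placeholder queue name (and glossing over the detail that messages must be enqueued in the format the workers expect):
from celery import Celery
# Celery can use SQS as its broker; AWS credentials come from the environment
app = Celery("ingest", broker="sqs://")
app.conf.task_default_queue = "vehicle-processing"   # placeholder queue name
app.conf.worker_prefetch_multiplier = 1              # one message at a time per worker
# Tasks registered on this app (like the process_file task in the
# error-handling section below) run inside each worker. Started with:
#   celery -A ingest worker --concurrency=50
Scaling up is then just a matter of running more workers.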
Step 5: Done!
File uploaded → Searchable in 30 seconds.
Not 15 minutes. 30 seconds.
The Visual Difference
BEFORE (Batch):
Time:   10:00    10:15    10:30    10:45
          │        │        │        │
          ▼        ▼        ▼        ▼
         idle     BURST    idle     BURST
                  ████              ████
                  ████              ████
                  ████              ████
Work piles up → Explosion of activity → Quiet → Repeat
AFTER (Event-Driven):
Time:   10:00    10:15    10:30    10:45
          │        │        │        │
          ▼        ▼        ▼        ▼
          ██       ██       ██       ██
          ██       ██       ██       ██
          ██       ██       ██       ██
Steady, predictable work throughout the day
Error Handling Is Different
This was a big mindset shift.
Batch mindset: "If something fails, stop and investigate."
# Batch processing
for file in files:
    process(file)  # If this throws, everything stops!
Event-driven mindset: "If something fails, retry it. Don't stop the world."
# Event-driven processing
@app.task(bind=True, max_retries=3)
def process_file(self, key):
    try:
        process(key)  # Process this one file
    except Exception as exc:
        # Retry later. Other files keep processing.
        raise self.retry(exc=exc)
Visual:
BATCH (One Bad Apple):
File 1 ✓
File 2 ✓
File 3 ✗ ERROR!
File 4 → Never processed
File 5 → Never processed
File 6 → Never processed
Everything stops.
EVENT-DRIVEN (Isolated Failures):
File 1 ✓ → Done
File 2 ✓ → Done
File 3 ✗ → Retry later
File 4 ✓ → Done
File 5 ✓ → Done
File 6 ✓ → Done
Bad file retries. Others continue.
The Dead Letter Queue
What about files that keep failing?
FILE PROCESSING ATTEMPTS:
Attempt 1: ✗ Failed (network timeout)
Attempt 2: ✗ Failed (network timeout)
Attempt 3: ✗ Failed (network timeout)
SYSTEM: "This file has failed 3 times."
"Moving to Dead Letter Queue."
"Alerting the ops team."The Dead Letter Queue (DLQ) is like a hospital waiting room:
NORMAL QUEUE:            DEAD LETTER QUEUE:
"Healthy patients"       "Patients who need special attention"
Most files go here       Problem files end up here
Processed quickly        Investigated by humans
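One common way to get this behavior is a redrive policy on the main SQS queue. A rough sketch (queue names are placeholders; the maxReceiveCount of 3 mirrors the example above):
import json
import boto3
sqs = boto3.client("sqs")
# Create the DLQ and look up its ARN
dlq_url = sqs.create_queue(QueueName="vehicle-processing-dlq")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]
# After 3 failed receives, SQS moves the message to the DLQ automatically
sqs.set_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/vehicle-processing",  # placeholder
    Attributes={
        "RedrivePolicy": json.dumps({"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "3"})
    },
)
# An alarm on the DLQ's message count is what pages the ops team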
The Results
METRIC                    BEFORE             AFTER
───────────────────────────────────────────────────────────
Latency (upload→search)   15 minutes         30 seconds
CPU pattern               Spiky (0→100%)     Steady (~40%)
Failure recovery          Wait for cron      Automatic retry
Parallel processing       1 at a time        50+ at once
Server cost               Over-provisioned   Pay-per-use
The biggest win? Users got real-time updates.
When NOT to Use Event-Driven
Event-driven isn't always better. Use batch processing when:
✓ Order matters
"Process files in exact upload order"
✓ You need transactions across items
"Either all files succeed or all fail"
✓ Volume is low and predictable
"We get 10 files per day"
✓ Simplicity is more important than speed
"This is an internal tool, latency doesn't matter"Use event-driven when:
✓ Latency matters
"Users expect real-time updates"
✓ Load is unpredictable
"Sometimes 10 files, sometimes 10,000"
✓ Failures should be isolated
"One bad file shouldn't stop everything"
✓ You need to scale horizontally
"Add more workers when needed"Key Lessons
Lesson 1: Batch Processing Is a Latency Trap
The moment you say "runs every X minutes," you've capped your responsiveness. And as load grows, that batch job becomes a time bomb.
Lesson 2: Event-Driven Is Harder But Scales Better
More moving pieces. More things to monitor. But it scales horizontally and fails gracefully.
Lesson 3: The Trigger for Migration
When you catch yourself saying "users will just have to wait for the next batch run," that's when it's time to change.
Quick Reference
When to migrate from batch to event-driven:
✗ "Listings take 15 minutes to appear"
✗ "Our server maxes out every hour"
✗ "One bad file breaks everything"
✗ "We can't keep up with volume"
If you're saying these things, consider event-driven.
The basic pattern:
1. Event happens (file uploaded)
2. Trigger fires (S3 notification)
3. Work queued (SQS message)
4. Workers process (Celery tasks)
5. Failures retry automatically
6. Problem items go to DLQ
Summary
THE PROBLEM:
New listings took 15 minutes to appear
(Cron job ran every 15 minutes)
WHY IT HAPPENED:
Batch processing waits for scheduled runs
Like waiting for a bus vs calling an Uber
THE FIX:
Switch to event-driven architecture
Process each file as soon as it arrives
S3 → Lambda → SQS → Celery Workers
THE RESULT:
15 minutes → 30 seconds
Spiky CPU → Steady CPU
One failure stops all → Isolated failures
Batch processing is comfortable. It's predictable. It's also a trap. Real-time users deserve real-time processing.
Related Reading
- Queue Sizing and Backpressure - Designing queues that don't overflow under load
- Retry Storms - Handling failures in event-driven systems without cascading crashes
- PostgreSQL Full-Text Search Limits - When we needed faster search to match real-time ingestion
