The Complaint
"I just listed my car 10 minutes ago. Why can't anyone see it?"
This customer was right to be frustrated. Our "real-time" marketplace had a dirty secret: new listings took up to 15 minutes to appear in search results.
Not because of any technical limitation. Just because we were running a cron job every 15 minutes.
What Is "Batch Processing"?
First, let's understand what we were doing wrong.
Batch processing means: "Collect a bunch of stuff, then process it all at once."
Think of it like a bus schedule:
BATCH PROCESSING (Bus)
You arrive at the bus stop at 10:03.
The bus comes at 10:15.
You wait 12 minutes.
Another person arrives at 10:14.
The same bus picks them up.
They wait 1 minute.
EVERYONE waits for the scheduled bus.
Some wait a lot. Some wait a little.
But everyone waits.
Our system worked the same way:
OUR CRON JOB (Every 15 Minutes)
10:01 - New car uploaded → Wait 14 minutes
10:05 - New car uploaded → Wait 10 minutes
10:10 - New car uploaded → Wait 5 minutes
10:14 - New car uploaded → Wait 1 minute
10:15 - CRON RUNS → All cars processed together
What Is "Event-Driven"?
Event-driven means: "Process each thing as soon as it happens."
Think of it like Uber:
EVENT-DRIVEN (Uber)
You need a ride at 10:03.
You request an Uber at 10:03.
Car arrives at 10:06.
You wait 3 minutes.
Another person needs a ride at 10:14.
They request an Uber at 10:14.
Car arrives at 10:17.
They wait 3 minutes.
EVERYONE gets individual service.
Everyone waits the same short time.
The difference:
BATCH: "Wait for the bus" → Variable wait (1-15 minutes)
EVENT-DRIVEN: "Call an Uber" → Consistent wait (seconds)
Why We Used Batch Processing
It wasn't a bad decision at the time. Batch processing is simpler:
BATCH PROCESSING (Simple)
┌─────────────┐
│  Cron Job   │  ← One scheduled task
│  (10:15)    │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Process    │  ← One server does everything
│  All Files  │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│    Done!    │
└─────────────┘
- One thing to monitor
- One thing to debug
- One place where things can fail
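For context, the whole job was roughly one script that cron kicked off every 15 minutes. The sketch below is a simplified reconstruction, not our exact code; the bucket name, prefix, and process_listing_file helper are placeholders.
# Ran from cron: */15 * * * * python process_uploads.py
import boto3
s3 = boto3.client("s3")
def process_listing_file(key):
    ...  # parse, normalize, and index one listing (details omitted)
def main():
    # Grab everything that piled up since the last run, one file at a time
    resp = s3.list_objects_v2(Bucket="vehicle-uploads", Prefix="raw/")
    for obj in resp.get("Contents", []):
        process_listing_file(obj["Key"])
if __name__ == "__main__":
    main()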
But as we grew, problems appeared:
The Problems with Batch
Problem 1: The 15-Minute Wait
This was the obvious one. Users expected "real-time." We gave them "eventually."
USER EXPECTATION:
"I click submit, my listing appears"
REALITY:
"I click submit, my listing appears... sometime in the next 15 minutes"Problem 2: The Thundering Herd
Every 15 minutes, our server went from sleeping to screaming:
CPU USAGE OVER TIME:
12:00 ─────────── (idle)
12:05 ─────────── (idle)
12:10 ─────────── (idle)
12:15 ████████████████████ CRON STARTS (100% CPU!)
12:16 ████████████████████ (still processing)
12:17 ████████████████████ (still processing)
12:18 ─────────── (done, back to idle)
12:30 ████████████████████ CRON STARTS AGAIN
We had to buy a server big enough to handle the peak, even though it sat idle most of the time.
Problem 3: All-or-Nothing Failures
If the cron job crashed halfway through:
FILES TO PROCESS: 1000
PROCESSED: 1, 2, 3, 4, ... 456 💥 CRASH!
RESULT:
- 456 files processed
- 544 files stuck
- Everything waits for the next run
One bad file could delay everything.
Problem 4: No Parallelism
Our cron job processed files one at a time:
FILE 1: Process... done (2 seconds)
FILE 2: Process... done (2 seconds)
FILE 3: Process... done (2 seconds)
...
FILE 1000: Process... done (2 seconds)
TOTAL TIME: 2000 seconds = 33 minutes
But the cron runs every 15 minutes!
We're falling behind!
The Event-Driven Solution
Here's how we fixed it. Instead of one big job, we made many small jobs:
BEFORE (Batch):
Files → Wait → Cron Job → Process All → Done
          ^
          15 minutes of waiting
AFTER (Event-Driven):
File 1 uploaded → Lambda triggered → Processed → Done (5 seconds)
File 2 uploaded → Lambda triggered → Processed → Done (5 seconds)
File 3 uploaded → Lambda triggered → Processed → Done (5 seconds)
Each file processed immediately!
Think of it like a restaurant:
BATCH (Cafeteria):
- Wait until 12:15
- Everyone gets food at once
- Kitchen overwhelmed, then empty
EVENT-DRIVEN (Normal Restaurant):
- Order when you're ready
- Kitchen makes your food immediately
- Steady flow of orders, steady work
How It Works
Step 1: Something Happens
A scraper uploads a new file to our S3 bucket.
SCRAPER: "Here's a new file: car-listings-001.json"
→ Uploads to S3
Step 2: S3 Triggers Lambda
S3 Event Notifications automatically trigger our Lambda function:
S3: "Hey Lambda! A new file just appeared!"
{
bucket: "vehicle-uploads",
key: "raw/car-listings-001.json",
size: 25000
}
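The trigger itself is one-time bucket configuration. Here's a minimal sketch with boto3; the bucket name, Lambda ARN, and raw/ prefix are placeholders, the Lambda also needs an invoke permission for S3 (omitted), and in practice this wiring usually lives in infrastructure-as-code.
import boto3
s3 = boto3.client("s3")
# Ask S3 to invoke the Lambda whenever a new object lands under raw/
s3.put_bucket_notification_configuration(
    Bucket="vehicle-uploads",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:queue-upload",
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {"Key": {"FilterRules": [{"Name": "prefix", "Value": "raw/"}]}},
        }]
    },
)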
Step 3: Lambda Queues the Work
Lambda doesn't do heavy processing. It just validates and queues to Amazon SQS:
LAMBDA: "Let me check this file..."
- File exists? ✓
- Not too big? ✓
- Valid format? ✓
LAMBDA: "Okay, adding to processing queue."
→ Sends message to SQS
Why not process in Lambda directly?
LAMBDA LIMITS:
- 15 minute timeout
- Limited memory
- Pay per millisecond
BETTER APPROACH:
Lambda just queues (fast, cheap)
Workers do heavy lifting (scalable)
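Put together, the handler is only a few lines. This is a sketch rather than our exact code; the queue URL, the size limit, and the .json check are placeholder validation rules.
import json
import boto3
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/vehicle-processing"  # placeholder
MAX_SIZE = 50 * 1024 * 1024  # placeholder size limit
def handler(event, context):
    # One S3 notification can carry several records
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        size = record["s3"]["object"]["size"]
        # Cheap validation only; reject obviously bad files
        if size > MAX_SIZE or not key.endswith(".json"):
            continue  # real code would log and alert here
        # Hand the heavy work to the queue; Lambda stays fast and cheap
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"bucket": bucket, "key": key}),
        )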
Step 4: Workers Process the Queue
Celery workers pick up messages from the queue:
WORKER 1: "I'll take message 1" → Processing...
WORKER 2: "I'll take message 2" → Processing...
WORKER 3: "I'll take message 3" → Processing...
WORKER 4: "I'll take message 4" → Processing...
All happening at the same time!
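On the worker side, a Celery app pointed at the same queue does the heavy lifting. A minimal sketch, with a placeholder queue name (and glossing over the detail that messages must be enqueued in the format the workers expect):
from celery import Celery
# Celery can use SQS as its broker; AWS credentials come from the environment
app = Celery("ingest", broker="sqs://")
app.conf.task_default_queue = "vehicle-processing"   # placeholder queue name
app.conf.worker_prefetch_multiplier = 1              # one message at a time per worker
# Tasks registered on this app (like the process_file task in the
# error-handling section below) run inside each worker. Started with:
#   celery -A ingest worker --concurrency=50
Scaling up is then just a matter of running more workers.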
Step 5: Done!
File uploaded → Searchable in 30 seconds.
Not 15 minutes. 30 seconds.
The Visual Difference
BEFORE (Batch):
Time:   10:00    10:15    10:30    10:45
          │        │        │        │
          ▼        ▼        ▼        ▼
         idle     BURST    idle     BURST
                  ████              ████
                  ████              ████
                  ████              ████
Work piles up → Explosion of activity → Quiet → Repeat
AFTER (Event-Driven):
Time:   10:00    10:15    10:30    10:45
          │        │        │        │
          ▼        ▼        ▼        ▼
          ██       ██       ██       ██
          ██       ██       ██       ██
          ██       ██       ██       ██
Steady, predictable work throughout the day
Error Handling Is Different
This was a big mindset shift.
Batch mindset: "If something fails, stop and investigate."
# Batch processing
for file in files:
    process(file)  # If this throws, everything stops!
Event-driven mindset: "If something fails, retry it. Don't stop the world."
# Event-driven processing
@app.task(bind=True, max_retries=3)
def process_file(self, key):
    try:
        process(key)  # Process this one file
    except Exception as exc:
        # Retry later. Other files keep processing.
        raise self.retry(exc=exc)
Visual:
BATCH (One Bad Apple):
File 1 ✓
File 2 ✓
File 3 ✗ ERROR!
File 4 → Never processed
File 5 → Never processed
File 6 → Never processed
Everything stops.
EVENT-DRIVEN (Isolated Failures):
File 1 ✓ → Done
File 2 ✓ → Done
File 3 ✗ → Retry later
File 4 ✓ → Done
File 5 ✓ → Done
File 6 ✓ → Done
Bad file retries. Others continue.
The Dead Letter Queue
What about files that keep failing?
FILE PROCESSING ATTEMPTS:
Attempt 1: ✗ Failed (network timeout)
Attempt 2: ✗ Failed (network timeout)
Attempt 3: ✗ Failed (network timeout)
SYSTEM: "This file has failed 3 times."
"Moving to Dead Letter Queue."
"Alerting the ops team."The Dead Letter Queue (DLQ) is like a hospital waiting room:
NORMAL QUEUE:            DEAD LETTER QUEUE:
"Healthy patients"       "Patients who need special attention"
Most files go here       Problem files end up here
Processed quickly        Investigated by humans
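One common way to get this behavior is a redrive policy on the main SQS queue. A rough sketch (queue names are placeholders; the maxReceiveCount of 3 mirrors the example above):
import json
import boto3
sqs = boto3.client("sqs")
# Create the DLQ and look up its ARN
dlq_url = sqs.create_queue(QueueName="vehicle-processing-dlq")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]
# After 3 failed receives, SQS moves the message to the DLQ automatically
sqs.set_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/vehicle-processing",  # placeholder
    Attributes={
        "RedrivePolicy": json.dumps({"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "3"})
    },
)
# An alarm on the DLQ's message count is what pages the ops team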
The Results
METRIC                    BEFORE             AFTER
───────────────────────────────────────────────────────────
Latency (upload→search)   15 minutes         30 seconds
CPU pattern               Spiky (0→100%)     Steady (~40%)
Failure recovery          Wait for cron      Automatic retry
Parallel processing       1 at a time        50+ at once
Server cost               Over-provisioned   Pay-per-use
The biggest win? Users got real-time updates.
When NOT to Use Event-Driven
Event-driven isn't always better. Use batch processing when:
✓ Order matters
"Process files in exact upload order"
✓ You need transactions across items
"Either all files succeed or all fail"
✓ Volume is low and predictable
"We get 10 files per day"
✓ Simplicity is more important than speed
"This is an internal tool, latency doesn't matter"Use event-driven when:
✓ Latency matters
"Users expect real-time updates"
✓ Load is unpredictable
"Sometimes 10 files, sometimes 10,000"
✓ Failures should be isolated
"One bad file shouldn't stop everything"
✓ You need to scale horizontally
"Add more workers when needed"Key Lessons
Lesson 1: Batch Processing Is a Latency Trap
The moment you say "runs every X minutes," you've capped your responsiveness. And as load grows, that batch job becomes a time bomb.
Lesson 2: Event-Driven Is Harder But Scales Better
More moving pieces. More things to monitor. But it scales horizontally and fails gracefully.
Lesson 3: The Trigger for Migration
When you catch yourself saying "users will just have to wait for the next batch run," that's when it's time to change.
Quick Reference
When to migrate from batch to event-driven:
✗ "Listings take 15 minutes to appear"
✗ "Our server maxes out every hour"
✗ "One bad file breaks everything"
✗ "We can't keep up with volume"
If you're saying these things, consider event-driven.
The basic pattern:
1. Event happens (file uploaded)
2. Trigger fires (S3 notification)
3. Work queued (SQS message)
4. Workers process (Celery tasks)
5. Failures retry automatically
6. Problem items go to DLQ
Summary
THE PROBLEM:
New listings took 15 minutes to appear
(Cron job ran every 15 minutes)
WHY IT HAPPENED:
Batch processing waits for scheduled runs
Like waiting for a bus vs calling an Uber
THE FIX:
Switch to event-driven architecture
Process each file as soon as it arrives
S3 → Lambda → SQS → Celery Workers
THE RESULT:
15 minutes → 30 seconds
Spiky CPU → Steady CPU
One failure stops all → Isolated failures
Batch processing is comfortable. It's predictable. It's also a trap. Real-time users deserve real-time processing.
Related Reading
- Queue Sizing and Backpressure - Designing queues that don't overflow under load
- Retry Storms - Handling failures in event-driven systems without cascading crashes
- PostgreSQL Full-Text Search Limits - When we needed faster search to match real-time ingestion
