
Why Did My Celery Workers Keep Dying at 3am? (Debugging Python Memory Leaks)

#Python #Celery #Debugging #Production #Memory #Django

The Mystery

Every morning for a week, I woke up to alerts. My background workers were dead. Thousands of emails stuck in a queue. Users complaining.

The weird part? It always happened around 3am. Never during the day.

Let me explain what was happening and why - in simple terms.


What Is a "Worker" Anyway?

Think of a worker like a factory employee.

YOUR APP                          THE WORKER
┌──────────┐                      ┌──────────┐
│          │   "Send this email"  │          │
│  Website │  ────────────────►   │  Worker  │
│          │                      │          │
└──────────┘                      └──────────┘
 
You (the website) give tasks to the worker.
The worker does them in the background.
You don't wait - you move on immediately.

Why use workers?

Imagine you're at a restaurant:

  • Without workers: You order food, then stand at the kitchen door waiting 20 minutes
  • With workers: You order food, sit down, and someone brings it when ready

Workers let your website stay fast while slow stuff (emails, reports, image processing) happens in the background.
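Here's roughly what that looks like in code. A minimal sketch (the task name and IDs are made up for illustration): the website calls .delay(), which just drops a message on the queue and returns immediately; a worker picks it up later.

# tasks.py - a minimal sketch (names are made up for illustration)
from celery import shared_task

@shared_task
def send_welcome_email(user_id):
    # This runs inside a worker process, not inside the web request
    print(f"Sending welcome email to user {user_id}")

# In the website code: this returns immediately,
# and a worker picks up the task in the background.
send_welcome_email.delay(user_id=42)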


The Problem - Workers Get "Tired"

Here's what I didn't understand at first:

A worker is just a program running on a computer. And programs use memory.

Think of memory like a desk:

FRESH WORKER (just started)
 
┌─────────────────────────────────────────────┐
│                                             │
│   Clean desk!                               │
│   Plenty of space to work.                  │
│                                             │
│            📋                               │
│         (one task)                          │
│                                             │
└─────────────────────────────────────────────┘

Every time the worker does a task, it uses some desk space. When it's done, it should clean up.

But here's the problem: It doesn't clean up perfectly.

AFTER 1,000 TASKS
 
┌─────────────────────────────────────────────┐
│  📄 📋 📝 📄 📋 📄 📝 📋 📄 📝 📋 📄 📋   │
│  📝 📄 📋 📝 📄 📋 📝 📄 📋 📝 📄 📋 📝   │
│                                             │
│   Messy desk!                               │
│   Old papers piling up.                     │
│   Still works, but slower.                  │
│                                             │
└─────────────────────────────────────────────┘
 
 
AFTER 100,000 TASKS (3am)
 
┌─────────────────────────────────────────────┐
│ 📄📋📝📄📋📄📝📋📄📝📋📄📋📝📄📋📝📄📋📝 │
│ 📋📝📄📋📝📄📋📝📄📋📝📄📋📝📄📋📝📄📋📝 │
│ 📄📋📝📄📋📄📝📋📄📝📋📄📋📝📄📋📝📄📋📝 │
│ 📋📝📄📋📝📄📋📝📄📋📝📄📋📝📄📋📝📄📋📝 │
│                                             │
│   💀 DESK OVERFLOW! CRASH!                  │
│                                             │
└─────────────────────────────────────────────┘

Why 3am? Because by then, the worker had been running for ~18 hours, processing hundreds of thousands of tasks. The "desk" finally overflowed.
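In code, that slow pile-up often looks like this (a toy sketch, not my actual task code): anything stored at module level outlives every single task.

# A toy sketch (not my real task code): module-level state outlives every task
processed_ids = []

def handle_task(payload):
    # The task itself finishes fine...
    processed_ids.append(payload["id"])
    # ...but this list is never cleared, so it keeps growing
    # for the entire life of the worker process.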


Why Doesn't Python Clean Up?

You might think: "But Python has garbage collection! It should clean up automatically!"

You're right. But here's the thing most people don't realize:

Python cleans up for Python. Not for the operating system.

Let me explain with an analogy:

IMAGINE A FILING CABINET
 
┌─────────────────────────┐
│ ┌───┐ ┌───┐ ┌───┐ ┌───┐ │
│ │ A │ │ B │ │ C │ │   │ │  ← Cabinet has 4 drawers
│ └───┘ └───┘ └───┘ └───┘ │    (memory from operating system)
└─────────────────────────┘
 
You store files A, B, C.
 
Now delete file B:
 
┌─────────────────────────┐
│ ┌───┐ ┌───┐ ┌───┐ ┌───┐ │
│ │ A │ │ 🗑 │ │ C │ │   │ │  ← Drawer B is now EMPTY
│ └───┘ └───┘ └───┘ └───┘ │    but still EXISTS
└─────────────────────────┘
 
Python: "Great! I can reuse drawer B!"
Operating System: "I still see 4 drawers being used."

The drawer is empty, but the cabinet is still there. Python won't return that drawer to the OS. It keeps it "just in case."

Over time:

  • Python keeps requesting more drawers
  • Python empties old drawers but doesn't return them
  • Eventually, the room fills with empty cabinets
  • Crash!
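You can actually watch this happen with psutil. Treat the numbers as a rough sketch; how much memory gets handed back depends on your platform and allocator.

import gc
import psutil

def rss_mb():
    return psutil.Process().memory_info().rss / 1024 / 1024

print(f"Before: {rss_mb():.0f}MB")
data = [str(i) for i in range(2_000_000)]   # fill lots of "drawers" with small objects
print(f"Full:   {rss_mb():.0f}MB")
del data
gc.collect()                                # Python frees the objects...
print(f"After:  {rss_mb():.0f}MB")          # ...but RSS usually doesn't fall back to "Before"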

What Was Actually Leaking?

In my case, three things were piling up:

Database Connections

Every time my worker talked to the database, it opened a connection. Like making a phone call.

TASK 1: "Hey database, get me email #123"
        → Opens phone line 1 ☎️
 
TASK 2: "Hey database, get me email #456"
        → Opens phone line 2 ☎️ (didn't hang up line 1!)
 
TASK 3: "Hey database, get me email #789"
        → Opens phone line 3 ☎️ (still not hanging up!)
 
...after 1000 tasks...
 
☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️
ALL LINES BUSY! CAN'T MAKE NEW CALLS!
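You can count the busy "phone lines" yourself. This diagnostic sketch assumes PostgreSQL (adjust the query for your database) and runs from a Django shell:

# Diagnostic sketch, assuming PostgreSQL (adjust the query for your database)
from django.db import connection

with connection.cursor() as cursor:
    cursor.execute("SELECT count(*) FROM pg_stat_activity;")
    open_connections = cursor.fetchone()[0]
    print(f"Open database connections: {open_connections}")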

Cached Data

My code was "remembering" things to be faster:

# This looks innocent
saved_templates = {}
 
def get_template(name):
    if name not in saved_templates:
        saved_templates[name] = load_from_disk(name)
    return saved_templates[name]

But over 100,000 tasks with different templates:

Hour 1:   saved_templates = {template1}
Hour 5:   saved_templates = {template1, template2, ..., template50}
Hour 10:  saved_templates = {template1, template2, ..., template200}
Hour 18:  saved_templates = {template1, template2, ..., template2000} 💀

It never forgot anything!
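Python's standard-library tracemalloc is good at exposing caches like this. Here's a rough sketch of how you might use it inside a long-running worker:

# A rough sketch: tracemalloc shows which lines keep allocating memory
import tracemalloc

tracemalloc.start()

# ... let the worker process a batch of tasks ...

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:5]:
    print(stat)   # top 5 allocation sites; a growing cache shows up here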

Memory Fragmentation

Even when Python "freed" memory, the space was fragmented:

CLEAN MEMORY:
[████████████████████████████████]
One big block - nice!
 
 
FRAGMENTED MEMORY (after many tasks):
[█░█░░█░█░░░█░█░░█░░░█░█░░█░█░░█]
 ↑   ↑     ↑
 Used Empty Used
 
Python: "I have lots of empty space!"
Reality: "It's all tiny holes. Can't fit anything big."

The Solution (It's Embarrassingly Simple)

After a week of debugging, the fix was one line of configuration (see the official Celery documentation on worker options):

# Before (workers live forever, accumulate garbage)
celery -A myapp worker --concurrency=4

# After (workers restart fresh every 1000 tasks)
celery -A myapp worker --concurrency=4 --max-tasks-per-child=1000

That's it.

What does this do?

WITHOUT max-tasks-per-child:
 
Worker 1: ═══════════════════════════════════════════► 💀 CRASH
          Start                                        3am
 
WITH max-tasks-per-child=1000:
 
Worker 1: ════════► 🔄 (restart fresh!)
          1000 tasks
 
Worker 1: ════════► 🔄 (restart fresh!)
          1000 tasks
 
Worker 1: ════════► 🔄 (restart fresh!)
          1000 tasks
 
...forever, always healthy!

Think of it like shifts at a factory:

  • Before: One employee works 24 hours straight until they collapse
  • After: Employees work 8-hour shifts, go home, fresh employee takes over

The Complete Fix

Here's my final configuration:

# celery.py
 
from celery import Celery
 
app = Celery('myapp')
 
app.conf.update(
    # 🔄 RESTART workers every 1000 tasks (prevents memory buildup)
    worker_max_tasks_per_child=1000,
 
    # 📥 Don't grab too many tasks at once (prevents memory spikes)
    worker_prefetch_multiplier=4,
 
    # ✅ Don't say "done" until actually done (prevents lost tasks)
    task_acks_late=True,
 
    # 🔁 If worker crashes, retry the task (nothing gets lost)
    task_reject_on_worker_lost=True,
)

And to fix the database connection problem (using Django's close_old_connections):

from celery import shared_task
from django.db import close_old_connections
 
@shared_task
def process_email(email_id):
    # Hang up any old phone calls first
    close_old_connections()
 
    try:
        email = Email.objects.get(id=email_id)
        send_email(email)
    finally:
        # Hang up when done
        close_old_connections()

How Do I Know It's Working?

Now I monitor memory like this:

import psutil

def check_worker_health():
    memory_mb = psutil.Process().memory_info().rss / 1024 / 1024

    if memory_mb > 400:
        print(f"🚨 Critical: Using {memory_mb:.0f}MB!")
    elif memory_mb > 300:
        print(f"⚠️ Warning: Using {memory_mb:.0f}MB")

    return memory_mb
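
If you want this to run automatically, one option (a sketch using Celery's task_postrun signal, assuming check_worker_health is importable in the worker) is to hook it after every task:

# A sketch: run the health check after every task via Celery's task_postrun signal
from celery.signals import task_postrun

@task_postrun.connect
def log_memory_after_task(**kwargs):
    check_worker_health()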

Before the fix:

Memory over time:
100MB → 200MB → 300MB → 400MB → 500MB → 💀
 
Always going UP until crash.

After the fix:

Memory over time:
100MB → 150MB → 🔄 restart → 100MB → 150MB → 🔄 restart → 100MB
 
Stays flat! Never crashes!

Key Lessons

Lesson 1: Long-Running Processes Accumulate Garbage

A Python process that runs for days quietly fills up with:

  • Old database connections
  • Cached data
  • Memory fragments

Solution: Let them restart periodically.

Lesson 2: "Per-Task" Memory Is Misleading

Each task might only use 1MB. But after 100,000 tasks, your process uses 500MB.

The math doesn't add up because of hidden accumulation.

Solution: Monitor total process memory, not per-task memory.

Lesson 3: The Fix Is Often Configuration, Not Code

I spent a week reading code, looking for memory leaks. The fix was adding one flag:

--max-tasks-per-child=1000

Solution: Know your tools' configuration options.


Summary

THE PROBLEM:
Workers ran forever → Memory accumulated → 3am crash
 
THE CAUSE:
- Database connections not closing
- Caches growing forever
- Memory fragmentation
 
THE FIX:
1. Restart workers every N tasks (max-tasks-per-child)
2. Close database connections properly
3. Monitor total memory, not per-task memory
 
THE RESULT:
No more 3am crashes. I sleep peacefully now.

Quick Reference

If your Celery workers crash after running for hours:

# Add this to your celery config
app.conf.worker_max_tasks_per_child = 1000

If you use Django ORM in tasks:

from celery import shared_task
from django.db import close_old_connections

@shared_task
def my_task():
    close_old_connections()
    try:
        ...  # your task code here
    finally:
        close_old_connections()

If you have caches that grow forever (use Python's built-in lru_cache):

from functools import lru_cache
 
@lru_cache(maxsize=100)  # Limit to 100 items!
def get_template(name):
    return load_template(name)

That's it. Simple fixes for a problem that cost me a week of sleep.


Aamir Shahzad

Author

Software Engineer with 7+ years of experience building scalable data systems. Specializing in Django, Python, and applied AI.