
Why Was Our AI Taking 1.2 Seconds to Write an Email? (Optimizing LLM Validation)

#LLM #Python #Performance #Pydantic #OpenAI #Optimization

The Slowness

Our AI email system was taking 1.2 seconds to write each email.

The AI itself took 800ms. That's expected; it's doing complex work.

But we were adding another 400ms just to check if the email was okay.

That's like baking a cake in 8 minutes and then spending 4 minutes deciding if it looks good enough.


What Is "Validation"?

First, let's understand what we were doing.

Validation means checking that something is correct before using it.

AI GENERATES:
{
  "subject": "Special offer just for you!",
  "body": "Hi John, check out our new product...",
  "price": "$29.99",
  "product_id": "SKU-12345"
}
 
VALIDATION CHECKS:
✓ Is the subject under 60 characters?
✓ Does the product ID actually exist?
✓ Is the price correct?
✓ No inappropriate content?
✓ Personalization looks right?

Each check takes time. We had five checks, ranging from 5ms to 150ms each.


The Problem - One at a Time

Here's how our validation worked:

AI RESPONSE ARRIVES (800ms)

Check 1: Schema validation     →    5ms
Check 2: Product exists?       →  100ms (database lookup)
Check 3: Price correct?        →   50ms (another database lookup)
Check 4: Content safety?       →  150ms (API call to moderation)
Check 5: Personalization OK?   →   50ms (template checking)

DONE!

TOTAL VALIDATION TIME: 5 + 100 + 50 + 150 + 50 = 355ms

Each check waited for the previous one to finish.

Like standing in five separate lines at the DMV.


Solution 1 - Run Checks in Parallel

The realization: Most checks don't depend on each other.

SEQUENTIAL (One at a time):
─────────────────────────────────────────────────────►
│ Check 1 │ Check 2 │ Check 3 │ Check 4 │ Check 5
   5ms       100ms     50ms      150ms     50ms
 
Total: 355ms
 
 
PARALLEL (All at once):
─────────────────────────────────────────►
│ Check 1 │──│ 5ms
│ Check 2 │────────────│ 100ms
│ Check 3 │──────│ 50ms
│ Check 4 │───────────────────│ 150ms
│ Check 5 │──────│ 50ms
 
Total: 150ms (longest check wins)

Think of it like a kitchen:

SEQUENTIAL COOKING:
Chef: "I'll make salad first"        (5 min)
Chef: "Now I'll cook the pasta"      (20 min)
Chef: "Now I'll grill the chicken"   (15 min)
Chef: "Now I'll prepare dessert"     (10 min)
 
Total: 50 minutes
 
 
PARALLEL COOKING:
Chef 1: "I'll make salad"            (5 min)
Chef 2: "I'll cook pasta"            (20 min)  ← Longest task
Chef 3: "I'll grill chicken"         (15 min)
Chef 4: "I'll prepare dessert"       (10 min)
 
Total: 20 minutes (everyone works at once)

Result: 355ms → 150ms. We cut validation time by about 58% just by running checks simultaneously.
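Here's a minimal sketch of what parallel validation looks like with asyncio.gather. The check functions are hypothetical stand-ins that sleep for the latencies measured above; the point is that total time tracks the slowest check, not the sum.

```python
import asyncio
import time

async def check_schema() -> bool:
    await asyncio.sleep(0.005)   # ~5ms schema validation
    return True

async def check_product_exists() -> bool:
    await asyncio.sleep(0.100)   # ~100ms database lookup
    return True

async def check_content_safety() -> bool:
    await asyncio.sleep(0.150)   # ~150ms moderation API call
    return True

async def validate_parallel() -> bool:
    # All three checks start at once; gather waits for the slowest one.
    results = await asyncio.gather(
        check_schema(),
        check_product_exists(),
        check_content_safety(),
    )
    return all(results)

start = time.perf_counter()
ok = asyncio.run(validate_parallel())
elapsed = time.perf_counter() - start
print(ok, f"{elapsed * 1000:.0f}ms")  # close to 150ms, not 255ms
```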


Solution 2 - Not Everything Needs to Block

Here's a key insight:

Some checks MUST pass before we continue. Others can happen in the background.

BLOCKING CHECKS (Must pass now):
─────────────────────────────────
• Is the JSON valid?           (If not, we can't use it at all)
• Is the subject too long?     (Will break email clients)
• Are there template errors?   (Would look unprofessional)
 
These are FAST (1-5ms each) and CRITICAL.
 
 
BACKGROUND CHECKS (Can pass later):
──────────────────────────────────
• Does product exist?          (Rare failure, can fix)
• Content moderation           (Rare issues, can suppress)
• Detailed quality scoring     (Nice to have)
 
These are SLOW (50-150ms each) but NOT urgent.

The new flow:

AI RESPONSE ARRIVES

     ├──► BLOCKING CHECKS (5ms total)
     │    ✓ Schema valid?
     │    ✓ Subject length?
     │    ✓ No template errors?

     ├──► RETURN RESPONSE TO USER (fast!)

     └──► BACKGROUND CHECKS (happen after)
          • Product validation
          • Content moderation
          • Quality scoring
          → If any fail, flag for review

Think of it like airport security:

MUST CHECK NOW (Blocking):
• Do you have a boarding pass?
• Is your ID valid?
→ Can't fly without these.
 
CAN CHECK LATER (Background):
• Does your luggage have prohibited items?
→ We'll catch it, but you can board while we check.

Solution 3 - Remember What You Already Checked

Many validations repeat. Why check the same thing twice?

EMAIL 1: "Check if product SKU-123 exists"
         → Database lookup → YES, exists (100ms)
 
EMAIL 2: "Check if product SKU-123 exists"
         → Database lookup → YES, exists (100ms)
 
EMAIL 3: "Check if product SKU-123 exists"
         → Database lookup → YES, exists (100ms)
 
TOTAL: 300ms for the same answer three times!

With caching:

EMAIL 1: "Check if product SKU-123 exists"
         → Database lookup → YES, exists (100ms)
         → Save result for 1 hour
 
EMAIL 2: "Check if product SKU-123 exists"
         → Check cache → YES, exists (1ms)
 
EMAIL 3: "Check if product SKU-123 exists"
         → Check cache → YES, exists (1ms)
 
TOTAL: 102ms

Think of it like a phone contact list:

WITHOUT CACHING:
"What's Mom's phone number?"
→ Look in paper address book (30 seconds)
 
"What's Mom's phone number?"
→ Look in paper address book (30 seconds)
 
WITH CACHING:
"What's Mom's phone number?"
→ Look in paper address book (30 seconds)
→ Save to phone contacts
 
"What's Mom's phone number?"
→ Check phone contacts (1 second)

Solution 4 - Start Checking Before It's Done

Here's a clever trick: Start validating while the AI is still writing.

NORMAL APPROACH:
 
AI: "Generating... generating... generating... DONE!"
     ─────────────────────────────────────────────────►


                                              Start validation
 
 
STREAMING APPROACH:
 
AI: "Subject: Special offer just for you..."

     Can we validate subject NOW? Yes!
     ↓ (Continue generating)
AI: "Body: Hi John, check out..."

     Can we check for forbidden words? Yes!
     ↓ (Continue generating)
AI: "...DONE!"

     Almost everything already validated!

Think of it like proofreading a letter as someone writes it:

WITHOUT STREAMING:
Writer: *writes entire letter*
Writer: "Done! Can you proofread?"
Proofreader: "Sure, give me 5 minutes"
 
WITH STREAMING:
Writer: "Dear..."
Proofreader: "Looks good so far"
Writer: "...Sir/Madam..."
Proofreader: "Good, keep going"
Writer: "...I am writing to..."
Proofreader: "Wait, you misspelled something!"
Writer: *fixes immediately*
 
When the letter is done, proofreading is almost done too!
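A sketch of the idea: validate each chunk as it streams in instead of waiting for the full email. The chunk list, subject rule, and forbidden-phrase list are illustrative, not our actual rules.

```python
FORBIDDEN_PHRASES = {"guaranteed winnings"}

def validate_stream(chunks: list[str]) -> bool:
    seen = ""
    for chunk in chunks:
        seen += chunk
        # As soon as the first line is complete, the subject can be
        # checked, long before generation finishes.
        if "\n" in seen:
            first_line = seen.split("\n", 1)[0]
            if first_line.startswith("Subject:"):
                subject = first_line[len("Subject:"):].strip()
                if len(subject) > 60:
                    return False
        # Forbidden phrases can be scanned on every partial chunk.
        if any(p in seen.lower() for p in FORBIDDEN_PHRASES):
            return False
    return True

good = validate_stream(["Subject: Special offer\n", "Hi John, ", "check out..."])
bad = validate_stream(["Subject: Hi\n", "Claim your guaranteed winnings"])
```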

Solution 5 - Load Everything Into Memory

Database lookups are slow. Memory lookups are fast.

SLOW (Database lookup every time):
"Does product SKU-123 exist?"
→ Send query to database
→ Database searches millions of rows
→ Database returns answer
100ms
 
FAST (Memory lookup):
"Does product SKU-123 exist?"
→ Check set in memory
→ {"SKU-001", "SKU-002", ..., "SKU-123", ...}
→ Yes, it's in the set
0.001ms

The tradeoff:

  • Uses more memory
  • Need to refresh periodically (data might change)
  • Worth it for data that changes rarely

Think of it like a cheat sheet:

WITHOUT CHEAT SHEET:
"What's the formula for area of a circle?"
→ Open textbook
→ Find chapter on circles
→ Find the formula
2 minutes
 
WITH CHEAT SHEET:
"What's the formula for area of a circle?"
→ Look at cheat sheet on desk
→ πr²
2 seconds
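The in-memory version is a few lines. `load_valid_skus` is a stand-in for one database query at startup; after that, every check is a set-membership test.

```python
def load_valid_skus() -> set[str]:
    # In production: one "load all product IDs" query at startup,
    # refreshed on a timer because the data rarely changes.
    return {"SKU-001", "SKU-002", "SKU-123"}

VALID_SKUS = load_valid_skus()

def product_exists_fast(product_id: str) -> bool:
    return product_id in VALID_SKUS  # microseconds, no network round trip

product_exists_fast("SKU-123")  # True
product_exists_fast("SKU-999")  # False
```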

Putting It All Together

Here's how I validate now:

AI RESPONSE ARRIVES


┌───────────────────────────────────────────────┐
│  STEP 1: Fast Blocking Checks (5ms total)     │
│  ────────────────────────────────────────     │
│  • Is JSON valid?                             │
│  • Subject under 60 chars?                    │
│  • No template artifacts?                     │
│  • Product ID in memory cache?                │
│                                               │
│  If ANY fail → Reject immediately             │
└───────────────────────────────────────────────┘

        │ All passed
        ▼
┌───────────────────────────────────────────────┐
│  STEP 2: Return Response (0ms)                │
│  ─────────────────────────────                │
│  User gets the email content NOW              │
└───────────────────────────────────────────────┘

        │ Meanwhile, in the background...
        ▼
┌───────────────────────────────────────────────┐
│  STEP 3: Background Checks (parallel)         │
│  ────────────────────────────────────         │
│  • Content safety moderation                  │
│  • Detailed quality scoring                   │
│  • Personalization audit                      │
│                                               │
│  If ANY fail → Flag for review, maybe stop    │
└───────────────────────────────────────────────┘

The Results

BEFORE (Naive approach):
─────────────────────────────────
AI generation:        800ms
Validation:           400ms (sequential, all blocking)
TOTAL:               1200ms
 
 
AFTER (Optimized approach):
─────────────────────────────────
AI generation:        800ms
Blocking validation:   15ms (parallel, cached, in-memory)
Background:            0ms (happens after response)
TOTAL:                815ms

Validation time: 400ms → 15ms (96% reduction!)

We went from adding 50% overhead to adding 2% overhead.


When NOT to Use These Tricks

These optimizations work when:

  • Background failures are rare (< 1%)
  • You can fix problems after the fact
  • Speed matters more than perfect accuracy

They DON'T work when:

  • Every output must be verified before use
  • Failures are common
  • A bad output causes serious harm

GOOD USE CASES:
• Marketing emails (can suppress bad ones)
• Product recommendations (can show generic fallback)
• Content suggestions (user can ignore)
 
BAD USE CASES:
• Financial transactions (must verify before executing)
• Medical advice (can't risk bad output)
• Legal documents (must be 100% accurate)

Key Lessons

Lesson 1: Sequential Is the Enemy of Speed

If checks don't depend on each other, run them at the same time. This alone cut our validation from 355ms to 150ms.

Lesson 2: Not Everything Is Urgent

Some checks can happen after you've already responded. Move slow, non-critical checks to the background.

Lesson 3: Cache Everything You Can

If you've checked something once and it won't change soon, remember the answer.

Lesson 4: Memory Is Faster Than Databases

If your validation data fits in memory and doesn't change often, load it once and keep it there.


Quick Reference

Parallel validation (using Python's asyncio.gather):

# inside an async function
results = await asyncio.gather(
    check_1(),
    check_2(),
    check_3(),
)

Tiered validation (block vs background):

# Must pass now (fast checks)
validate_schema(content)        # 1ms
validate_format(content)        # 1ms
 
# Can pass later (slow checks)
background_task.delay(content)  # 0ms now, runs later

Caching (remember answers):

cache_key = f"product_exists:{product_id}"
cached = cache.get(cache_key)
if cached is not None:   # `is not None`, so a cached False still counts as a hit
    return cached                         # ~1ms
result = database.lookup(product_id)      # ~100ms
cache.set(cache_key, result, ttl=3600)
return result

Summary

THE PROBLEM:
Validation added 400ms to every AI response
Total time: 800ms (AI) + 400ms (validation) = 1200ms
 
WHY IT HAPPENED:
- Checks ran one at a time (sequential)
- Every check blocked the response
- Same checks repeated without caching
- Database lookups instead of memory lookups
 
THE FIX:
1. Run checks in parallel (355ms → 150ms)
2. Move non-critical checks to background (150ms → 5ms)
3. Cache repeated lookups (5ms → 2ms)
4. Use memory instead of database (2ms → <1ms)
 
THE RESULT:
400ms validation → 15ms validation
96% reduction in validation overhead

Don't skip validation. Make it faster.


  • Preventing LLM Hallucinations - The validation rules that catch AI mistakes before they reach customers
  • Retry Storms - What happens when your LLM fallback chain triggers too many retries
  • Queue Sizing - Managing queues when validation adds latency to your pipeline
Aamir Shahzad

Author

Software Engineer with 7+ years of experience building scalable data systems. Specializing in Django, Python, and applied AI.