
Which LLM For Which Task (And Why I Didn't Self-Host)

#LLM · #OpenAI · #Claude · #Model Selection · #Architecture

When I started building my email generation platform, I thought model selection was simple: use the best model for everything.

GPT-4 for email generation. GPT-4 for extraction. GPT-4 for classification.

Then I ran the numbers:

  • 5,000 emails/day
  • ~800 tokens average per email (prompt + response)
  • GPT-4: $0.03/1K input + $0.06/1K output
  • Daily cost: ~$180

That's $5,400/month just for LLM calls. Not sustainable.
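
The back-of-envelope version of that math, assuming the ~800 tokens split roughly evenly between prompt and response (the split itself is a guess; the real ratio varies per email):

# Rough daily cost estimate. The 400/400 input/output split is an assumption.
EMAILS_PER_DAY = 5_000
INPUT_TOKENS = 400            # prompt share of the ~800-token average
OUTPUT_TOKENS = 400           # response share
GPT4_INPUT_PER_1K = 0.03      # $ per 1K input tokens
GPT4_OUTPUT_PER_1K = 0.06     # $ per 1K output tokens

daily_cost = EMAILS_PER_DAY * (
    INPUT_TOKENS / 1000 * GPT4_INPUT_PER_1K
    + OUTPUT_TOKENS / 1000 * GPT4_OUTPUT_PER_1K
)
print(f"${daily_cost:,.0f}/day")   # → $180/day, ~$5,400/month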

I needed to actually think about which model for which task.


The Task Breakdown

My platform has different types of LLM tasks:

TASK REQUIREMENTS:
 
    Task                    Frequency    Latency    Quality
    ─────────────────────────────────────────────────────────
    Email preview           100/day      <3 sec     Good enough
    Final email generation  5,000/day    Don't care High
    Subject line generation 5,000/day    Don't care Medium
    Business info extraction 500/day     Don't care High accuracy
    Lead classification     5,000/day    Don't care Medium
    Content cleaning        500/day      Don't care Low

Different requirements → different models.


Why I Didn't Self-Host Small Models

First thought: "I'll run Llama 7B locally. Free inference!"

Tried it. Here's what I learned:

The Infrastructure Math

To run a 7B parameter model with decent speed:

  • Minimum GPU: RTX 3090 or better (24GB VRAM)
  • Server cost: ~$150/month (cloud) or $1,500+ upfront (own hardware)
  • Inference speed: ~20 tokens/second on good hardware
  • Concurrent requests: 1-2 per GPU

For 5,000 emails/day with ~200 tokens output each:

SELF-HOSTING MATH:
 
    1,000,000 tokens/day output
    ÷ 20 tokens/second
    = ~14 hours of GPU time
 
    Need multiple GPUs for reasonable throughput.
 
 
CLOUD GPU COSTS:
 
    AWS g4dn.xlarge: ~$0.50/hour × 14 hours = $7/day
    Plus setup, maintenance, scaling headaches.
 
 
API COSTS FOR SAME WORKLOAD:
 
    GPT-3.5: ~$2/day
    Claude Haiku: ~$1.50/day

The API was cheaper AND I didn't have to manage infrastructure.
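
A quick sanity check of that comparison; the 20 tokens/second and the $0.50/hour rate are the assumptions doing the work, so plug in your own measurements:

# Self-host vs. cloud GPU cost sketch. Throughput and hourly rate are assumptions.
OUTPUT_TOKENS_PER_DAY = 1_000_000
TOKENS_PER_SECOND = 20        # single 7B model on a single GPU
GPU_HOURLY_RATE = 0.50        # e.g. an on-demand g4dn.xlarge

gpu_hours = OUTPUT_TOKENS_PER_DAY / TOKENS_PER_SECOND / 3600
gpu_cost = gpu_hours * GPU_HOURLY_RATE
print(f"{gpu_hours:.1f} GPU-hours/day = ${gpu_cost:.2f}/day")
# ~13.9 GPU-hours/day = ~$7/day, before setup and maintenance time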

The Quality Gap

I tested Llama 7B, Mistral 7B, and Phi-2 against GPT-3.5 for email generation (GPT-4 and Claude Sonnet are in the table as reference points):

QUALITY COMPARISON (out of 10):
 
    Model         Coherence  Tone Match  Instructions  Personalization
    ────────────────────────────────────────────────────────────────────
    GPT-4         9          9           10            9
    GPT-3.5       8          8           9             7
    Claude Sonnet 9          9           9             8
    Llama 7B      6          5           6             4
    Mistral 7B    7          6           7             5

The small models struggled with:

  • Maintaining consistent tone across paragraphs
  • Following complex formatting instructions
  • Incorporating multiple context pieces naturally
  • Not sounding robotic

For a product where email quality directly impacts results, "good enough" wasn't good enough.

When Small Models Make Sense

They're not useless. They work for:

  • Simple classification (spam/not spam)
  • Basic extraction with clear patterns
  • Internal tooling where quality bar is lower
  • High-volume, low-stakes tasks
  • When you have fine-tuning data for your specific use case

For my use case — cold emails that represent a brand — the quality gap was too visible.


The Model Selection I Landed On

class ModelRouter:
    TASK_MODELS = {
        # User-facing, needs speed
        'email_preview': {
            'model': 'gpt-3.5-turbo',
            'temperature': 0.7,
            'max_tokens': 500,
            'timeout': 10,
        },
 
        # Final output, quality matters
        'email_final': {
            'model': 'gpt-4',
            'temperature': 0.7,
            'max_tokens': 800,
            'timeout': 30,
            'fallback': 'claude-3-sonnet',
        },
 
        # Short, creative
        'subject_line': {
            'model': 'gpt-3.5-turbo',
            'temperature': 0.9,  # More creative
            'max_tokens': 50,
            'timeout': 5,
        },
 
        # Accuracy critical, structured output
        'extract_business_info': {
            'model': 'gpt-4',
            'temperature': 0.1,  # Deterministic
            'max_tokens': 300,
            'timeout': 15,
        },
 
        # Simple yes/no type tasks
        'classify_lead': {
            'model': 'gpt-3.5-turbo',
            'temperature': 0.1,
            'max_tokens': 10,
            'timeout': 5,
        },
 
        # Bulk cleaning, low stakes
        'clean_content': {
            'model': 'gpt-3.5-turbo',
            'temperature': 0.1,
            'max_tokens': 200,
            'timeout': 10,
        },
    }
 
    def get_config(self, task_type: str) -> dict:
        return self.TASK_MODELS.get(
            task_type,
            self.TASK_MODELS['email_preview']
        )
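
Usage is then a single lookup before building the API call, something like:

# Look up the per-task config, then pass it to whatever makes the call.
router = ModelRouter()
config = router.get_config('email_final')
# config['model'] == 'gpt-4', config['temperature'] == 0.7,
# config['max_tokens'] == 800, config['timeout'] == 30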

Temperature: The Setting I Ignored Too Long

I left temperature at default (0.7) for everything. Big mistake.

What Temperature Actually Does

TEMPERATURE GUIDE:
 
    0.0 - 0.3    Deterministic. Same input → nearly same output.
                 Good for: extraction, classification.
 
    0.4 - 0.6    Balanced. Some variation, still focused.
                 Good for: structured generation.
 
    0.7 - 0.8    Creative. Natural variation.
                 Good for: emails, content.
 
    0.9 - 1.0    Very creative. Risk of going off-track.
                 Good for: brainstorming, subject lines.
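
The knob itself is just one field on the request. A minimal sketch with the v1.x OpenAI Python client; generate_subject_line and the prompt handling are illustrative, not code from the platform:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_subject_line(prompt: str, temperature: float) -> str:
    # Same prompt, different temperature → very different amounts of variety.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,   # 0.1 for extraction, 0.9 for subject lines
        max_tokens=50,
    )
    return response.choices[0].message.content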

Real Example: Subject Lines

Temperature 0.3:

Email 1: "Quick question about your logistics software"
Email 2: "Quick question about your logistics platform"
Email 3: "Quick question about your logistics solution"

All similar. Boring. Users noticed.

Temperature 0.9:

Email 1: "The hidden cost in your current logistics setup"
Email 2: "What I noticed about [Company]'s shipping approach"
Email 3: "A 3-minute read that might save you 30 hours"

More variety. Occasionally weird ones, but mostly better.

The Problem: Batch Emails to the Same Company

50 leads at one company. Same context. Same prompt structure.

With fixed temperature, emails were too similar. User complained: "These all sound the same."

Fix: Vary temperature slightly in batch processing

def get_temperature_for_batch(
    base_temp: float,
    index: int,
    batch_size: int
) -> float:
    """
    Vary temperature across a batch to get natural variation.
    Base 0.7 → actual range roughly 0.63 to 0.77
    """
    variation = 0.15  # ±0.075 from base
    offset = (index / batch_size) * variation - (variation / 2)
    return max(0.1, min(1.0, base_temp + offset))
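
For a batch of 50 leads at base temperature 0.7, that works out to:

# Spread of temperatures across one batch of 50.
temps = [get_temperature_for_batch(0.7, i, 50) for i in range(50)]
# temps[0] ~ 0.625, temps[-1] ~ 0.772: a gentle spread across the batch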

Now emails to the same company have natural variation without going off-brand.


Response Time Reality

I measured P95 response times across models:

LATENCY (seconds):
 
    Model              P50     P95     P99
    ───────────────────────────────────────
    GPT-3.5 Turbo      1.1     2.8     5.2
    GPT-4              3.2     8.5     15.0
    GPT-4 Turbo        2.1     5.2     9.0
    Claude 3 Haiku     0.8     1.5     2.5
    Claude 3 Sonnet    1.8     4.2     8.0
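
For reference, a minimal way to get those percentiles from per-call timings (latency_percentiles is just an illustrative helper; record one duration per call with time.perf_counter()):

import statistics

def latency_percentiles(durations: list[float]) -> dict:
    """P50/P95/P99 from per-call latencies in seconds (needs a decent sample size)."""
    qs = statistics.quantiles(durations, n=100)  # 99 cut points
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}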

Implications:

  • Preview (user waiting): GPT-3.5 or Claude Haiku. 8 seconds is too long.
  • Batch processing: GPT-4 fine. Nobody watching.
  • Set realistic timeouts: GPT-4 with a 5s timeout = lots of false failures.

def get_timeout(model: str) -> int:
    if 'gpt-4' in model and 'turbo' not in model:
        return 30  # It's slow, accept it
    elif 'gpt-4-turbo' in model:
        return 15
    elif 'gpt-3.5' in model:
        return 10
    elif 'haiku' in model:
        return 8
    elif 'sonnet' in model:
        return 15
    else:
        return 20

Fallback Chains: When Primary Fails

OpenAI goes down. What happens?

Before: Everything fails. Queue backs up. Users angry.

After: Automatic fallback.

FALLBACK_CHAINS = {
    'email_final': [
        {'provider': 'openai', 'model': 'gpt-4'},
        {'provider': 'anthropic', 'model': 'claude-3-sonnet'},
        {'provider': 'openai', 'model': 'gpt-3.5-turbo'},
    ],
    'email_preview': [
        {'provider': 'openai', 'model': 'gpt-3.5-turbo'},
        {'provider': 'anthropic', 'model': 'claude-3-haiku'},
    ],
    'extract_business_info': [
        {'provider': 'openai', 'model': 'gpt-4'},
        {'provider': 'anthropic', 'model': 'claude-3-sonnet'},
        # No GPT-3.5 fallback - quality too important
    ],
}
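
# call_provider, ProviderError, and AllProvidersFailed are assumed to be
# defined elsewhere: thin wrappers over the provider SDK calls plus a
# custom exception type.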
 
async def call_with_fallback(task_type: str, prompt: str) -> dict:
    chain = FALLBACK_CHAINS.get(
        task_type,
        FALLBACK_CHAINS['email_preview']
    )
 
    for option in chain:
        try:
            return await call_provider(
                option['provider'],
                option['model'],
                prompt,
                timeout=get_timeout(option['model'])
            )
        except (ProviderError, TimeoutError) as e:
            logger.warning(
                f"{option['model']} failed: {e}, trying next"
            )
            continue
 
    raise AllProvidersFailed(f"All models failed for {task_type}")

Key insight: Fallback quality should match task importance.

For extraction (accuracy critical), I'd rather fail than use GPT-3.5 and get wrong data.

For previews (speed critical), degraded quality is fine — user just wants to see something.


The Cost Breakdown After Optimization

DAILY COSTS:
 
    Task              Model       Volume     Cost
    ───────────────────────────────────────────────
    Email preview     GPT-3.5     100        $0.02
    Email final       GPT-4       5,000      $15.00
    Subject lines     GPT-3.5     5,000      $0.50
    Extraction        GPT-4       500        $1.50
    Classification    GPT-3.5     5,000      $0.25
    ───────────────────────────────────────────────
    TOTAL                                    ~$17/day

Down from $180/day with "GPT-4 for everything."

Still not cheap, but sustainable for a product that needs to scale.


Key Takeaways

WHAT I LEARNED:
 
Match model to task       GPT-4 for classification is like
                          hiring a lawyer to sort mail.
 
Small models aren't free  Infrastructure costs add up.
                          Quality gaps are real.
                          APIs are often cheaper.
 
Temperature is a tool     Extraction needs 0.1.
                          Creative needs 0.9.
                          Stop using defaults.
 
Measure latency           Users don't care how smart GPT-4 is
                          if they're waiting 10 seconds.
 
Build fallback chains     Provider outages happen.
                          Have a plan.
 
Vary temperature          Same prompt + same temperature =
in batches                similar outputs. Users notice.
 
Quality requirements      Not every task needs the best model.
drive model selection     But some do.



The real insight: model selection is a product decision, not a technical one. What's the user willing to wait for? What quality level do they actually need? Answer those first, then pick the model.

Aamir Shahzad

Software Engineer with 7+ years of experience building scalable data systems. Specializing in Django, Python, and applied AI.