This post describes a real project I worked on. Names, specific systems, and identifying details have been changed. The technical problems and approaches are real, but the code snippets are simplified illustrations, not production code.
The Request That Kept Coming Back
Every week, the same pattern:
"Can you pull records where region is North and signal is below -85?"
"I need everything where contractor is ABC OR contractor is XYZ, but NOT status INACTIVE."
"Records where name contains 'Tower' AND region is one of North, South, East."
I was becoming a human SQL interface for 50+ engineers. Each request took 10-15 minutes. Five requests a day. Over an hour daily, just writing SELECT statements.
I needed to build something that let them query data themselves. Without learning SQL.
The Architecture Decision
First question: how does the frontend tell the backend what to filter?
Option A: Structured JSON. Clean for machines, but representing (A AND B) OR (C AND D) in nested JSON gets ugly. And when engineers looked at the JSON preview, their eyes glazed over.
Option B: A human-readable string. region = 'North' AND signal < -85
Engineers understood Option B instantly. They think in conditions, not data structures.
The trade-off: now I had to build a parser.
The Parsing Challenge
The frontend sends: region = 'North' AND (signal < -85 OR capacity > 90)
I needed to turn this into a safe database query, not by string concatenation (an SQL injection nightmare), but by actually understanding the structure.
Classic compiler problem: tokenize → parse → generate.
Step 1: Tokenizing (The Pattern Ordering Trap)
Breaking the string into tokens seemed straightforward. Until I hit this:
Input: "status NOT IN (A, B)"
My first tokenizer saw: status, NOT, IN, (A, B)
But I needed: status, NOT IN, (A, B)
The fix: order patterns from longest to shortest. Check for NOT IN before checking for NOT and IN separately.
```python
# Order matters! Longer patterns first.
PATTERNS = [
    (r'\bIS\s+NOT\s+NULL\b', 'IS_NOT_NULL'),
    (r'\bIS\s+NULL\b', 'IS_NULL'),
    (r'\bNOT\s+IN\b', 'NOT_IN'),          # Before NOT and IN
    (r'\bSTARTS_WITH\b', 'STARTS_WITH'),
    (r'\bENDS_WITH\b', 'ENDS_WITH'),
    (r'\bCONTAINS\b', 'CONTAINS'),
    (r'\bAND\b', 'AND'),
    (r'\bNOT\b', 'NOT'),                  # After NOT IN
    (r'\bOR\b', 'OR'),
    (r'\bIN\b', 'IN'),                    # After NOT IN
    # ... operators, values, etc.
]
```

Same issue with IS NOT NULL vs IS NULL vs IS. Pattern ordering is everything.
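To show how that ordered pattern list gets applied, here's a minimal tokenizer loop sketch. The pattern list below is a trimmed illustration, not the production PATTERNS; the key point is that patterns are tried in order at each position, so the longer alternatives win.

```python
import re

# Trimmed, illustrative pattern list -- longest alternatives first.
TOKEN_PATTERNS = [
    (r'\bNOT\s+IN\b', 'NOT_IN'),      # must come before NOT and IN
    (r'\bAND\b', 'AND'),
    (r'\bNOT\b', 'NOT'),
    (r'\bOR\b', 'OR'),
    (r'\bIN\b', 'IN'),
    (r'\(', 'LPAREN'),
    (r'\)', 'RPAREN'),
    (r',', 'COMMA'),
    (r"'[^']*'", 'STRING'),
    (r'-?\d+(\.\d+)?', 'NUMBER'),
    (r'[<>!=]=?', 'OP'),
    (r'\w+', 'IDENT'),                # last, so keywords aren't swallowed
]

def tokenize(text):
    """Scan left to right, trying each pattern in order at every position."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for pattern, kind in TOKEN_PATTERNS:
            m = re.match(pattern, text[pos:], re.IGNORECASE)
            if m:
                tokens.append((kind, m.group(0)))
                pos += len(m.group(0))
                break
        else:
            raise ValueError(f"Unexpected character at {pos}: {text[pos]!r}")
    return tokens
```

With this ordering, tokenize("status NOT IN (A, B)") yields a single NOT_IN token rather than separate NOT and IN tokens.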
Step 2: Parsing (The Precedence Problem)
Tokens aren't enough. A AND B OR C is ambiguous.
Is it (A AND B) OR C or A AND (B OR C)?
I needed operator precedence. AND binds tighter than OR (like multiplication before addition). Parentheses override everything.
The trick: recursive descent with precedence levels.
```python
def parse_expression(self):
    return self.parse_or()  # Lowest precedence

def parse_or(self):
    left = self.parse_and()  # AND binds tighter
    while self.current_token() == 'OR':
        self.consume('OR')
        right = self.parse_and()
        left = BinaryOp(left, 'OR', right)
    return left

def parse_and(self):
    left = self.parse_unary()  # NOT binds tighter still
    while self.current_token() == 'AND':
        self.consume('AND')
        right = self.parse_unary()
        left = BinaryOp(left, 'AND', right)
    return left
```

Each function calls the next-higher precedence level. The structure falls out naturally.
Result: an Abstract Syntax Tree.
Input: "region = 'North' AND (signal < -85 OR capacity > 90)"
Tree:

```
            AND
           /   \
region='North'  OR
               /  \
      signal<-85   capacity>90
```

Once I had the tree, generating SQL was just walking it.
Step 3: Safe SQL Generation (The Parameterization)
Never concatenate user input into SQL. Ever.
```python
def generate_condition(self, node):
    field = node.field.lower()
    # Whitelist check
    if field not in ALLOWED_FIELDS:
        raise ValueError(f"Unknown field: {node.field}")
    # Add parameter, return placeholder
    self.params.append(node.value)
    placeholder = f"${len(self.params)}"
    return f"{field} {node.operator} {placeholder}"
```

The output: "region = $1 AND (signal < $2 OR capacity > $3)" plus ['North', -85, 90]
Database driver handles escaping. SQL injection impossible.
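generate_condition handles the leaves; the recursive walk over BinaryOp nodes ties it together. Here's a self-contained sketch of that walk, with illustrative node classes and an ALLOWED_FIELDS set standing in for the real ones. This version parenthesizes every BinaryOp, which is slightly more verbose than the output quoted above but equivalent.

```python
# Illustrative whitelist -- the real one comes from the schema.
ALLOWED_FIELDS = {'region', 'signal', 'capacity'}

class Condition:
    def __init__(self, field, operator, value):
        self.field, self.operator, self.value = field, operator, value

class BinaryOp:
    def __init__(self, left, op, right):
        self.left, self.op, self.right = left, op, right

class SQLGenerator:
    def __init__(self):
        self.params = []  # positional parameters, in $1, $2, ... order

    def generate(self, node):
        # Recurse through AND/OR nodes; conditions become parameterized leaves.
        if isinstance(node, BinaryOp):
            left = self.generate(node.left)
            right = self.generate(node.right)
            return f"({left} {node.op} {right})"
        return self.generate_condition(node)

    def generate_condition(self, node):
        field = node.field.lower()
        if field not in ALLOWED_FIELDS:
            raise ValueError(f"Unknown field: {node.field}")
        self.params.append(node.value)
        return f"{field} {node.operator} ${len(self.params)}"
```

Parameter numbering falls out of the walk order: each leaf appends its value and uses the list's current length as its placeholder index, so SQL text and parameter list always stay in sync.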
The Clever Bit: Fuzzy Field Matching
Here's where I almost made a mistake.
First instinct: reject unknown fields with an error. But when someone types regoin instead of region, a generic "unknown field" error is frustrating.
The fix: fuzzy matching with suggestions.
```python
from difflib import get_close_matches

def validate_field(self, field):
    if field.lower() in ALLOWED_FIELDS:
        return field.lower()
    # Find close matches
    suggestions = get_close_matches(
        field.lower(),
        ALLOWED_FIELDS,
        n=1,
        cutoff=0.6
    )
    if suggestions:
        raise ValueError(f"Unknown field: '{field}'. Did you mean: {suggestions[0]}?")
    else:
        raise ValueError(f"Unknown field: '{field}'")
```

Now regoin = 'North' returns: "Unknown field: 'regoin'. Did you mean: region?"
This one feature eliminated 80% of support requests overnight. Users fixed their own typos instead of asking me.
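A quick sketch of why the 0.6 cutoff works well here: it's loose enough to catch a transposition like regoin, but strict enough that unrelated garbage gets no suggestion at all. The field list below is illustrative, not the production ALLOWED_FIELDS.

```python
from difflib import get_close_matches

# Illustrative field list, not the production whitelist.
fields = ['region', 'signal', 'capacity', 'contractor', 'status']

# A transposed 'region' still scores above the 0.6 similarity cutoff,
# so the user gets a suggestion...
typo_match = get_close_matches('regoin', fields, n=1, cutoff=0.6)

# ...while unrelated input scores below it and gets no suggestion,
# falling through to the plain "Unknown field" error.
no_match = get_close_matches('xyzzy', fields, n=1, cutoff=0.6)
```

difflib scores similarity as 2*M/T (matched characters over total length), so a single transposition in a six-letter word still scores well above 0.6.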
What Surprised Me
They wanted to save queries. I built "run and done." They wanted to save complex queries and rerun them weekly. Added that in week 2.
They shared queries. One engineer built a useful filter, told teammates. Added shared queries in week 3.
They never read documentation. Wrote a 3-page guide. Nobody opened it. Added tooltips and inline hints instead. Usage jumped immediately.
The Results
| Metric | Before | After |
|---|---|---|
| Data requests to me | 5/day | 0-1/day |
| Engineer wait time | 30 min - 2 hours | Instant |
| My time on data pulls | 1+ hour/day | ~10 min/week |
The real win: engineers started finding insights I never would have looked for. They knew the domain. Now they had the tool.
Lessons Learned
The string format was worth the complexity. JSON would've been easier to parse but harder for users to understand. UX won over engineering convenience.
Never trust input, but be helpful about it. Don't just reject bad queries; suggest fixes. The difference between "error" and "did you mean...?" is the difference between frustration and success.
Pattern ordering is everything in tokenizers. Longer patterns first. I learned this the hard way with NOT IN vs NOT.
Related Reading
- Turning a 4-Hour Report Into a Button Click - The reports this tool replaced
- Scheduled Jobs That Actually Recover - Running queries automatically
- The XLSX From Hell - Another "simple" task that wasn't
