This post describes a real project I worked on. Names, specific systems, and identifying details have been changed. The technical problems and approaches are real, but the code snippets are simplified illustrations, not production code.
The Monday Morning Ritual
Every Monday at 9am, I watched the same thing happen.
The analyst opened Excel. Then System A, the asset database. Export. Wait. Download CSV. Copy 2,000 rows.
Then System B, the contractor portal. Different login. Different UI. Export. Wait. Copy 1,500 rows.
Then System C, the ticketing system. Another login. Export. Wait. Copy 800 rows.
Then the real work started. VLOOKUPs to match records across systems. Conditional formatting to flag problems. Manual cross-referencing. Pivot tables for the summary.
Four hours. Every Monday.
By the time the report was finished, leadership was asking about Tuesday's issues. The data was already stale.
I watched this twice. Then I asked: "What if there was just a button?"
The Three-Headed Data Monster
Each system had a different way to get data out.
System A had a proper REST API. Paginated JSON. The dream.
System B had no API. A CSV appeared on an SFTP server every night at 3am. Classic legacy.
System C had a SOAP API. In 2024. XML everywhere. I questioned my career choices.
The first win was just getting all three into the same Python script. Three protocols, three auth methods, three data formats, all landing in Pandas DataFrames.
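To make that concrete, here is a minimal sketch of the three adapters, not the production code: the hosts, paths, credentials, response shapes, and function names are placeholders, and paramiko and zeep are just one plausible choice of SFTP and SOAP client.

```python
import pandas as pd
import paramiko                      # SFTP client (one plausible choice)
import requests
from zeep import Client              # SOAP client (one plausible choice)
from zeep.helpers import serialize_object

def fetch_assets() -> pd.DataFrame:
    """System A: paginated REST API returning JSON."""
    rows, page = [], 1
    while True:
        resp = requests.get(
            "https://system-a.example.com/api/assets",   # placeholder URL
            params={"page": page},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()["results"]                   # placeholder response shape
        if not batch:
            return pd.DataFrame(rows)
        rows.extend(batch)
        page += 1

def fetch_contractors() -> pd.DataFrame:
    """System B: nightly CSV dropped on an SFTP server."""
    transport = paramiko.Transport(("sftp.system-b.example.com", 22))   # placeholder host
    transport.connect(username="reporting", password="***")
    sftp = paramiko.SFTPClient.from_transport(transport)
    with sftp.open("/exports/contractors_latest.csv") as f:             # placeholder path
        df = pd.read_csv(f)
    transport.close()
    return df

def fetch_tickets() -> pd.DataFrame:
    """System C: SOAP API returning XML, flattened into a DataFrame."""
    client = Client("https://system-c.example.com/service?wsdl")        # placeholder WSDL
    tickets = serialize_object(client.service.GetOpenTickets())         # placeholder operation
    return pd.DataFrame(tickets)

df_assets, df_contractors, df_tickets = fetch_assets(), fetch_contractors(), fetch_tickets()
```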
The Matching Problem
Here's where it got interesting.
Each system used different ID formats:
- System A: `SITE-001`
- System B: `site_001`
- System C: `S001`
Same record. Three different identifiers.
The fix: Normalize before merging.
```python
import re

def normalize_id(raw_id):
    """SITE-001, site_001, S001 → SITE001"""
    cleaned = re.sub(r'[^A-Z0-9]', '', str(raw_id).upper())
    # System C uses the short form, so expand the bare S prefix (S001 → SITE001)
    return re.sub(r'^S(?=\d)', 'SITE', cleaned)

df_assets['record_id'] = df_assets['raw_id'].apply(normalize_id)
df_contractors['record_id'] = df_contractors['site_reference'].apply(normalize_id)
df_tickets['record_id'] = df_tickets['asset_id'].apply(normalize_id)
```

Once IDs lined up, Pandas merges just worked.
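With a shared `record_id` column, the joins themselves are one-liners. A minimal sketch (the `merged` name and the join direction are illustrative):

```python
# Left-join contractor and ticket data onto the asset list using the shared key
merged = (
    df_assets
    .merge(df_contractors, on='record_id', how='left', suffixes=('', '_contractor'))
    .merge(df_tickets, on='record_id', how='left', suffixes=('', '_ticket'))
)
```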
The Anomaly Detection Breakthrough
This was my favorite part.
The analyst spent an hour manually checking: "Is every record in System A also in System B?" Spot-checking 20% of rows, hoping to catch mismatches.
I automated it with set operations:
```python
# Get unique IDs from each system
ids_assets = set(df_assets['record_id'])
ids_contractors = set(df_contractors['record_id'])
ids_tickets = set(df_tickets['record_id'])

# Find the gaps
missing_from_contractors = ids_assets - ids_contractors
missing_from_assets = ids_contractors - ids_assets
orphan_tickets = ids_tickets - ids_assets

anomalies = pd.DataFrame({
    'record_id': (
        list(missing_from_contractors) +
        list(missing_from_assets) +
        list(orphan_tickets)
    ),
    'issue': (
        ['Missing from contractor system'] * len(missing_from_contractors) +
        ['In contractor system but not in assets'] * len(missing_from_assets) +
        ['Ticket exists for unknown asset'] * len(orphan_tickets)
    )
})
```

First automated run: 47 anomalies.
The analyst stared at the screen. "The manual process usually caught maybe 10."
The manual process sampled. The automated process checked everything.
The Unexpected Win: Data Quality
Those 37 extra anomalies per week? Real problems hiding in plain sight.
- Records created in one system but never synced to another
- Tickets filed against assets that had been decommissioned
- Contractors working on sites that weren't in the official database
The tool didn't just save time. It found problems humans were missing.
Why Excel, Not a Dashboard
I could have built a fancy web dashboard. Real-time updates. Interactive charts.
Leadership wanted Excel.
Not because they're dinosaurs. Because Excel is their workflow. They filter, sort, pivot, add comments, forward to their teams. A dashboard would've meant changing how they work.
The best tool is the one people actually use. I matched their format. They didn't even notice the process had changed; they just noticed that Monday reports arrived before lunch.
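The hand-off itself was nothing fancier than writing the merged data and the anomaly list into a workbook. A minimal sketch, assuming openpyxl is installed and using illustrative sheet and file names:

```python
import pandas as pd

# One familiar .xlsx deliverable: a sheet of combined data, a sheet of anomalies
with pd.ExcelWriter('monday_report.xlsx', engine='openpyxl') as writer:
    merged.to_excel(writer, sheet_name='Combined Data', index=False)
    anomalies.to_excel(writer, sheet_name='Anomalies', index=False)
```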
The Results
| Metric | Before | After |
|---|---|---|
| Time to generate | 4 hours | 45 seconds |
| Frequency | Weekly (Monday) | On-demand |
| Anomalies caught | ~10/week | ~45/week |
| Human errors | Occasional VLOOKUP mistakes | Zero |
The analyst's Monday transformed. Click button at 9:05. Review by 9:10. Actual analytical work by 9:30.
The time saved wasn't the biggest win. The biggest win was spending four hours investigating anomalies instead of finding them.
Lessons Learned
Watch before you build. I didn't ask "what do you need?" I watched them work. The requirements they would've given: "A better reporting tool." What I actually built: three data adapters, ID normalization, and set-based anomaly detection.
Normalize IDs early. Different systems, different formats. One normalization function solved a week of merge headaches.
Set operations find gaps fast. `ids_a - ids_b` finds every record in A that's missing from B. One line. Checks everything.
Match the output format. They wanted Excel. I gave them Excel. "Better" that nobody uses is worse than "same" that everyone uses.
Related Reading
- When the Frontend Sends a Query as a String - The next tool I built for this team
- Scheduled Jobs That Actually Recover - Automating this report
- Dashboards Leadership Actually Uses - When dashboards make sense
