Multi-tenant AI mirror-site platform
Worked as the AI engineer on a multi-tenant B2B platform that turns customer websites into AI-friendly mirrors — crawl, structured extraction, translation, automated deploy. Live in production.
May 4, 2026
Problem
Constraints
What I built
Key decision
Outcome
Hindsight
Architecture notes
Three patterns carry most of the weight on the AI side of the platform.
One canonical tool schema, many providers
Every structured-output call in the pipeline goes through a single tool schema describing the extraction shape: the fields, types, and constraints the downstream code expects. The same schema is registered with every LLM provider in the rotation. The rest of the system reads provider-agnostic results.
This shape isn't unique to LLM work: it's the same idea as a typed adapter layer over multiple databases or message brokers. The wrinkle specific to LLMs is that forced tool use is what makes the contract binary. Either the provider calls the tool with arguments that match the schema, or the call is treated as an error. There's no "kind of works" middle ground.
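A minimal sketch of the one-schema, many-providers idea, under stated assumptions. The tool name, its fields, and the helper names are hypothetical. The two payload shapes follow the Anthropic Messages and OpenAI Chat Completions tool formats as I understand them, chosen only for illustration since the providers in the rotation aren't named here. The argument check is deliberately tiny (required keys and top-level types); a real pipeline would more likely use a full JSON Schema validator.

```python
from typing import Any

# Hypothetical canonical tool: the name and fields are illustrative only.
EXTRACT_PAGE: dict[str, Any] = {
    "name": "extract_page",
    "description": "Return the structured content of one crawled page.",
    "schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "language": {"type": "string"},
            "sections": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["title", "language", "sections"],
    },
}


def to_anthropic_style(tool: dict[str, Any]) -> dict[str, Any]:
    """Request fragment that registers the tool and forces the model to call it."""
    return {
        "tools": [{
            "name": tool["name"],
            "description": tool["description"],
            "input_schema": tool["schema"],
        }],
        "tool_choice": {"type": "tool", "name": tool["name"]},
    }


def to_openai_style(tool: dict[str, Any]) -> dict[str, Any]:
    """Same tool, same schema object, different envelope."""
    return {
        "tools": [{
            "type": "function",
            "function": {
                "name": tool["name"],
                "description": tool["description"],
                "parameters": tool["schema"],
            },
        }],
        "tool_choice": {"type": "function", "function": {"name": tool["name"]}},
    }


_JSON_TYPES = {"string": str, "array": list, "object": dict}


def check_arguments(tool: dict[str, Any], args: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the arguments are accepted."""
    schema = tool["schema"]
    problems = [f"missing field: {k}" for k in schema["required"] if k not in args]
    for key, value in args.items():
        spec = schema["properties"].get(key)
        if spec is None:
            problems.append(f"unexpected field: {key}")
        elif not isinstance(value, _JSON_TYPES[spec["type"]]):
            problems.append(f"wrong type for {key}")
    return problems
```

Downstream code would only ever see arguments that passed `check_arguments`, whichever adapter produced the request. Any non-empty problem list is handled as a failed call rather than a partial result.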
Eval harness as the deploy gate
The eval framework lives in a services/eval/ package with three pieces:
- A runner that executes the full extraction pipeline against a curated case set
- A scorer that grades each output by schema correctness, property assertions, and a small set of regression cases
- Ground-truth JSON for representative customer sites, plus dated result runs that let me compare today's pass rate against last week's
The harness runs in CI on every change that touches prompts, tool schemas, or post-processing. If the pass rate for any case tag drops below its configured threshold, the build fails and the PR can't merge. That gate is what makes prompt and model changes safe to ship: not because the changes are guaranteed to be improvements, but because regressions are visible before they reach customers.
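The gating step can be sketched roughly as follows. This is an illustration under assumptions, not the contents of services/eval/: the `CaseResult` shape, the tag names, and the function names are all hypothetical, and the scorer is assumed to have already reduced each case to pass or fail.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    """One scored eval case (hypothetical shape)."""
    case_id: str
    tags: tuple[str, ...]
    passed: bool


def pass_rate_by_tag(results: list[CaseResult]) -> dict[str, float]:
    """Fraction of passing cases per tag; a case counts once for each of its tags."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for result in results:
        for tag in result.tags:
            totals[tag] += 1
            passes[tag] += int(result.passed)
    return {tag: passes[tag] / totals[tag] for tag in totals}


def failing_tags(
    results: list[CaseResult],
    thresholds: dict[str, float],
    default_threshold: float = 1.0,
) -> dict[str, float]:
    """Tags whose pass rate is below their threshold, mapped to the observed rate."""
    rates = pass_rate_by_tag(results)
    return {
        tag: rate
        for tag, rate in rates.items()
        if rate < thresholds.get(tag, default_threshold)
    }


def gate_exit_code(results: list[CaseResult], thresholds: dict[str, float]) -> int:
    """Non-zero when any tag regressed, so CI fails the build."""
    return 1 if failing_tags(results, thresholds) else 0
```

In CI the runner would end with something like `sys.exit(gate_exit_code(results, thresholds))`. Defaulting untagged thresholds to 1.0 is a design choice in this sketch: a new tag is strict until someone explicitly relaxes it.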
Budget guards as availability, not optimisation
LLM calls are external paid services. The budget layer treats them like any other untrusted upstream:
- Per-call max_tokens ceilings sized to the feature, not the model default
- Per-tenant daily and hourly caps, enforced before the API call via Redis-cached counters
- A global daily ceiling as a kill switch
- Cost-per-call computed at log time and stored alongside latency and token counts, so cost dashboards are a SQL aggregate, not a join across pricing tables
- Mock-mode fallback when an API key is missing or a provider errors out — the pipeline returns a degraded result instead of cascading failure into the rest of the system
The result is that the worst-case AI bill on this platform is a known, bounded number. That property — a number you can write down — is what distinguishes "we built an AI feature" from "we run an AI feature in production."
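The caps and the log-time cost figure from the list above can be sketched like this. Everything named here is an assumption for illustration: the class, the key layout, and the per-million-token prices are hypothetical, and a plain dict stands in for the Redis-cached counters so the example runs on its own.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


class BudgetExceeded(Exception):
    """Raised before the API call when a cap would be crossed."""


@dataclass
class BudgetGuard:
    tenant_hourly_cap: float  # USD
    tenant_daily_cap: float   # USD
    global_daily_cap: float   # USD, the kill switch
    # In-memory stand-in for Redis counters: key -> spend in USD.
    spend: dict[str, float] = field(default_factory=dict)

    def _keys(self, tenant: str, now: datetime) -> list[tuple[str, float]]:
        day = now.strftime("%Y-%m-%d")
        hour = now.strftime("%Y-%m-%dT%H")
        return [
            (f"tenant:{tenant}:hour:{hour}", self.tenant_hourly_cap),
            (f"tenant:{tenant}:day:{day}", self.tenant_daily_cap),
            (f"global:day:{day}", self.global_daily_cap),
        ]

    def check(self, tenant: str, estimated_cost: float, now: datetime) -> None:
        """Refuse the call if any counter plus the estimate would exceed its cap."""
        for key, cap in self._keys(tenant, now):
            if self.spend.get(key, 0.0) + estimated_cost > cap:
                raise BudgetExceeded(key)

    def record(self, tenant: str, actual_cost: float, now: datetime) -> None:
        """Add the real cost to every counter after the call returns."""
        for key, _cap in self._keys(tenant, now):
            self.spend[key] = self.spend.get(key, 0.0) + actual_cost


def call_cost(
    input_tokens: int,
    output_tokens: int,
    usd_per_mtok_in: float,
    usd_per_mtok_out: float,
) -> float:
    """Cost in USD, computed once at log time and stored with the call record."""
    return (input_tokens * usd_per_mtok_in + output_tokens * usd_per_mtok_out) / 1_000_000
```

With real Redis counters, the separate check-then-record steps would need to be atomic (for example, incrementing first and rolling back on overflow), and the hourly and daily keys would carry expiries. The sketch skips both to keep the control flow visible.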
What this work taught me that's transferable
LLM features that survive in production aren't the ones with the fanciest prompts. They're the ones where the boundary between code and model is treated as the same kind of contract as any other API boundary — typed input, typed output, observability, cost ceilings, fallback paths. Most of the work isn't model work; it's plumbing discipline applied to a new shape of dependency. That's the lens I take into any AI-backed feature now.