Multi-tenant AI mirror-site platform

Worked as the AI engineer on a multi-tenant B2B platform that turns customer websites into AI-friendly mirrors — crawl, structured extraction, translation, automated deploy. Live in production.

May 4, 2026

Live B2B product. Brand confidential at this stage — happy to discuss the work in conversation.

Problem

The product turns a customer's source website into a set of AI-friendly mirror sites — translated, structured, kept in sync, deployed to a CDN with versioning. The technical problem is harder than it sounds: crawl reliably across arbitrary site shapes; extract structured content from HTML that wasn't designed for extraction; translate consistently across long-running tenants; and deploy versioned mirrors without manual intervention. All of it has to be multi-tenant, cost-bounded per customer, and observable enough that drift in any layer is caught before a customer reports it.

Constraints

B2B SaaS: paying customers, real money flowing, correctness non-negotiable. Multi-provider LLM strategy from day one: the extraction layer can't be coupled to one provider's exact response shape, because vendor outages and per-tenant routing both have to work without code changes. Per-customer budget caps must be enforced *before* the API call, not after; surprise bills are a company-level risk. Eval gates have to block deploys when prompt or model changes degrade extraction quality, because the failure mode of "the model started doing something subtly worse" is invisible without explicit measurement.

What I built

Worked across the AI engineering surface of the platform: the pipeline that takes a customer site and produces a deployed mirror.

- **Three apps + eight shared packages** in a Turbo monorepo: `admin` (operator surface), `client-portal` (customer surface), `public-site` (marketing); shared packages for `api`, `auth`, `billing`, `database`, `email`, `jobs`, `services`, `storage`.
- **Self-hosted crawler service** for scraping and full-site mapping. The team built its own rather than depending on a third-party crawl API, to control rate limits, robots.txt compliance, and cost.
- **Multi-provider LLM extraction layer** with one canonical tool schema. Every provider (Anthropic primary; OpenAI, Google, Groq, DeepSeek as alternates) returns the same structured shape, described once and routed by tenant or feature config.
- **Mirror-site pipeline** composing crawl → forced tool-use extraction → translation → automated deploy of versioned mirrors.
- **Eval framework**: runner, scorer, ground-truth JSON for representative sites, dated result runs that block deploys on regression.
- **AI logger** persisting full input/output payloads to JSONB so any historical run can be replayed offline for re-scoring against new rubrics.
- **Daily AI budget enforcement**: global + per-account caps, Redis-cached, fail-closed-over-budget / fail-open-on-DB-error, with a kill-switch ceiling.
- **robots.txt compliance, rate limiting, mock-mode fallback** for missing API keys so the pipeline degrades gracefully instead of hard-failing during incidents or local development.

Key decision

**Multi-provider extraction with one canonical tool schema.** The naïve approach is to ship a Claude-only integration first and add fallbacks later. We didn't. From early on, every provider returns the same shape: described once as a tool schema, used identically by Claude, OpenAI, Google, Groq, DeepSeek. The cost is real abstraction work upfront. The payoff is three properties that compound:

1. Providers can be swapped per-tenant or per-feature without touching the extraction layer.
2. Provider outages don't take down the pipeline; the next provider handles the call.
3. Adding a new provider is a config entry, not an integration sprint.

For a B2B SaaS where extraction quality is the differentiator and provider lock-in is a real risk, that abstraction is load-bearing.
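A minimal sketch of how per-tenant routing and outage fallback can sit behind one result shape. Every name here (`ExtractionResult`, `ProviderAdapter`, `routeExtraction`, the config layout) is hypothetical and chosen for illustration; it is not the platform's actual code.

```typescript
// Hypothetical canonical result shape shared by every provider.
interface ExtractionResult {
  title: string;
  sections: { heading: string; body: string }[];
}

// Each adapter hides its vendor SDK and returns the canonical shape.
interface ProviderAdapter {
  name: string;
  extract(html: string): Promise<ExtractionResult>;
}

// Try providers in the tenant's configured order; fall through on failure.
async function routeExtraction(
  tenantId: string,
  html: string,
  adapters: Record<string, ProviderAdapter>,
  routing: Record<string, string[]>, // tenantId -> ordered provider names
  defaultOrder: string[],
): Promise<{ provider: string; result: ExtractionResult }> {
  const order = routing[tenantId] ?? defaultOrder;
  const errors: string[] = [];
  for (const name of order) {
    const adapter = adapters[name];
    if (!adapter) {
      errors.push(`${name}: not registered`);
      continue;
    }
    try {
      return { provider: name, result: await adapter.extract(html) };
    } catch (err) {
      // An outage or bad response on one provider hands the call to the next.
      errors.push(`${name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`all providers failed for ${tenantId}: ${errors.join("; ")}`);
}
```

In this sketch, swapping a tenant's provider means editing the `routing` entry for that tenant; the caller only ever sees an `ExtractionResult`.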

Outcome

Live in production. ~129 tables in Postgres backing the multi-tenant data model. Eval harness covers ground-truth cases for representative customer sites; new cases can be added with a JSON file. Per-tenant AI cost dashboards show spend in real time, with daily caps enforced *before* every API call so the worst-case bill is bounded by arithmetic, not by hope.

Hindsight

The decision I'd defend most strongly is having the eval harness in place early. Once it existed, prompt edits and model swaps stopped being "ship and watch" and started being "ship and gate." For any AI-backed feature I work on next, the harness comes before the first prompt change, not after the third.

The decision I'd reconsider is some of the early features that predated the canonical multi-provider tool schema. Refactoring those to fit the unified shape took longer than building them provider-agnostically would have on day one. The lesson generalises: when you know an abstraction is coming, paying for it before the feature ships is cheaper than retrofitting it after.

Architecture notes

Three patterns carry most of the weight on the AI side of the platform.

One canonical tool schema, many providers

Every structured-output call in the pipeline goes through a single tool schema describing the extraction shape — the fields, types, constraints the downstream code expects. The same schema is registered with every LLM provider in the rotation. The rest of the system reads provider-agnostic results.

This shape isn't unique to LLM work. It's the same idea as a typed adapter layer over multiple databases or message brokers. The wrinkle specific to LLMs is that forced tool use, paired with validation of the returned arguments, is what makes the contract binary. The provider must call the tool, and the arguments either match the schema or the call is treated as an error. There's no "kind of works" middle ground.
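As an illustration, here is a sketch of one tool definition mapped onto two provider request formats, plus a strict parser for the returned arguments. The tool name, its fields, and the helper names are hypothetical. The request fragments follow the publicly documented shapes of the Anthropic Messages API (`input_schema`, `tool_choice` of type `tool`) and the OpenAI Chat Completions API (`function.parameters`, `tool_choice` of type `function`); other providers would need their own small mapper.

```typescript
// Hypothetical canonical tool: the name and fields are illustrative only.
const extractPageTool = {
  name: "record_page_extraction",
  description: "Record the structured content extracted from one page.",
  schema: {
    type: "object",
    properties: {
      title: { type: "string" },
      language: { type: "string" },
      sections: {
        type: "array",
        items: {
          type: "object",
          properties: { heading: { type: "string" }, body: { type: "string" } },
          required: ["heading", "body"],
        },
      },
    },
    required: ["title", "language", "sections"],
  },
} as const;

type CanonicalTool = typeof extractPageTool;

// Anthropic Messages API fragment: tools carry `input_schema`.
function toAnthropic(tool: CanonicalTool) {
  return {
    tools: [{ name: tool.name, description: tool.description, input_schema: tool.schema }],
    tool_choice: { type: "tool", name: tool.name }, // forces this tool
  };
}

// OpenAI Chat Completions fragment: tools are wrapped as functions.
function toOpenAI(tool: CanonicalTool) {
  return {
    tools: [{
      type: "function",
      function: { name: tool.name, description: tool.description, parameters: tool.schema },
    }],
    tool_choice: { type: "function", function: { name: tool.name } },
  };
}

interface PageExtraction {
  title: string;
  language: string;
  sections: { heading: string; body: string }[];
}

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Binary contract: arguments match the canonical shape or the call is rejected.
function parseExtraction(args: unknown): PageExtraction {
  if (!isRecord(args)) throw new Error("tool arguments must be an object");
  const { title, language, sections } = args;
  if (typeof title !== "string" || typeof language !== "string" || !Array.isArray(sections)) {
    throw new Error("tool arguments do not match the canonical schema");
  }
  const parsed = sections.map((s: unknown) => {
    if (!isRecord(s)) throw new Error("section must be an object");
    const { heading, body } = s;
    if (typeof heading !== "string" || typeof body !== "string") {
      throw new Error("section does not match the canonical schema");
    }
    return { heading, body };
  });
  return { title, language, sections: parsed };
}
```

Downstream code in this sketch only ever handles `PageExtraction`; which provider produced it is invisible past `parseExtraction`.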

Eval harness as the deploy gate

The eval framework lives in a `services/eval/` package with three pieces:

- A runner that executes the full extraction pipeline against a curated case set
- A scorer that grades each output by schema correctness, property assertions, and a small set of regression cases
- Ground-truth JSON for representative customer sites, plus dated result runs that let me compare today's pass rate against last week's

The harness runs in CI on every change that touches prompts, tool schemas, or post-processing. If pass rate drops below a configured threshold for any tag, the build fails and the PR can't merge. That gate is what makes prompt-and-model changes safe to ship — not because the changes are guaranteed to be improvements, but because regressions are visible before they reach customers.
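The gate itself reduces to a per-tag pass-rate check. A minimal sketch, assuming a hypothetical `CaseResult` shape from the scorer and a hypothetical thresholds map; the tag names and numbers in the usage note are invented for illustration.

```typescript
// Hypothetical scorer output: one row per eval case.
interface CaseResult {
  caseId: string;
  tags: string[];
  passed: boolean;
}

interface GateReport {
  ok: boolean;
  passRates: Record<string, number>;
  failures: string[];
}

// Compute the pass rate per tag and fail the gate if any tag is below its threshold.
function evaluateGate(
  results: CaseResult[],
  thresholds: Record<string, number>,
  defaultThreshold = 1.0, // assumption: tags without an entry must pass fully
): GateReport {
  const totals: Record<string, { passed: number; total: number }> = {};
  for (const r of results) {
    for (const tag of r.tags) {
      const t = totals[tag] ?? { passed: 0, total: 0 };
      t.total += 1;
      if (r.passed) t.passed += 1;
      totals[tag] = t;
    }
  }

  const passRates: Record<string, number> = {};
  const failures: string[] = [];
  for (const tag of Object.keys(totals)) {
    const t = totals[tag];
    const rate = t.passed / t.total;
    const min = thresholds[tag] ?? defaultThreshold;
    passRates[tag] = rate;
    if (rate < min) {
      failures.push(`${tag}: ${(rate * 100).toFixed(1)}% < ${(min * 100).toFixed(1)}%`);
    }
  }
  return { ok: failures.length === 0, passRates, failures };
}
```

A CI step would call something like `evaluateGate(results, { pricing: 0.9 })` and exit non-zero when `ok` is false, printing `failures` so the PR shows which tag regressed.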

Budget guards as availability, not optimisation

LLM calls are external paid services. The budget layer treats them like any other untrusted upstream:

- Per-call `max_tokens` ceilings sized to the feature, not the model default
- Per-tenant daily and hourly caps, enforced before the API call via Redis-cached counters
- A global daily ceiling as kill switch
- Cost-per-call computed at log time and stored alongside latency and token counts, so cost dashboards are a SQL aggregate, not a join across pricing tables
- Mock-mode fallback when an API key is missing or a provider errors out, so the pipeline returns a degraded result instead of cascading failure into the rest of the system

The result is that the worst-case AI bill on this platform is a known, bounded number. That property — a number you can write down — is what distinguishes "we built an AI feature" from "we run an AI feature in production."
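To make the "bounded by arithmetic" point concrete, here is a sketch of a pre-call check. `SpendStore` is a hypothetical stand-in for the Redis-cached counters, and the pricing fields, key format, and function names are assumptions for illustration. The worst-case cost of a call is computed from its `max_tokens` ceiling before anything is sent.

```typescript
// Hypothetical stand-in for Redis-cached spend counters.
interface SpendStore {
  getSpentUsd(key: string): Promise<number>;
}

// Hypothetical pricing shape: USD per million tokens.
interface Pricing {
  inputUsdPerMTok: number;
  outputUsdPerMTok: number;
}

interface BudgetLimits {
  tenantDailyUsd: number;
  globalDailyUsd: number; // kill-switch ceiling
}

type BudgetDecision = { allowed: true } | { allowed: false; reason: string };

// Upper bound on one call's cost, given its max_tokens ceiling.
function worstCaseCostUsd(inputTokens: number, maxOutputTokens: number, p: Pricing): number {
  return (inputTokens * p.inputUsdPerMTok + maxOutputTokens * p.outputUsdPerMTok) / 1e6;
}

// Runs before the provider call: deny if the worst case would cross a cap.
async function checkBudget(
  store: SpendStore,
  tenantId: string,
  day: string,
  estimatedCostUsd: number,
  limits: BudgetLimits,
): Promise<BudgetDecision> {
  let tenantSpent: number;
  let globalSpent: number;
  try {
    tenantSpent = await store.getSpentUsd(`tenant:${tenantId}:${day}`);
    globalSpent = await store.getSpentUsd(`global:${day}`);
  } catch (err) {
    // Fail open when the counter store is unreachable.
    return { allowed: true };
  }
  // Fail closed when a cap would be exceeded; the global ceiling is checked first.
  if (globalSpent + estimatedCostUsd > limits.globalDailyUsd) {
    return { allowed: false, reason: "global daily ceiling reached" };
  }
  if (tenantSpent + estimatedCostUsd > limits.tenantDailyUsd) {
    return { allowed: false, reason: "tenant daily cap reached" };
  }
  return { allowed: true };
}
```

With illustrative prices of 3 and 15 USD per million input and output tokens, a call with 2,000 input tokens and a 1,000-token ceiling costs at most (2000 × 3 + 1000 × 15) / 1,000,000 = 0.021 USD, and that is the figure checked against the caps.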

What this work taught me that's transferable

LLM features that survive in production aren't the ones with the fanciest prompts. They're the ones where the boundary between code and model is treated as the same kind of contract as any other API boundary — typed input, typed output, observability, cost ceilings, fallback paths. Most of the work isn't model work; it's plumbing discipline applied to a new shape of dependency. That's the lens I take into any AI-backed feature now.

Tech stack

- Anthropic Claude API
- Multi-provider LLM (OpenAI, Google, Groq, DeepSeek)
- Forced tool use
- Next.js 15
- tRPC v11
- TypeScript
- Drizzle ORM
- PostgreSQL
- pnpm + Turbo monorepo
- Inngest (jobs)
- Clerk (auth)
- Stripe (billing)
- Resend (email)
- Cloudflare R2 (storage)