Forced tool use — getting Claude to return structured output that's actually reliable

Asking the model nicely for JSON works in demos and fails in production. Forced tool use is the pattern that turns 'usually returns valid JSON' into 'returns valid JSON or fails loudly.'

Apr 12, 2026

The first time you ship a Claude-backed feature into production, you discover something the demos don't tell you: "respond with valid JSON" is not a constraint. It's a suggestion. Most of the time the model honours it. Some percent of the time — and the percent depends on the model, the prompt length, and what mood the cosmos is in — the response comes back as a markdown-fenced block, or with a leading "Sure, here's the JSON:", or with a missing closing brace, or with an emoji where a string was expected.

If you're parsing that response in a try/catch and falling back to a default, you've shipped a class of bug you can't observe. The fallback quietly hides the parse failures. You think the feature works. The metrics say the feature works. The user gets a degraded experience and you find out two weeks later when someone reports it.
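To make that concrete, here is a minimal sketch of the silent-fallback pattern next to a variant that fails loudly. The names (`Metadata`, `parseQuietly`, `parseLoudly`, `ModelOutputError`) are hypothetical and exist only for illustration.

```typescript
// Hypothetical shape, for illustration only.
interface Metadata {
  title: string;
  summary: string;
}

const DEFAULT_METADATA: Metadata = { title: "Untitled", summary: "" };

// Anti-pattern: every parse failure becomes a plausible-looking default,
// so nothing upstream can tell that the model's reply was malformed.
function parseQuietly(raw: string): Metadata {
  try {
    return JSON.parse(raw) as Metadata;
  } catch {
    return DEFAULT_METADATA;
  }
}

// Loud variant: the error carries the raw reply so it can be logged and counted.
class ModelOutputError extends Error {
  raw: string;
  constructor(message: string, raw: string) {
    super(message);
    this.name = "ModelOutputError";
    this.raw = raw;
  }
}

function parseLoudly(raw: string): Metadata {
  try {
    return JSON.parse(raw) as Metadata;
  } catch {
    throw new ModelOutputError("Model reply was not valid JSON", raw);
  }
}
```

With `parseQuietly`, a reply that starts with "Sure, here's the JSON:" looks identical to an article that genuinely has no title. With `parseLoudly`, it shows up in your error rate.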

The fix isn't a more elaborate prompt. The prompt is the wrong layer to fix this on. The fix is forced tool use.

The shape

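Tool use lets you describe a function (name, description, JSON schema for the parameters) and have the model invoke it instead of replying in freeform text. With most providers you can additionally force the model to use a specific tool, which means the model is not allowed to reply with plain prose. It must call the function, and the arguments arrive as an already-parsed JSON object shaped by your schema. One caveat: by default Anthropic's API uses the schema to steer the arguments rather than to reject a call that drifts from it, so a missing required field or an over-long string is something your own validation catches, not something the API errors on.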

That second property is the one that matters in production. You're no longer asking the model to be well-behaved. You're constraining the API surface so that "well-behaved" is the only shape it can return.

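import Anthropic from "@anthropic-ai/sdk";

// The client reads ANTHROPIC_API_KEY from the environment by default.
const anthropic = new Anthropic();

// articleText is the raw article body, loaded elsewhere in the application.
const message = await anthropic.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 2000,
  tools: [
    {
      name: "extract_metadata",
      description: "Extract structured metadata from the article.",
      // JSON Schema describing the arguments the model must supply.
      input_schema: {
        type: "object",
        properties: {
          title: { type: "string" },
          summary: { type: "string", maxLength: 280 },
          tags: { type: "array", items: { type: "string" }, maxItems: 5 },
          language: { type: "string", enum: ["en", "ta", "si"] },
        },
        required: ["title", "summary", "language"],
      },
    },
  ],
  // Force this specific tool so the reply cannot be plain prose.
  tool_choice: { type: "tool", name: "extract_metadata" },
  messages: [{ role: "user", content: articleText }],
});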

The tool_choice field is the part that does the work. Without it, the model decides whether to use the tool or reply in prose. With it, the tool call is the only legal move.
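On the way back, the forced call arrives as a `tool_use` block in the response's `content` array, with the arguments already parsed in `input`. A minimal sketch of pulling it out follows. The types are hand-written and cover only the fields the sketch reads, and the helper name `getForcedToolInput` is hypothetical; in real code the SDK's own message type would replace `MessageLike`.

```typescript
// Minimal local types covering only the fields this sketch reads.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown };

interface MessageLike {
  content: ContentBlock[];
  stop_reason: string | null;
}

// Hypothetical helper: return the arguments of the forced tool call, or throw.
function getForcedToolInput(message: MessageLike, toolName: string): unknown {
  // A reply cut off by max_tokens can carry incomplete arguments, so reject it.
  if (message.stop_reason === "max_tokens") {
    throw new Error(`Response truncated before ${toolName} arguments were complete`);
  }
  for (const block of message.content) {
    if (block.type === "tool_use" && block.name === toolName) {
      return block.input;
    }
  }
  throw new Error(`No tool_use block for ${toolName} in response`);
}
```

The return type is deliberately `unknown`: nothing here has checked the arguments yet, so the caller is pushed toward validating before use.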

Why it changes the failure shape

Before forced tool use, the failure mode of structured output is silent malformation. The response parses sometimes. The bugs are in the long tail of cases where it doesn't, and you only find them through user reports.

After forced tool use, the failure mode is loud and observable. Either the response carries a tool call whose arguments pass your schema check, or you get an error you can catch, log, retry, or surface: an API error, a truncated response, or a validation failure in your own code. There is no "kind of works." The contract is binary.
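The "catch, log, retry, or surface" part can be sketched as a small wrapper. This is one possible shape, not a prescribed one: `withRetries` and its options are hypothetical names, and `operation` stands for whatever makes the request and checks the result.

```typescript
// Hypothetical helper: run an async operation up to `attempts` times.
// Every failure is reported through `onFailure`, and the last error is
// rethrown so the caller still sees a hard failure.
async function withRetries<T>(
  operation: () => Promise<T>,
  options: { attempts: number; onFailure?: (error: unknown, attempt: number) => void },
): Promise<T> {
  let lastError: unknown = new Error("withRetries: attempts must be at least 1");
  for (let attempt = 1; attempt <= options.attempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (options.onFailure) options.onFailure(error, attempt);
    }
  }
  throw lastError;
}
```

Backoff between attempts is left out to keep the sketch short. The point is that a bad response is an exception with a log line attached, never a quiet default.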

That's the win. Not that the model is more accurate — accuracy is the prompt's job. The win is that the boundary between your code and the model's output now has the same guarantees as any other typed API.

What I do downstream

Three patterns that compound the win:

Validate again with Zod after the tool call. The JSON schema you send shapes the arguments, but I run the result through a Zod schema in my own code anyway. The two checks aren't redundant: the JSON schema describes the shape to the model, while Zod confirms that shape and enforces business invariants the JSON schema can't express ("language must be one of the tenant's supported languages, which depends on tenant config the model doesn't see"). If Zod fails, that's a bug, and it bubbles up to logging the same way any other validation error would.
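A dependency-free sketch of that second check follows. It is written as a plain function rather than a Zod schema so that it runs on its own; in the setup described here the same rules would live in a Zod schema with a refinement for the tenant rule. `TenantConfig`, `supportedLanguages` and `validateMetadata` are hypothetical names.

```typescript
interface ArticleMetadata {
  title: string;
  summary: string;
  tags: string[];
  language: string;
}

// Hypothetical tenant configuration that the model never sees.
interface TenantConfig {
  supportedLanguages: string[];
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Checks the shape first, then the tenant-specific invariant; throws on any violation.
function validateMetadata(input: unknown, tenant: TenantConfig): ArticleMetadata {
  if (typeof input !== "object" || input === null) {
    throw new Error("metadata: expected an object");
  }
  const record = input as Record<string, unknown>;
  const { title, summary, language } = record;
  // tags is not in the schema's required list, so default it to an empty array.
  const tags = record.tags === undefined ? [] : record.tags;

  if (typeof title !== "string") throw new Error("metadata.title: expected a string");
  if (typeof summary !== "string" || summary.length > 280) {
    throw new Error("metadata.summary: expected a string of at most 280 characters");
  }
  if (!isStringArray(tags) || tags.length > 5) {
    throw new Error("metadata.tags: expected at most 5 strings");
  }
  if (typeof language !== "string") throw new Error("metadata.language: expected a string");

  // Business invariant that the JSON schema cannot express.
  if (tenant.supportedLanguages.indexOf(language) === -1) {
    throw new Error(`metadata.language: "${language}" is not enabled for this tenant`);
  }
  return { title, summary, tags, language };
}
```

Here "si" is valid according to the tool's enum and still rejected for a tenant that only enables "en" and "ta", which is exactly the kind of rule the model cannot know about.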

Treat the tool call result as data, not commands. The temptation is to use a tool definition to make the model "do something" — call an API, write to a database, send a notification. I don't. I use forced tool use purely to constrain output shape. The model returns structured data; my code decides what to do with it. That separation makes the system testable: I can unit-test "given this metadata, do X" without involving the model, and I can unit-test "given this article text, the model returns the expected metadata" without involving the side-effect code.
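One way to keep that separation honest is to have the decision step return a description of the side effects instead of performing them. The sketch below assumes a hypothetical `Action` type and `planActions` function; the actual side-effect code would consume the returned list.

```typescript
// Hypothetical metadata shape returned by the forced tool call.
interface ArticleMetadata {
  title: string;
  summary: string;
  tags: string[];
  language: string;
}

// Hypothetical side effects, described as data instead of being executed.
type Action =
  | { kind: "index_article"; title: string; language: string }
  | { kind: "attach_tags"; tags: string[] };

// Pure function: no model, no network, no database, so it is trivial to unit-test.
function planActions(metadata: ArticleMetadata): Action[] {
  const actions: Action[] = [
    { kind: "index_article", title: metadata.title, language: metadata.language },
  ];
  if (metadata.tags.length > 0) {
    actions.push({ kind: "attach_tags", tags: metadata.tags });
  }
  return actions;
}
```

A test for `planActions` needs only a literal metadata object, and a test for the model call needs only to assert on the metadata it returns.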

Log every tool call with input + output + token cost. Production AI features without logging are debug-impossible. The tool-call-as-output pattern makes logging easy because the response is already structured — you just dump the input, output, and usage to wherever your application logs go. Over a few weeks of real traffic, those logs are the dataset you use to write your evals. (More on evals in a separate post.)
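A log record for this can be small. The sketch below assumes the response's `usage` object with `input_tokens` and `output_tokens`; the `Pricing` numbers are placeholders you would supply yourself, and `buildToolCallLog` is a hypothetical name.

```typescript
// Token counts as reported in the response's usage field.
interface Usage {
  input_tokens: number;
  output_tokens: number;
}

// Hypothetical per-million-token prices in USD; substitute the real rates for your model.
interface Pricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

interface ToolCallLog {
  timestamp: string;
  tool: string;
  input: string; // what was sent to the model
  output: unknown; // the tool call arguments that came back
  usage: Usage;
  estimatedCostUsd: number;
}

function buildToolCallLog(args: {
  tool: string;
  input: string;
  output: unknown;
  usage: Usage;
  pricing: Pricing;
  now?: Date;
}): ToolCallLog {
  const { tool, input, output, usage, pricing } = args;
  const estimatedCostUsd =
    (usage.input_tokens * pricing.inputPerMillion +
      usage.output_tokens * pricing.outputPerMillion) / 1e6;
  return {
    timestamp: (args.now ?? new Date()).toISOString(),
    tool,
    input,
    output,
    usage,
    estimatedCostUsd,
  };
}
```

Writing `JSON.stringify(buildToolCallLog(...))` to the application log gives one line per call, which is also a convenient format to replay later as eval cases.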

When NOT to force tool use

Two cases where I leave it off:

Open-ended creative output. If the feature is "generate a paragraph of plain text," forcing a tool call adds friction without value. Use freeform output, validate length, and accept that "the model wrote a paragraph" is the contract.

Multi-step reasoning where the steps aren't known in advance. If the model needs to call several tools in sequence, forcing one specific tool defeats the agent loop. Use tool_choice: { type: "auto" } and let the model pick. (Forcing tool use is for single-shot structured output, not for agent loops.)
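The three situations map onto three values for `tool_choice`. A small sketch, assuming a hypothetical `CallMode` type that each call site declares:

```typescript
// The tool_choice shapes used in this post: forced single tool, or model decides.
type ToolChoice = { type: "auto" } | { type: "tool"; name: string };

// Hypothetical description of what a call site needs.
type CallMode =
  | { kind: "structured_output"; toolName: string }
  | { kind: "agent_loop" }
  | { kind: "freeform" };

// Returns undefined for freeform output, where no tools are sent at all.
function toolChoiceFor(mode: CallMode): ToolChoice | undefined {
  if (mode.kind === "structured_output") {
    return { type: "tool", name: mode.toolName };
  }
  if (mode.kind === "agent_loop") {
    return { type: "auto" };
  }
  return undefined;
}
```

Making the mode explicit at the call site keeps "forced" from becoming a default that quietly breaks an agent loop added later.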

The general rule

Most "hallucination" bugs in production AI features aren't actually hallucinations in the philosophical sense. They're API contract violations in a system that doesn't have a contract. Forced tool use is how you give the boundary between your code and the model the same shape as any other boundary in your system: a typed schema, a validation layer, an error path when something goes wrong.

You're not asking the model to behave. You're constraining the channel so behaviour is the only thing that fits through it.

That's not a prompting trick. It's an architecture choice — and it's the single biggest difference I've seen between AI features that survive production and AI features that quietly degrade until someone notices.

This pattern is part of the AI engineering surface I work on at the multi-tenant AI mirror-site platform.