# Shipping Safe Tooling: Schemas, Validation, and Failure Modes in Tool Calling
The first time I watched a model call a tool in production, I felt the same kind of relief you feel when a flaky job finally goes green.
We had a workflow that used to require a human to copy data between systems. Then one day, it did not. The agent read the request, pulled the right record, updated the right fields, and posted a clean summary back to the user. A minute of work disappeared.
Two weeks later, that same tool path became the source of our first real incident.
Nothing “hacked” us. Nobody found a jailbreak. The model just did what models do: it made a confident decision based on an incomplete picture. It retried after a timeout. It guessed a missing parameter. It used the right tool with the wrong identity context. The end result was not catastrophic, but it was embarrassing and expensive. We had to unwind actions that looked legitimate on paper.
That is the reality of tool calling: once language turns into actions, you are no longer debating style or helpfulness. You are shipping a distributed system with an unreliable planner at the center.
This post is a guide to making that system safe enough to ship. Not perfect. Not “secure by prompt”. Safe enough that when the model is wrong, your product fails in a way you can live with.
## Why tool calling changes the risk shape
Plain chat failures are reputational. Tool calling failures are operational.
If a chat assistant hallucinates, you get a wrong answer. If a tool-using agent hallucinates, you can get:
- wrong state written to a system of record
- duplicate actions (because retries are normal)
- accidental data exposure (because authorization is subtle)
- expensive cascades (because agents compound mistakes)
OWASP calls out the relevant buckets directly: prompt injection, insecure output handling, insecure plugin design, and excessive agency [1].
If you want one phrase that captures why this is hard: tool calling turns your model into a confused deputy. It can be manipulated by inputs, and it can also simply misunderstand. Either way, it is still holding credentials.
## A practical threat model for tool calling
Tool calling failures are not all “security” failures, but the mitigations often overlap. This table is how I frame it during design reviews:
| Failure mode | What it looks like in production | What usually fixes it |
|---|---|---|
| Prompt injection (direct or indirect) | Tool arguments that reflect instructions found in a document, email, or web page | Treat tool outputs and retrieved content as untrusted, limit scope, require confirmations for side effects |
| Insecure output handling | You execute model-generated commands, queries, or templates without validation | Strict schemas, server-side validation, safe encoders, parameterization |
| Excessive agency | A model has permissions it does not need, and a mistake becomes a real action | Least privilege, scoped tokens, separate read vs write tools, human approvals |
| Ambiguity and guessing | Model fills in missing parameters because “it seems obvious” | Make tools demand explicit fields, train the system to ask clarifying questions |
| Retry chaos | Timeouts cause duplicate actions, out-of-order effects, stuck loops | Idempotency keys, timeouts, bounded retries, state machines |
| Tool result poisoning | A tool returns untrusted text that the model treats as instructions | Put tool results in a “data-only” lane, strip/transform, and never feed raw tool output into “system” authority |
The main idea is simple: your mitigations cannot live only in prompts. They need to exist at the tool contract layer, the authorization layer, and the execution layer.
## Design tools like public APIs (because they are)
The quickest way to hurt yourself is to treat tools as internal helper functions that only your model will call.
In practice, any tool that can be triggered by language should be treated like a public API:
- adversarial inputs exist, even if you do not want to think about them
- retries happen
- timeouts happen
- partial failure happens
- logs get audited later, by someone who was not in the room
That mindset changes the shape of your interfaces.
## The schema is your first security boundary
Tool schemas are not just a convenience for parsing. They are a boundary. A good schema makes it hard for the model to “invent” dangerous flexibility.
This is the core pattern:
- constrain what the model can ask the tool to do
- validate that request as if it came from an untrusted client
- authorize it as the user, not as the model
Most tool APIs already use JSON Schema or something close to it. Anthropic’s tool use API takes a JSON Schema directly (input_schema) [2]. OpenAI’s Structured Outputs can enforce that a tool call matches the schema you provide when you opt into strict mode, but it does not make the values correct [3].
That last clause matters. Schema adherence is not correctness. It is just a higher quality failure mode.
### Schema patterns that reduce real incidents
When I look at tool calling incidents, the root cause is often “we gave the model too much room”. Here are patterns that reduce that room:
| Pattern | Do this | Avoid this |
|---|---|---|
| Separate read vs write tools | search_tickets, get_ticket, update_ticket | One tool that does everything with a mode string |
| Bound strings | max length, allowed chars, known prefixes | “Free-form text” for identifiers |
| Enumerate choices | enums for priority, status, currency, region | “any string” for fields that drive state |
| Make units explicit | amount_cents not amount | “amount”: 20 (is it dollars, cents, euros?) |
| Require stable identifiers | ticket_id, customer_id, account_id | Names that can collide |
| Reject unknown fields | fail closed | permissive parsing that quietly drops junk |
If you only take one recommendation from this post, take this: do not design schemas that encourage the model to smuggle intent in a text field.
### A concrete tool schema example (language agnostic)
Here is what “less room” looks like as a tool contract. Imagine a support agent that can update a ticket.
The unsafe design is an update_ticket tool that takes a ticket_id and a freeform_update_text. That invites the model to blend intent, justification, policy, and state change into a single field.
The safer design forces the model to name the exact changes, and it gives you a place to validate invariants:
```json
{
  "name": "update_ticket",
  "description": "Update specific fields on a ticket. Requires a stable id, explicit changes, and an idempotency key.",
  "input_schema": {
    "type": "object",
    "additionalProperties": false,
    "required": ["ticket_id", "patch", "idempotency_key"],
    "properties": {
      "ticket_id": { "type": "string", "pattern": "^TICK_[A-Z0-9]{6,}$" },
      "idempotency_key": { "type": "string", "minLength": 16, "maxLength": 128 },
      "patch": {
        "type": "object",
        "additionalProperties": false,
        "properties": {
          "status": { "type": "string", "enum": ["open", "pending", "resolved"] },
          "priority": { "type": "string", "enum": ["p0", "p1", "p2", "p3"] },
          "assignee_id": { "type": "string" }
        }
      }
    }
  }
}
```
This does not guarantee the model will make the correct choices, but it does force it to be explicit. It also gives your server a clean place to enforce: only certain roles can set priority, only certain users can assign, and resolved requires a resolution note on the ticket itself.
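Here is a minimal sketch of those server-side checks. The role names and rules (for example, that only a support lead may change priority) are my illustrative assumptions, not a real policy:

```python
# Sketch: server-side policy checks for the update_ticket tool above.
# Role names and rules are illustrative assumptions, not a real policy.

ALLOWED_STATUSES = {"open", "pending", "resolved"}

class ToolValidationError(Exception):
    """Raised when a schema-valid request still violates server-side policy."""

def authorize_ticket_patch(patch: dict, caller_roles: set) -> None:
    # Only certain roles can set priority.
    if "priority" in patch and "support_lead" not in caller_roles:
        raise ToolValidationError("only support leads may change priority")
    # Only support agents can reassign tickets.
    if "assignee_id" in patch and "support_agent" not in caller_roles:
        raise ToolValidationError("caller may not reassign tickets")
    # Defense in depth: re-check the enum even though the schema already did.
    if "status" in patch and patch["status"] not in ALLOWED_STATUSES:
        raise ToolValidationError("unknown status")
```

The point is that the schema and this function are two layers of the same contract: the schema shrinks the model's room to maneuver, and the server enforces what the schema cannot express.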
## Validation: do not trust the model even when it is “structured”
Even if you use strict structured outputs, your tool server should validate like it is speaking to the internet.
That means:
- parse and validate arguments server-side
- clamp sizes (strings, arrays, nested objects)
- reject unknown fields
- validate cross-field invariants
- validate against current state (not just the payload)
Example: a transfer_funds tool that validates a JSON schema but does not validate “source account belongs to this user” is not safe. It is just neatly formatted.
OWASP’s “Insecure Output Handling” is basically this warning in a category label [1].
### Validation invariants that matter in practice
Schema validation catches shape errors. Production incidents often come from invariant errors:
- cross-field rules: refund_amount must be <= captured_amount
- state machine rules: you cannot go from resolved back to open without a reason
- freshness rules: the update is based on stale state, so you need an ETag or version check
- ownership rules: the target resource belongs to a different tenant
These checks belong on the tool server. If you push them into the prompt, you are turning policy into a suggestion.
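As a sketch of what that looks like on a tool server, here is a refund check covering three of those invariants. The field names (refund_amount_cents, expected_version, and so on) are illustrative assumptions:

```python
# Sketch: invariant checks that JSON Schema cannot express.
# The field names and the charge shape are illustrative assumptions.

def validate_refund_invariants(args: dict, current_charge: dict) -> list:
    """Return a list of violations; an empty list means the request may proceed."""
    errors = []
    # Cross-field rule: refund_amount must be <= captured_amount.
    if args["refund_amount_cents"] > current_charge["captured_amount_cents"]:
        errors.append("refund exceeds captured amount")
    # Ownership rule: the target resource must belong to the caller's tenant.
    if args["tenant_id"] != current_charge["tenant_id"]:
        errors.append("charge belongs to a different tenant")
    # Freshness rule: reject decisions made against stale state.
    if args["expected_version"] != current_charge["version"]:
        errors.append("stale state, re-read the charge and retry")
    return errors
```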
## Authorization: the model should not hold the keys
The most dangerous pattern I see is a tool executor that runs as a broad service account and relies on prompt instructions for policy.
If you do that, prompt injection becomes privilege escalation. So does ordinary model confusion.
Instead:
- issue scoped, per-user tokens for tool calls where possible
- enforce RBAC and tenant boundaries at the API layer
- keep “admin” operations on a separate path with stronger approvals
- default tools to read-only, then explicitly grant write scope
This is the principle of least privilege. It is not new. Tool calling just makes it easy to forget because the model looks like “your code”.
## Budgets and circuit breakers: put a ceiling on agency
Even with perfect schemas, a model can still be wrong in a costly way. A safe system has ceilings.
The ceilings I reach for first:
- max tool calls per run (and per tool)
- max wall-clock time per run
- max spend per run (tokens plus tool fees)
- rate limits per user and per tenant
- a kill switch that turns write tools off without redeploying
These are not just cost controls. They are safety controls. They turn “agent spiraled” into “agent stopped”.
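A ceiling can be as small as an object you consult before every tool call. This is a sketch; the specific limits and the kill-switch flag are illustrative assumptions:

```python
import time

# Sketch: per-run ceilings checked before every tool invocation.
# The default limits and the writes_enabled kill switch are illustrative.

class BudgetExceeded(Exception):
    pass

class RunBudget:
    def __init__(self, max_tool_calls=20, max_seconds=120, writes_enabled=True):
        self.max_tool_calls = max_tool_calls
        self.max_seconds = max_seconds
        self.writes_enabled = writes_enabled  # flip to False to disable write tools
        self.calls = 0
        self.started = time.monotonic()

    def check(self, is_write: bool = False) -> None:
        """Raise instead of letting the run spiral."""
        self.calls += 1
        if self.calls > self.max_tool_calls:
            raise BudgetExceeded("max tool calls per run exceeded")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded("max wall-clock time per run exceeded")
        if is_write and not self.writes_enabled:
            raise BudgetExceeded("write tools are disabled by the kill switch")
```

In a real system the kill switch would read from a feature flag or config store, so you can flip it without redeploying.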
## Sandboxing: assume tools will be used in ways you did not intend
If any of your tools execute code, fetch URLs, open files, or touch customer data, you should assume they will be used in surprising ways. Sometimes that is adversarial. More often it is accidental, and the model is just creatively wrong.
My preference is to treat “powerful tools” like you would treat production cron: run them in constrained environments with a clear blast radius.
Examples of practical constraints:
- Code execution tools: container sandbox, read-only filesystem by default, no network by default, strict time and memory limits.
- Browser or fetch tools: domain allowlist, size limits, and a “download quarantine” step instead of direct ingestion.
- Data tools: field-level redaction, row limits, and explicit query budgets.
You can still build impressive agents inside these constraints. You just stop trusting the model to enforce them.
## Retries are normal. Make them safe with idempotency.
Agents retry. They retry because networks fail, providers rate limit, and long-running operations time out. If your tools are not designed for retries, you will eventually double-charge, double-email, double-delete, or double-create.
You can borrow a pattern from payments: idempotency keys.
Stripe’s write-up on idempotency is worth reading even if you never touch payments. It describes the fundamental distributed-systems problem: if you do not know whether a request succeeded, you will retry, and the server needs a way to treat retries as the same operation [4].
For tool calling, the practical translation is:
- Every side-effecting tool takes an idempotency_key.
- The tool server stores the outcome keyed by that value.
- Retries return the stored outcome instead of repeating the action.
This does not eliminate all complexity, but it turns “oops, we created five tickets” into “we returned the same ticket id five times”.
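Here is the shape of that outcome store as a sketch. The in-memory dict stands in for a durable table, and create_ticket is an illustrative tool, not a real API:

```python
# Sketch: an idempotency layer for a side-effecting tool.
# The dict stands in for a durable store keyed by idempotency_key.

_outcomes = {}

def create_ticket(idempotency_key: str, payload: dict) -> dict:
    # A retry with the same key returns the stored outcome, not a new ticket.
    if idempotency_key in _outcomes:
        return _outcomes[idempotency_key]
    ticket = {"ticket_id": f"TICK_{len(_outcomes) + 1:06d}", **payload}  # the side effect
    _outcomes[idempotency_key] = ticket
    return ticket
```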
### A retry policy table you can actually implement
| Tool class | Example | Safe to auto-retry? | Requirements |
|---|---|---|---|
| Read-only, deterministic | get_order_status | Yes | Cache, timeouts, exponential backoff |
| Read-only, expensive | search_logs | Yes, but bounded | Tight timeouts, pagination, budgets |
| Write with idempotency | create_refund | Yes | Idempotency key, audit log |
| Write without idempotency | send_email | No | Either make it idempotent or require a confirmation step |
| Irreversible destructive | delete_user | No | Two-phase commit, human confirmation, strong auth |
If you hear yourself saying “the model will probably not retry that”, assume the opposite.
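The table above translates into a small policy function in the executor. This is a sketch under the assumption that tools declare their class up front; the tool names and limits are illustrative:

```python
import time

# Sketch: class-based retry policy. Only tools declared read-only and
# retryable get automatic retries; the names and limits are illustrative.
AUTO_RETRYABLE = {"get_order_status", "search_logs"}

def call_with_retry(tool_name, fn, max_attempts=3, base_delay=0.5):
    """Retry only tools declared safe to retry, with exponential backoff."""
    if tool_name not in AUTO_RETRYABLE:
        return fn()  # side-effecting tools get exactly one attempt here
    for attempt in range(max_attempts):
        try:
            return fn()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```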
### A useful idempotency key convention for agents
If you already have a run_id, a practical pattern is:
idempotency_key = run_id + ":" + tool_name + ":" + step_index
It is not cryptography. It is just a stable name for “this exact intended action”, which is what the server needs to deduplicate.
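As code, the convention is a one-liner:

```python
def make_idempotency_key(run_id: str, tool_name: str, step_index: int) -> str:
    # A stable name for "this exact intended action" within one agent run.
    return f"{run_id}:{tool_name}:{step_index}"
```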
## Two-phase commit for agents: plan, then execute
In finance, we learn early that irreversible actions should have an approval step. Tool calling needs the same shape.
The common implementation pattern is a two-phase tool:
- prepare_* returns a summary of intended effects and a short-lived confirmation_token.
- commit_* executes using that token.
This gives you a place to insert:
- user confirmation in the UI
- policy checks (rate limits, fraud signals, business rules)
- a final sanity check (is the target still the same, is state still valid?)
It also makes incidents more recoverable. When the model confuses “cancel subscription” with “refund last invoice”, you want to catch that before it becomes a write.
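A minimal sketch of the prepare/commit split, assuming a refund tool; the token TTL, in-memory storage, and payload shape are my illustrative choices:

```python
import secrets
import time

# Sketch: the prepare/commit split for a refund tool.
# Token TTL, storage, and the payload shape are illustrative assumptions.

_pending = {}
TOKEN_TTL_SECONDS = 300  # confirmation tokens are short-lived

def prepare_refund(charge_id: str, amount_cents: int) -> dict:
    token = secrets.token_urlsafe(16)
    _pending[token] = {"charge_id": charge_id, "amount_cents": amount_cents,
                       "expires_at": time.time() + TOKEN_TTL_SECONDS}
    # The summary is what the user confirms in the UI before commit.
    return {"confirmation_token": token,
            "summary": f"Refund {amount_cents} cents on charge {charge_id}"}

def commit_refund(confirmation_token: str) -> dict:
    intent = _pending.pop(confirmation_token, None)  # tokens are single-use
    if intent is None or time.time() > intent["expires_at"]:
        raise PermissionError("unknown, used, or expired confirmation token")
    return {"status": "refunded", "charge_id": intent["charge_id"],
            "amount_cents": intent["amount_cents"]}
```

Because the token is single-use and short-lived, a confused retry of commit_* fails loudly instead of refunding twice.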
### An approvals table that helps teams align quickly
Teams often fight about where the approval line should be, because the model feels like it “should be able to handle it”.
This is a simple default that keeps you out of trouble:
| Action type | Examples | Suggested default |
|---|---|---|
| Reversible, low impact | add an internal comment, tag a ticket | auto-execute |
| Reversible, medium impact | assign ownership, change priority | auto-execute with tight auth and audit logs |
| Financial or customer-impacting | refunds, cancellations, outbound emails | two-phase commit with user confirmation |
| Destructive or irreversible | deletes, compliance exports, account bans | human approval plus stronger auth |
You can move the line over time. Start conservative and earn autonomy with data.
## Tool results are data, not instructions
One of the most common self-inflicted wounds is piping raw tool output back into the model with full authority.
Tool output is untrusted input. Treat it like you would treat HTML from the internet:
- limit its size
- strip or transform markup
- avoid letting it control tool selection
- keep it in a “data-only” lane
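One cheap version of the data-only lane is a function every tool result passes through before it reaches the model. The size ceiling and the wrapper convention here are my illustrative assumptions, and the wrapper alone is not a security boundary:

```python
MAX_RESULT_CHARS = 4000  # illustrative size ceiling

def to_data_lane(raw_tool_output: str) -> str:
    """Bound and wrap tool output so downstream code treats it as data.

    Wrapping is one defensive layer, not a security boundary by itself:
    authorization and tool selection must still be enforced outside the model.
    """
    text = raw_tool_output[:MAX_RESULT_CHARS]
    return f'<tool_result untrusted="true">\n{text}\n</tool_result>'
```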
This is also where vendor features can help. Anthropic’s advanced tool use work includes controls like tool search and caller constraints, which are signs that tool ecosystems are moving toward explicit boundaries instead of “anything goes” [5].
### Prefer structured tool results over raw text
If you can choose, have tools return structured data and short summaries, not a giant blob of text.
Two reasons:
- It is easier to validate and redact.
- It is harder for “instruction-like” strings to sneak into the agent loop.
If you need to return long text, consider returning a handle and requiring a separate get_* tool with explicit size limits and pagination.
## A note on tool libraries (what I actually reach for)
This post is language agnostic on purpose, but I will admit a bias here: for tool schemas, I prefer “types you can run”.
In TypeScript, that often means a runtime validator like Zod that you can use to validate tool arguments on the server. In Python, Pydantic plays a similar role. The specific library matters less than the habit: validate at the boundary, and keep the boundary tight.
## Observability: treat tool calls like money movement
If you cannot answer “who did what, and why” after an incident, you will end up turning off the agent.
At minimum, log these fields for every tool call:
| Field | Why you need it |
|---|---|
| trace_id / run_id | tie tool calls to a single agent run |
| user_id / tenant_id | enforce and audit authorization boundaries |
| tool_name and version | schemas evolve, you need to know which contract applied |
| arguments (redacted) | debug and postmortem analysis |
| idempotency key | safe retries and deduplication |
| outcome | success, error class, partial |
| latency and provider errors | retry behavior and cost control |
This is the part that makes tooling feel like infrastructure work, because it is.
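As a sketch, one audit record per tool call can be as plain as a dict you serialize into your log pipeline. The field names mirror the table above and are illustrative, not a standard:

```python
import json
import time

def tool_call_record(run_id, user_id, tenant_id, tool_name, tool_version,
                     arguments_redacted, idempotency_key, outcome, latency_ms):
    """Build one audit record per tool call; field names are illustrative."""
    return {
        "ts": time.time(),
        "run_id": run_id,
        "user_id": user_id,
        "tenant_id": tenant_id,
        "tool_name": tool_name,
        "tool_version": tool_version,
        "arguments": arguments_redacted,  # redact before this point, not after
        "idempotency_key": idempotency_key,
        "outcome": outcome,  # e.g. "success", an error class, or "partial"
        "latency_ms": latency_ms,
    }
```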
## The uncomfortable truth: you cannot fully patch prompt injection in the model
I want to be careful here: we should absolutely keep improving model-level defenses. There are promising research directions like training models to resist injected instructions [6].
But you cannot wait for a perfect model before shipping safe tools. You have to assume:
- the model will misunderstand
- the model will be tricked
- the model will occasionally do the wrong thing with a valid-looking request
So you design the system to minimize harm when the model is wrong. That is what “safe tooling” means in practice.
## Opinion: safe tool calling is a tax, and it is worth paying
Tool calling makes demos impressive, and it makes incidents interesting. The safe, boring middle is where you want to live.
If you treat tools as public APIs, keep the model inside tight contracts, and rely on server-side validation and authorization, you can ship tool-using agents without holding your breath. You will still have failures, but you will have failures you can explain, debug, and recover from.
And if you are building in a litigious or regulated environment, or you want enterprise customers, this is not optional. It is the price of turning language into actions.