What to Log for LLM Apps Before You Need It
The first serious LLM incident I helped untangle did not fail because we logged too little. It failed because our logs did not join the right things. We had a request id in one place, a model call in another, and a tool write that looked successful but had no obvious parent. The approval that should have blocked the write was in a separate audit table. Nothing was missing in volume. The timeline was missing in structure.
That is the problem OpenTelemetry is built to solve. A trace is the path of a request through your application, and span events can act like structured log annotations on that path.[^1] OpenTelemetry also expects log records to carry trace context, including trace_id and span_id, so logs and traces can be correlated instead of stitched together by hand later.[^2][^3]
For LLM apps, that means the unit of observability is not “the prompt”. It is the request timeline.
TL;DR
- Log the whole request path under one trace id.
- Record versioned context, not just raw prompt text.
- Treat tool calls as side effects with arguments, approvals, and external ids.
- Keep redaction in the hot path and raw content in tighter stores.
- If a write happened, you need to know who authorized it and what exact payload was reviewed.
The trace is the product
If you only have one observability primitive, make it the trace.
OpenTelemetry traces exist to describe the path a request takes across a system, and span events are the right place to attach meaningful points in time along that path.[^1] In an LLM app, those points are usually:
- the request arrived
- the context bundle was assembled
- the model started
- a tool was called
- a tool returned
- a human approved or rejected something
- the response went out
That sequence is what you need when users ask, “What did the assistant see?” or “Why did it do that?” The answer is rarely in one log line. It is in the joins.
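The timeline above can be sketched with a toy in-memory recorder. This is not the OpenTelemetry SDK, just a minimal stand-in that shows the shape: one trace id, an ordered sequence of named events, each carrying attributes. The event names mirror the list above; the attribute values are made up for illustration.

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class SpanEvent:
    name: str        # e.g. "tool.called"
    attributes: dict
    ts: float = field(default_factory=time.monotonic)

class RequestTrace:
    """Toy in-memory stand-in for an OTel trace: one trace id, ordered span events."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.events: list[SpanEvent] = []

    def add_event(self, name: str, **attributes) -> None:
        self.events.append(SpanEvent(name, attributes))

    def timeline(self) -> list[str]:
        return [e.name for e in self.events]

# One request, one trace id, every milestone joined to it.
t = RequestTrace()
t.add_event("request.received", actor_id="u_42", tenant_id="acme")
t.add_event("context.assembled", prompt_version="v12")
t.add_event("model.started", model="example-model")  # hypothetical model name
t.add_event("tool.called", tool="crm.update", args_hash="9f2c")
t.add_event("tool.completed", status="ok", external_id="crm_881")
t.add_event("response.sent", status="ok")
```

In a real system the SDK assigns trace and span ids and handles export; the point here is that every event joins back to the same trace id, which is what makes the answer live "in the joins".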
What to capture, by event
The shape below is small enough to ship and rich enough to debug.
| Event | Minimum fields | Why it matters |
|---|---|---|
| request.received | trace_id, span_id, actor id, tenant id, route, session id, request source | Anchors the request and the user boundary |
| context.assembled | prompt version, policy version, tool schema version, retrieval bundle hash, memory version | Reconstructs what the model actually saw |
| model.started | model name, route, attempt, temperature, max tokens, cache status | Ties quality and cost to a specific invocation |
| model.completed | input tokens, output tokens, latency, finish reason, retry count | Explains spend, latency, and truncation |
| tool.called | tool name, redacted args, args hash, idempotency key, authorization scope | Shows the exact action surface |
| tool.completed | status, external object id, partial failure, response summary, duration | Reconstructs side effects and partial success |
| approval.requested | reviewed payload hash, target object id, requested action, expiry | Makes human oversight auditable |
| approval.resolved | approver id, decision, timestamp, scope, reason code | Proves the gate actually happened |
| response.sent | user-visible outcome, status, final trace reference | Closes the loop for product and support |
The important part is not the names. It is the fact that these are stable enough to trend and specific enough to replay.
Model calls: store versions, usage, and finish state
For a model call, the useful fields are boring:
- model name
- prompt or context version
- retrieval bundle id or hash
- input and output token counts
- latency
- finish reason
- retry count
- cache hit status
That is enough to answer three questions quickly:
- Did the call fail because the prompt changed?
- Did the call get expensive because the model was chatty or because retries spiked?
- Did the call stop because it hit a token limit, a timeout, or a policy boundary?
If you need the raw prompt body for debugging or evals, keep it in a separate store with tighter access and retention. Hot-path logs should point to the version, not duplicate the whole blob.
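Those three questions can be answered mechanically once the fields exist. A sketch, with field names taken from the list above; the bucket names and thresholds are illustrative, not a standard taxonomy:

```python
def diagnose(rec: dict) -> str:
    """Map a model.completed record to a first-pass root cause bucket.

    Field names mirror the logged fields; thresholds are illustrative.
    """
    if rec.get("finish_reason") == "length":
        return "token_limit"        # output was truncated at a token limit
    if rec.get("finish_reason") == "timeout":
        return "timeout"
    if rec.get("retry_count", 0) >= 2:
        return "retry_spike"        # spend went up because of retries
    if rec.get("output_tokens", 0) > 4 * rec.get("input_tokens", 1):
        return "chatty_model"       # spend went up because of verbosity
    return "ok"
```

Run over a day of model.completed events, a counter of these buckets is already a useful dashboard.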
Tool calls: log the action surface, not just success
Tool calls are where LLM apps stop being “just text generation” and start being systems with side effects.
OWASP’s LLM Top 10 calls out prompt injection, insecure output handling, insecure plugin design, and excessive agency for a reason: once arbitrary text can influence a downstream action, the failure stops being cosmetic.[^4] OpenAI’s agent safety guidance says the same thing more directly: risk rises when arbitrary text influences tool calls, and tool approvals should stay on for MCP tools.[^5]
So the trace for a tool call needs to answer:
- what tool was selected
- what arguments were passed
- whether the args were validated
- which principal or scope authorized the call
- whether the tool wrote state
- what external id came back
- whether the call was retried with the same idempotency key
If the tool writes to an external system, the trace should let you answer whether the same operation would replay safely. If it cannot, that is a design problem, not just a logging problem.
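The args hash and idempotency key mentioned above can be derived rather than invented per call. A sketch: canonicalize the arguments before hashing so that the same logical call always produces the same key, regardless of key order. The key format is an assumption, not a standard.

```python
import hashlib
import json

def args_hash(args: dict) -> str:
    """Hash canonicalized args so the same logical call always hashes the same."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

def idempotency_key(trace_id: str, tool: str, args: dict) -> str:
    """One key per (request, tool, logical args): retries reuse it, new args do not."""
    return f"{trace_id}:{tool}:{args_hash(args)}"

# Same args in a different order produce the same key, so a retry replays safely.
k1 = idempotency_key("t1", "crm.update", {"id": 7, "status": "won"})
k2 = idempotency_key("t1", "crm.update", {"status": "won", "id": 7})
assert k1 == k2
```

Logging the hash instead of the raw args also keeps sensitive values out of the hot path while still letting you prove two calls were (or were not) the same operation.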
Retrieval: log the evidence set, not the whole corpus
If you use search or RAG, log the retrieval decision separately from the final answer.
Capture:
- query text or query hash
- tenant, locale, and permission filters
- top candidate document ids
- reranker version and score snapshot
- final selected chunk ids
- source freshness or revision ids when available
This is the only way to distinguish “the model reasoned badly” from “the right evidence never made it into context”.
OpenTelemetry’s log and trace correlation model helps here because the retrieval step can live in the same trace as the model call, while keeping its own structured fields and timing.[^2]
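With candidates and selections logged separately, the "bad reasoning vs missing evidence" question becomes a lookup. A sketch, with hypothetical event and bucket names:

```python
def retrieval_event(query_hash: str, candidate_ids: list,
                    selected_ids: list, reranker_version: str) -> dict:
    """Build a retrieval event; selected chunks must come from the candidate set."""
    assert set(selected_ids) <= set(candidate_ids)
    return {
        "event_type": "retrieval.completed",
        "query_hash": query_hash,
        "candidate_ids": candidate_ids,
        "selected_ids": selected_ids,
        "reranker_version": reranker_version,
    }

def where_did_evidence_go(event: dict, doc_id: str) -> str:
    """Separate 'never retrieved' from 'retrieved but dropped before context'."""
    if doc_id in event["selected_ids"]:
        return "in_context"
    if doc_id in event["candidate_ids"]:
        return "dropped_by_reranker"
    return "never_retrieved"

ev = retrieval_event("q9f2", ["doc1", "doc2", "doc3"], ["doc2"], "rerank-v3")
```

If the document the answer needed comes back "never_retrieved", the fix is in search; if "dropped_by_reranker", it is in ranking; only "in_context" puts the failure on the model.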
Approvals should be first-class events
If a human had to approve a side effect, that approval is part of the system state.
Log:
- who approved
- what action was approved
- which object or account was in scope
- what payload or diff was reviewed
- when the approval expires
- whether the eventual write matched the approved payload
That last field matters. If the final action diverged from the reviewed action, the approval is not a control. It is a screenshot.
NIST’s AI RMF explicitly treats human oversight as something to define, assess, and document, not something to leave implicit.[^6]
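Checking that the eventual write matched the approved payload only needs a canonical hash computed on both sides of the gate. A sketch under that assumption; the payload fields are made up:

```python
import hashlib
import json

def payload_hash(payload: dict) -> str:
    """Canonicalize, then hash: the approval stores this, the write recomputes it."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def write_matches_approval(approved_hash: str, executed_payload: dict) -> bool:
    """Log this on tool.completed: did the write match what the human reviewed?"""
    return payload_hash(executed_payload) == approved_hash

reviewed = {"account": "acme", "action": "refund", "amount": 120}
approved = payload_hash(reviewed)          # stored on approval.resolved
assert write_matches_approval(approved, reviewed)
assert not write_matches_approval(approved, {**reviewed, "amount": 999})
```

If that final boolean is false, the log has caught exactly the "approval as screenshot" failure: the human reviewed one payload and the system wrote another.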
Redaction needs to be designed, not improvised
Logging everything is lazy. Logging nothing is also lazy.
What should stay out of the hot path:
- access tokens and secrets
- raw customer data when a hash or id would do
- full retrieved documents unless you have a clear retention and access story
- full prompt bodies in the default app log stream
What should stay in:
- trace ids
- actor and tenant ids
- hashes of sensitive payloads
- version ids for prompts, tools, policies, and retrieval bundles
- short summaries of tool and approval outcomes
OpenTelemetry’s logging spec is explicit that trace context belongs in logs when possible, which is a good reminder that correlation is a first-class concern, not an afterthought.[^2]
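A minimal redaction pass over an event record might look like this. The deny-list of sensitive keys is illustrative; a real one comes from your data classification, and hashing here preserves joinability ("same value appeared twice") without keeping the value itself.

```python
import hashlib

SENSITIVE_KEYS = {"access_token", "email", "document_body"}  # illustrative deny-list

def redact(record: dict) -> dict:
    """Replace sensitive values with short hashes; keep ids and versions as-is."""
    out = {}
    for key, value in record.items():
        if key in SENSITIVE_KEYS:
            digest = hashlib.sha256(str(value).encode()).hexdigest()[:12]
            out[key + "_hash"] = digest
        else:
            out[key] = value
    return out

event = {"trace_id": "t1", "tenant_id": "acme", "email": "a@example.com"}
safe = redact(event)
```

Running this in the hot path, before the record ever reaches an exporter, is what keeps the raw content confined to the tighter store.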
A minimum viable schema
If I were instrumenting a new LLM app this week, every event would include:
- timestamp
- trace id
- span id
- actor id
- tenant id
- route or feature
- event type
- status
- latency if applicable
- token usage if applicable
- version metadata for prompt, tool, policy, or retrieval bundle
That schema is enough to power alerts, debug support tickets, and build a cost dashboard that means something.
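The schema above is small enough to enforce at emit time. A sketch: a required-field check for the unconditional fields, with latency, tokens, and versions treated as conditional. The field names follow the list above.

```python
# Unconditional fields from the minimum viable schema; latency, token usage,
# and version metadata are conditional ("if applicable") and not enforced here.
REQUIRED = {"timestamp", "trace_id", "span_id", "actor_id",
            "tenant_id", "route", "event_type", "status"}

def missing_fields(event: dict) -> list[str]:
    """Return the required fields this event forgot, sorted for stable alerts."""
    return sorted(REQUIRED - event.keys())
```

Wiring this into the logging pipeline as a warning (not a hard failure) is a cheap way to keep the schema honest as the app grows.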
What good logging changes
The best side effect of good logging is that teams stop negotiating reality.
Once traces show prompt version, tool execution, and approval history in the same timeline, you can have a serious conversation about whether a failure came from the model, the tool contract, the policy layer, or the UI. That is the difference between “we think it happened” and “we can show it happened.”
If your logs let you answer that quickly, you are already ahead of most LLM products.
Footnotes

[^1]: OpenTelemetry traces: https://opentelemetry.io/docs/concepts/signals/traces/
[^2]: OpenTelemetry logging spec: https://opentelemetry.io/docs/specs/otel/logs/
[^3]: OpenTelemetry context propagation: https://opentelemetry.io/docs/concepts/context-propagation/
[^4]: OWASP Top 10 for Large Language Model Applications: https://owasp.org/www-project-top-10-for-large-language-model-applications/
[^5]: OpenAI, “Safety in building agents”: https://platform.openai.com/docs/guides/agent-builder-safety
[^6]: NIST AI RMF Core: https://airc.nist.gov/airmf-resources/airmf/5-sec-core/