What to Log for LLM Apps Before You Need It
The first serious LLM incident I helped untangle did not fail because we logged too little. It failed because our logs did not join the right things. We had a request id in one place, a model call in another, and a tool write that looked successful but had no obvious parent. The approval that should have blocked the write was in a separate audit table. Nothing was missing in volume. The timeline was missing in structure.
That is the problem OpenTelemetry is built to solve. A trace is the path of a request through your application, and span events can act like structured log annotations on that path.[^1] OpenTelemetry also expects log records to carry trace context, including trace_id and span_id, so logs and traces can be correlated instead of stitched together by hand later.[^2][^3]
For LLM apps, that means the unit of observability is not “the prompt”. It is the request timeline.
TL;DR
- Log the whole request path under one trace id.
- Record versioned context, not just raw prompt text.
- Treat tool calls as side effects with arguments, approvals, and external ids.
- Keep redaction in the hot path and raw content in tighter stores.
- If a write happened, you need to know who authorized it and what exact payload was reviewed.
The trace is the product
If you only have one observability primitive, make it the trace.
OpenTelemetry traces exist to describe the path a request takes across a system, and span events are the right place to attach meaningful points in time along that path.[^1] In an LLM app, those points are usually:
- the request arrived
- the context bundle was assembled
- the model started
- a tool was called
- a tool returned
- a human approved or rejected something
- the response went out
That sequence is what you need when users ask, “What did the assistant see?” or “Why did it do that?” The answer is rarely in one log line. It is in the joins.
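The timeline above can be sketched with a toy in-memory recorder. This is not the OpenTelemetry SDK, just a minimal stand-in that shows the shape: one trace id, an ordered sequence of named events, each carrying attributes. The event names mirror the list above; the attribute values are made up for illustration.

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class SpanEvent:
    name: str        # e.g. "tool.called"
    attributes: dict
    ts: float = field(default_factory=time.monotonic)

class RequestTrace:
    """Toy in-memory stand-in for an OTel trace: one trace id, ordered span events."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.events: list[SpanEvent] = []

    def add_event(self, name: str, **attributes) -> None:
        self.events.append(SpanEvent(name, attributes))

    def timeline(self) -> list[str]:
        return [e.name for e in self.events]

# One request, one trace id, every milestone joined to it.
t = RequestTrace()
t.add_event("request.received", actor_id="u_42", tenant_id="acme")
t.add_event("context.assembled", prompt_version="v12")
t.add_event("model.started", model="example-model")  # hypothetical model name
t.add_event("tool.called", tool="crm.update", args_hash="9f2c")
t.add_event("tool.completed", status="ok", external_id="crm_881")
t.add_event("response.sent", status="ok")
```

In a real system the SDK assigns trace and span ids and handles export; the point here is that every event joins back to the same trace id, which is what makes the answer live "in the joins".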
What to capture, by event
The shape below is small enough to ship and rich enough to debug.
| Event | Minimum fields | Why it matters |
|---|---|---|
| request.received | trace_id, span_id, actor id, tenant id, route, session id, request source | Anchors the request and the user boundary |
| context.assembled | prompt version, policy version, tool schema version, retrieval bundle hash, memory version | Reconstructs what the model actually saw |
| model.started | model name, route, attempt, temperature, max tokens, cache status | Ties quality and cost to a specific invocation |
| model.completed | input tokens, output tokens, latency, finish reason, retry count | Explains spend, latency, and truncation |
| tool.called | tool name, redacted args, args hash, idempotency key, authorization scope | Shows the exact action surface |
| tool.completed | status, external object id, partial failure, response summary, duration | Reconstructs side effects and partial success |
| approval.requested | reviewed payload hash, target object id, requested action, expiry | Makes human oversight auditable |
| approval.resolved | approver id, decision, timestamp, scope, reason code | Proves the gate actually happened |
| response.sent | user-visible outcome, status, final trace reference | Closes the loop for product and support |
The important part is not the names. It is the fact that these are stable enough to trend and specific enough to replay.
Model calls: store versions, usage, and finish state
For a model call, the useful fields are boring:
- model name
- prompt or context version
- retrieval bundle id or hash
- input and output token counts
- latency
- finish reason
- retry count
- cache hit status
That is enough to answer three questions quickly:
- Did the call fail because the prompt changed?
- Did the call get expensive because the model was chatty or because retries spiked?
- Did the call stop because it hit a token limit, a timeout, or a policy boundary?
If you need the raw prompt body for debugging or evals, keep it in a separate store with tighter access and retention. Hot-path logs should point to the version, not duplicate the whole blob.
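Those three questions can be answered mechanically once the fields exist. A sketch, with field names taken from the list above; the bucket names and thresholds are illustrative, not a standard taxonomy:

```python
def diagnose(rec: dict) -> str:
    """Map a model.completed record to a first-pass root cause bucket.

    Field names mirror the logged fields; thresholds are illustrative.
    """
    if rec.get("finish_reason") == "length":
        return "token_limit"        # output was truncated at a token limit
    if rec.get("finish_reason") == "timeout":
        return "timeout"
    if rec.get("retry_count", 0) >= 2:
        return "retry_spike"        # spend went up because of retries
    if rec.get("output_tokens", 0) > 4 * rec.get("input_tokens", 1):
        return "chatty_model"       # spend went up because of verbosity
    return "ok"
```

Run over a day of model.completed events, a counter of these buckets is already a useful dashboard.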
Tool calls: log the action surface, not just success
Tool calls are where LLM apps stop being “just text generation” and start being systems with side effects.
OWASP’s LLM Top 10 calls out prompt injection, insecure output handling, insecure plugin design, and excessive agency for a reason: once arbitrary text can influence a downstream action, the failure stops being cosmetic.[^4] OpenAI’s agent safety guidance says the same thing more directly: risk rises when arbitrary text influences tool calls, and tool approvals should stay on for MCP tools.[^5]
So the trace for a tool call needs to answer:
- what tool was selected
- what arguments were passed
- whether the args were validated
- which principal or scope authorized the call
- whether the tool wrote state
- what external id came back
- whether the call was retried with the same idempotency key
If the tool writes to an external system, the trace should let you answer whether the same operation would replay safely. If it cannot, that is a design problem, not just a logging problem.
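The args hash and idempotency key mentioned above can be derived rather than invented per call. A sketch: canonicalize the arguments before hashing so that the same logical call always produces the same key, regardless of key order. The key format is an assumption, not a standard.

```python
import hashlib
import json

def args_hash(args: dict) -> str:
    """Hash canonicalized args so the same logical call always hashes the same."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

def idempotency_key(trace_id: str, tool: str, args: dict) -> str:
    """One key per (request, tool, logical args): retries reuse it, new args do not."""
    return f"{trace_id}:{tool}:{args_hash(args)}"

# Same args in a different order produce the same key, so a retry replays safely.
k1 = idempotency_key("t1", "crm.update", {"id": 7, "status": "won"})
k2 = idempotency_key("t1", "crm.update", {"status": "won", "id": 7})
assert k1 == k2
```

Logging the hash instead of the raw args also keeps sensitive values out of the hot path while still letting you prove two calls were (or were not) the same operation.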
Retrieval: log the evidence set, not the whole corpus
If you use search or RAG, log the retrieval decision separately from the final answer.
Capture:
- query text or query hash
- tenant, locale, and permission filters
- top candidate document ids
- reranker version and score snapshot
- final selected chunk ids
- source freshness or revision ids when available
This is the only way to distinguish “the model reasoned badly” from “the right evidence never made it into context”.
OpenTelemetry’s log and trace correlation model helps here because the retrieval step can live in the same trace as the model call, while keeping its own structured fields and timing.[^2]
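With candidates and selections logged separately, the "bad reasoning vs missing evidence" question becomes a lookup. A sketch, with hypothetical event and bucket names:

```python
def retrieval_event(query_hash: str, candidate_ids: list,
                    selected_ids: list, reranker_version: str) -> dict:
    """Build a retrieval event; selected chunks must come from the candidate set."""
    assert set(selected_ids) <= set(candidate_ids)
    return {
        "event_type": "retrieval.completed",
        "query_hash": query_hash,
        "candidate_ids": candidate_ids,
        "selected_ids": selected_ids,
        "reranker_version": reranker_version,
    }

def where_did_evidence_go(event: dict, doc_id: str) -> str:
    """Separate 'never retrieved' from 'retrieved but dropped before context'."""
    if doc_id in event["selected_ids"]:
        return "in_context"
    if doc_id in event["candidate_ids"]:
        return "dropped_by_reranker"
    return "never_retrieved"

ev = retrieval_event("q9f2", ["doc1", "doc2", "doc3"], ["doc2"], "rerank-v3")
```

If the document the answer needed comes back "never_retrieved", the fix is in search; if "dropped_by_reranker", it is in ranking; only "in_context" puts the failure on the model.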
Approvals should be first-class events
If a human had to approve a side effect, that approval is part of the system state.
Log:
- who approved
- what action was approved
- which object or account was in scope
- what payload or diff was reviewed
- when the approval expires
- whether the eventual write matched the approved payload
That last field matters. If the final action diverged from the reviewed action, the approval is not a control. It is a screenshot.
NIST’s AI RMF explicitly treats human oversight as something to define, assess, and document, not something to leave implicit.[^6]
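Checking that the eventual write matched the approved payload only needs a canonical hash computed on both sides of the gate. A sketch under that assumption; the payload fields are made up:

```python
import hashlib
import json

def payload_hash(payload: dict) -> str:
    """Canonicalize, then hash: the approval stores this, the write recomputes it."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def write_matches_approval(approved_hash: str, executed_payload: dict) -> bool:
    """Log this on tool.completed: did the write match what the human reviewed?"""
    return payload_hash(executed_payload) == approved_hash

reviewed = {"account": "acme", "action": "refund", "amount": 120}
approved = payload_hash(reviewed)          # stored on approval.resolved
assert write_matches_approval(approved, reviewed)
assert not write_matches_approval(approved, {**reviewed, "amount": 999})
```

If that final boolean is false, the log has caught exactly the "approval as screenshot" failure: the human reviewed one payload and the system wrote another.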
Redaction needs to be designed, not improvised
Logging everything is lazy. Logging nothing is also lazy.
What should stay out of the hot path:
- access tokens and secrets
- raw customer data when a hash or id would do
- full retrieved documents unless you have a clear retention and access story
- full prompt bodies in the default app log stream
What should stay in:
- trace ids
- actor and tenant ids
- hashes of sensitive payloads
- version ids for prompts, tools, policies, and retrieval bundles
- short summaries of tool and approval outcomes
OpenTelemetry’s logging spec is explicit that trace context belongs in logs when possible, which is a good reminder that correlation is a first-class concern, not an afterthought.[^2]
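A minimal redaction pass over an event record might look like this. The deny-list of sensitive keys is illustrative; a real one comes from your data classification, and hashing here preserves joinability ("same value appeared twice") without keeping the value itself.

```python
import hashlib

SENSITIVE_KEYS = {"access_token", "email", "document_body"}  # illustrative deny-list

def redact(record: dict) -> dict:
    """Replace sensitive values with short hashes; keep ids and versions as-is."""
    out = {}
    for key, value in record.items():
        if key in SENSITIVE_KEYS:
            digest = hashlib.sha256(str(value).encode()).hexdigest()[:12]
            out[key + "_hash"] = digest
        else:
            out[key] = value
    return out

event = {"trace_id": "t1", "tenant_id": "acme", "email": "a@example.com"}
safe = redact(event)
```

Running this in the hot path, before the record ever reaches an exporter, is what keeps the raw content confined to the tighter store.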
A minimum viable schema
If I were instrumenting a new LLM app this week, every event would include:
- timestamp
- trace id
- span id
- actor id
- tenant id
- route or feature
- event type
- status
- latency if applicable
- token usage if applicable
- version metadata for prompt, tool, policy, or retrieval bundle
That schema is enough to power alerts, debug support tickets, and build a cost dashboard that means something.
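The schema above is small enough to enforce at emit time. A sketch: a required-field check for the unconditional fields, with latency, tokens, and versions treated as conditional. The field names follow the list above.

```python
# Unconditional fields from the minimum viable schema; latency, token usage,
# and version metadata are conditional ("if applicable") and not enforced here.
REQUIRED = {"timestamp", "trace_id", "span_id", "actor_id",
            "tenant_id", "route", "event_type", "status"}

def missing_fields(event: dict) -> list[str]:
    """Return the required fields this event forgot, sorted for stable alerts."""
    return sorted(REQUIRED - event.keys())
```

Wiring this into the logging pipeline as a warning (not a hard failure) is a cheap way to keep the schema honest as the app grows.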
What good logging changes
The best side effect of good logging is that teams stop negotiating reality.
Once traces show prompt version, tool execution, and approval history in the same timeline, you can have a serious conversation about whether a failure came from the model, the tool contract, the policy layer, or the UI. That is the difference between “we think it happened” and “we can show it happened.”
If your logs let you answer that quickly, you are already ahead of most LLM products.
Footnotes

[^1]: OpenTelemetry traces: https://opentelemetry.io/docs/concepts/signals/traces/
[^2]: OpenTelemetry logging spec: https://opentelemetry.io/docs/specs/otel/logs/
[^3]: OpenTelemetry context propagation: https://opentelemetry.io/docs/concepts/context-propagation/
[^4]: OWASP Top 10 for Large Language Model Applications: https://owasp.org/www-project-top-10-for-large-language-model-applications/
[^5]: OpenAI, “Safety in building agents”: https://platform.openai.com/docs/guides/agent-builder-safety
[^6]: NIST AI RMF Core: https://airc.nist.gov/airmf-resources/airmf/5-sec-core/