What to Log for LLM Apps Before You Need It

The first serious LLM incident I helped untangle did not fail because we logged too little. It failed because our logs did not join the right things. We had a request id in one place, a model call in another, and a tool write that looked successful but had no obvious parent. The approval that should have blocked the write was in a separate audit table. Nothing was missing in volume. The timeline was missing in structure.

That is the problem OpenTelemetry is built to solve. A trace is the path of a request through your application, and span events can act like structured log annotations on that path. [1] OpenTelemetry also expects log records to carry trace context, including trace_id and span_id, so logs and traces can be correlated instead of stitched together by hand later. [2][3]

For LLM apps, that means the unit of observability is not “the prompt”. It is the request timeline.

TL;DR

  • Log the whole request path under one trace id.
  • Record versioned context, not just raw prompt text.
  • Treat tool calls as side effects with arguments, approvals, and external ids.
  • Keep redaction in the hot path and raw content in tighter stores.
  • If a write happened, you need to know who authorized it and what exact payload was reviewed.

The trace is the product

If you only have one observability primitive, make it the trace.

OpenTelemetry traces exist to describe the path a request takes across a system, and span events are the right place to attach meaningful points in time along that path. [1] In an LLM app, those points are usually:

  • the request arrived
  • the context bundle was assembled
  • the model started
  • a tool was called
  • a tool returned
  • a human approved or rejected something
  • the response went out

That sequence is what you need when users ask, “What did the assistant see?” or “Why did it do that?” The answer is rarely in one log line. It is in the joins.
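The timeline above can be made concrete with a few lines of code. This is a minimal sketch using a plain in-process event list, not a real OpenTelemetry exporter; the `log_event` helper and the `EVENTS` list are illustrative names, not part of any SDK:

```python
import time
import uuid

EVENTS = []  # stand-in for a real log exporter or span-event sink

def log_event(trace_id: str, span_id: str, event_type: str, **fields) -> dict:
    """Emit one structured event that carries trace context."""
    record = {
        "timestamp": time.time(),
        "trace_id": trace_id,
        "span_id": span_id,
        "event": event_type,
        **fields,
    }
    EVENTS.append(record)
    return record

# One request, one trace id, every step joined to it.
trace_id = uuid.uuid4().hex
span_id = uuid.uuid4().hex[:16]

log_event(trace_id, span_id, "request.received", actor_id="u_123", route="/assist")
log_event(trace_id, span_id, "context.assembled", prompt_version="v14")
log_event(trace_id, span_id, "model.started", model="gpt-x")
log_event(trace_id, span_id, "tool.called", tool="crm.update")
log_event(trace_id, span_id, "tool.completed", status="ok")
log_event(trace_id, span_id, "response.sent", status="ok")

# The full timeline for this request is now a single filter, not a join hunt.
timeline = [e["event"] for e in EVENTS if e["trace_id"] == trace_id]
```

The point of the sketch is the shape, not the storage: every record carries the same trace_id, so the "what did the assistant see" question becomes a filter instead of a reconstruction.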

What to capture, by event

The shape below is small enough to ship and rich enough to debug.

| Event | Minimum fields | Why it matters |
| --- | --- | --- |
| `request.received` | trace_id, span_id, actor id, tenant id, route, session id, request source | Anchors the request and the user boundary |
| `context.assembled` | prompt version, policy version, tool schema version, retrieval bundle hash, memory version | Reconstructs what the model actually saw |
| `model.started` | model name, route, attempt, temperature, max tokens, cache status | Ties quality and cost to a specific invocation |
| `model.completed` | input tokens, output tokens, latency, finish reason, retry count | Explains spend, latency, and truncation |
| `tool.called` | tool name, redacted args, args hash, idempotency key, authorization scope | Shows the exact action surface |
| `tool.completed` | status, external object id, partial failure, response summary, duration | Reconstructs side effects and partial success |
| `approval.requested` | reviewed payload hash, target object id, requested action, expiry | Makes human oversight auditable |
| `approval.resolved` | approver id, decision, timestamp, scope, reason code | Proves the gate actually happened |
| `response.sent` | user-visible outcome, status, final trace reference | Closes the loop for product and support |

The important part is not the names. It is the fact that these are stable enough to trend and specific enough to replay.

Model calls: store versions, usage, and finish state

For a model call, the useful fields are boring:

  • model name
  • prompt or context version
  • retrieval bundle id or hash
  • input and output token counts
  • latency
  • finish reason
  • retry count
  • cache hit status

That is enough to answer three questions quickly:

  1. Did the call fail because the prompt changed?
  2. Did the call get expensive because the model was chatty or because retries spiked?
  3. Did the call stop because it hit a token limit, a timeout, or a policy boundary?

If you need the raw prompt body for debugging or evals, keep it in a separate store with tighter access and retention. Hot-path logs should point to the version, not duplicate the whole blob.
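As a sketch, a `model.completed` event can reference the prompt by version and content hash instead of embedding the body. The field names below follow the table above but are illustrative, not a standard:

```python
import hashlib

def model_completed_event(prompt_body: str, prompt_version: str,
                          input_tokens: int, output_tokens: int,
                          latency_ms: int, finish_reason: str,
                          retry_count: int = 0, cache_hit: bool = False) -> dict:
    """Log usage and finish state; point at the prompt, don't duplicate it."""
    return {
        "event": "model.completed",
        "prompt_version": prompt_version,
        # The hash proves which body was used without logging the blob.
        "prompt_sha256": hashlib.sha256(prompt_body.encode()).hexdigest(),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
        "finish_reason": finish_reason,
        "retry_count": retry_count,
        "cache_hit": cache_hit,
    }

evt = model_completed_event(
    "You are a helpful assistant for billing questions...",
    prompt_version="prompt_v14",
    input_tokens=812, output_tokens=240,
    latency_ms=1430, finish_reason="stop",
)
```

The raw body lives in the tighter store; the hot-path log keeps the version id and a hash that can be checked against it later.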

Tool calls: log the action surface, not just success

Tool calls are where LLM apps stop being “just text generation” and start being systems with side effects.

OWASP’s LLM Top 10 calls out prompt injection, insecure output handling, insecure plugin design, and excessive agency for a reason: once arbitrary text can influence a downstream action, the failure stops being cosmetic. [4] OpenAI’s agent safety guidance says the same thing more directly: risk rises when arbitrary text influences tool calls, and tool approvals should stay on for MCP tools. [5]

So the trace for a tool call needs to answer:

  • what tool was selected
  • what arguments were passed
  • whether the args were validated
  • which principal or scope authorized the call
  • whether the tool wrote state
  • what external id came back
  • whether the call was retried with the same idempotency key

If the tool writes to an external system, the trace should let you answer whether the same operation would replay safely. If it cannot, that is a design problem, not just a logging problem.
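One way to make the record replay-safe is to hash canonical JSON of the arguments and derive an idempotency key from it. This is a stdlib sketch; the key convention here is an assumption, not a standard:

```python
import hashlib
import json

def tool_call_event(tool_name: str, args: dict, scope: str) -> dict:
    """Log the action surface: redacted args, a stable hash, an idempotency key."""
    # Canonical JSON so identical arguments always hash the same.
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    args_hash = hashlib.sha256(canonical.encode()).hexdigest()
    return {
        "event": "tool.called",
        "tool": tool_name,
        # Redact values; keep keys so the shape of the call stays visible.
        "args_redacted": {k: "<redacted>" for k in args},
        "args_sha256": args_hash,
        # Deriving the key from the hash means a retry with identical
        # arguments is detectable, while a changed payload is not merged.
        "idempotency_key": f"{tool_name}:{args_hash[:16]}",
        "authorization_scope": scope,
    }

evt = tool_call_event("crm.update_contact",
                      {"contact_id": "c_42", "email": "a@b.co"},
                      scope="crm:write")
retry = tool_call_event("crm.update_contact",
                        {"email": "a@b.co", "contact_id": "c_42"},
                        scope="crm:write")
```

Because the hash is over sorted keys, the retry with reordered arguments produces the same idempotency key, which is exactly what a safe replay needs.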

Retrieval: log the evidence set, not the whole corpus

If you use search or RAG, log the retrieval decision separately from the final answer.

Capture:

  • query text or query hash
  • tenant, locale, and permission filters
  • top candidate document ids
  • reranker version and score snapshot
  • final selected chunk ids
  • source freshness or revision ids when available

This is the only way to distinguish “the model reasoned badly” from “the right evidence never made it into context”.
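A retrieval event with those fields is small. The sketch below hashes the query rather than logging it raw; the event shape and names are illustrative:

```python
import hashlib

def retrieval_event(query: str, filters: dict,
                    candidates: list,
                    selected_chunk_ids: list,
                    reranker_version: str) -> dict:
    """Log the evidence set: what was considered, what made it into context."""
    return {
        "event": "retrieval.completed",
        # Hash the query when the raw text is sensitive.
        "query_sha256": hashlib.sha256(query.encode()).hexdigest(),
        "filters": filters,                      # tenant, locale, permissions
        "candidate_ids": [doc_id for doc_id, _ in candidates],
        "score_snapshot": dict(candidates),      # reranker scores at decision time
        "selected_chunk_ids": selected_chunk_ids,
        "reranker_version": reranker_version,
    }

evt = retrieval_event(
    "refund policy for annual plans",
    filters={"tenant": "t_7", "locale": "en"},
    candidates=[("doc_19", 0.91), ("doc_4", 0.62), ("doc_88", 0.40)],
    selected_chunk_ids=["doc_19#c2", "doc_4#c0"],
    reranker_version="rr_2024_06",
)
```

With the candidate list and the selected chunk ids both present, "bad reasoning" and "bad evidence" become separable failure modes.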

OpenTelemetry’s log and trace correlation model helps here because the retrieval step can live in the same trace as the model call, while keeping its own structured fields and timing. [2]

Approvals should be first-class events

If a human had to approve a side effect, that approval is part of the system state.

Log:

  • who approved
  • what action was approved
  • which object or account was in scope
  • what payload or diff was reviewed
  • when the approval expires
  • whether the eventual write matched the approved payload

That last field matters. If the final action diverged from the reviewed action, the approval is not a control. It is a screenshot.
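That divergence check can be mechanical. A sketch, assuming the reviewed payload's hash was logged on `approval.resolved` (the helper names are illustrative):

```python
import hashlib
import json

def payload_hash(payload: dict) -> str:
    """Canonical hash so reviewed and executed payloads are comparable."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def write_matches_approval(approved_hash: str, executed_payload: dict) -> bool:
    """True only if the write is exactly what the human reviewed."""
    return payload_hash(executed_payload) == approved_hash

reviewed = {"contact_id": "c_42", "action": "set_email", "email": "a@b.co"}
approved_hash = payload_hash(reviewed)   # logged with approval.resolved

# The eventual write drifted: one field changed after review.
drifted = {**reviewed, "email": "evil@b.co"}
```

If `write_matches_approval` fails at execution time, the write should block, and the mismatch should be logged as its own event rather than silently reconciled.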

NIST’s AI RMF explicitly treats human oversight as something to define, assess, and document, not something to leave implicit. [6]

Redaction needs to be designed, not improvised

Logging everything is lazy. Logging nothing is also lazy.

What should stay out of the hot path:

  • access tokens and secrets
  • raw customer data when a hash or id would do
  • full retrieved documents unless you have a clear retention and access story
  • full prompt bodies in the default app log stream

What should stay in:

  • trace ids
  • actor and tenant ids
  • hashes of sensitive payloads
  • version ids for prompts, tools, policies, and retrieval bundles
  • short summaries of tool and approval outcomes

OpenTelemetry’s logging spec is explicit that trace context belongs in logs when possible, which is a good reminder that correlation is a first-class concern, not an afterthought. [2]
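Redaction holds up better when it is enforced at the logging boundary rather than left to each call site. A minimal sketch, assuming a fixed deny-list of sensitive keys (the list and helper name are illustrative):

```python
import hashlib

# Keys that must never reach the hot-path log stream verbatim.
SENSITIVE_KEYS = {"access_token", "api_key", "password", "email", "ssn"}

def redact_for_hot_path(record: dict) -> dict:
    """Replace sensitive values with hashes; keep ids and versions verbatim."""
    out = {}
    for key, value in record.items():
        if key in SENSITIVE_KEYS:
            digest = hashlib.sha256(str(value).encode()).hexdigest()
            out[f"{key}_sha256"] = digest   # still correlatable, not recoverable
        else:
            out[key] = value
    return out

raw = {
    "trace_id": "a1b2",
    "tenant_id": "t_7",
    "prompt_version": "v14",
    "access_token": "sk-live-abc123",
}
safe = redact_for_hot_path(raw)
```

The hashed fields can still be joined across events, which is usually all an investigation needs; the raw values stay in the tighter store.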

A minimum viable schema

If I were instrumenting a new LLM app this week, every event would include:

  • timestamp
  • trace id
  • span id
  • actor id
  • tenant id
  • route or feature
  • event type
  • status
  • latency if applicable
  • token usage if applicable
  • version metadata for prompt, tool, policy, or retrieval bundle

That schema is enough to power alerts, debug support tickets, and build a cost dashboard that means something.
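That baseline can be pinned down as a shared type so every emitter agrees on it. A sketch using a dataclass; the field names mirror the list above, and `version_meta` is an illustrative catch-all for prompt, tool, policy, and retrieval bundle versions:

```python
from dataclasses import dataclass, field, asdict
from typing import Optional

@dataclass
class LlmEvent:
    """Minimum viable event: enough to alert on, debug with, and price."""
    timestamp: float
    trace_id: str
    span_id: str
    actor_id: str
    tenant_id: str
    route: str
    event_type: str
    status: str
    latency_ms: Optional[int] = None
    input_tokens: Optional[int] = None
    output_tokens: Optional[int] = None
    version_meta: dict = field(default_factory=dict)

evt = LlmEvent(
    timestamp=1718000000.0,
    trace_id="a1b2c3",
    span_id="d4e5",
    actor_id="u_123",
    tenant_id="t_7",
    route="/assist",
    event_type="model.completed",
    status="ok",
    latency_ms=1430,
    input_tokens=812,
    output_tokens=240,
    version_meta={"prompt": "v14", "policy": "p3"},
)
row = asdict(evt)   # flat dict, ready for a log pipeline or warehouse insert
```

Keeping it one type means a cost dashboard, an alert rule, and a support lookup all read the same fields.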

What good logging changes

The best side effect of good logging is that teams stop negotiating reality.

Once traces show prompt version, tool execution, and approval history in the same timeline, you can have a serious conversation about whether a failure came from the model, the tool contract, the policy layer, or the UI. That is the difference between “we think it happened” and “we can show it happened.”

If your logs let you answer that quickly, you are already ahead of most LLM products.

Footnotes

  1. OpenTelemetry traces: https://opentelemetry.io/docs/concepts/signals/traces/

  2. OpenTelemetry logging spec: https://opentelemetry.io/docs/specs/otel/logs/

  3. OpenTelemetry context propagation: https://opentelemetry.io/docs/concepts/context-propagation/

  4. OWASP Top 10 for Large Language Model Applications: https://owasp.org/www-project-top-10-for-large-language-model-applications/

  5. OpenAI, “Safety in building agents”: https://platform.openai.com/docs/guides/agent-builder-safety

  6. NIST AI RMF Core: https://airc.nist.gov/airmf-resources/airmf/5-sec-core/