Audit Trails for AI Decisions: What to Log, What to Skip
The seven fields every AI audit log needs. Most teams log too much of what doesn't matter and not enough of what does.
TL;DR
Log these seven things for every AI decision in production:
- Input — what the agent saw (prompt + context).
- Output — what the agent said or did.
- Tool calls — every external action taken, with arguments and results.
- Model identity — which model and version produced the output.
- Decision metadata — confidence, alternatives considered, guardrails fired.
- Identity — which user, customer, or process triggered the agent.
- Timestamp — when, in a queryable format.
Skip: full token logs (too noisy), every retrieval (too expensive), debug-level traces (logs, not records).
The audit trail conversation usually starts after a regulator, a lawyer, or a customer asks a question that requires reconstructing what happened. By then it’s late — the answer depends on what was logged at the time. This piece is the spec for what a usable AI audit trail looks like, written before the question gets asked.
Why most AI logs are unusable
Three failure modes I see often.
1. Too much volume. Logs that capture every token and every retrieval at debug level produce gigabytes per agent per day. The volume makes the logs impossible to query in a reasonable time. The audit question goes unanswered because the data is too dense to use.
2. Too little structure. Free-text logs that aren’t parseable into records. The audit question becomes a grep exercise across files. Doesn’t scale; doesn’t give defensible answers.
3. Wrong abstractions. Logs designed for engineering debugging rather than for audit. Engineering logs care about latency and errors; audit logs care about decisions and actions. Different purpose, different shape.
The audit trail needs to be designed as a record system, not as a debugging output.
The seven fields
1. Input
What the agent saw to make its decision. The prompt as sent to the model, the retrieved context, the conversation history.
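For privacy-sensitive cases: redact or tokenize sensitive values before logging, and store a hash of the original so the record can later be matched against the source. Log enough that you can still reconstruct what the agent saw at decision time.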
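One way to reconcile redaction with reconstruction is to log a redacted copy next to a digest of the original text. The sketch below is illustrative only: the `EMAIL_RE` pattern and the `redact_input` helper are hypothetical names, and it assumes the raw input is kept in a separate, access-controlled store so the digest has something to be matched against.

```python
import hashlib
import re

# Hypothetical pattern for illustration: catches simple email addresses only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_input(raw_prompt: str) -> dict:
    """Return a loggable view of the input: redacted text plus a digest of the original."""
    redacted = EMAIL_RE.sub("[EMAIL]", raw_prompt)
    digest = hashlib.sha256(raw_prompt.encode("utf-8")).hexdigest()
    return {"input_redacted": redacted, "input_sha256": digest}


record = redact_input("Refund order 1182 for jane.doe@example.com")
```

A plain hash of short or predictable input can be guessed by brute force, so a keyed digest (for example via Python's `hmac` module) may be a better fit where that risk matters.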
2. Output
What the agent said or did. The model’s response, the action it executed, the message it returned to the user.
For multi-step agents: log each step’s output, not just the final one.
3. Tool calls
Every external action the agent took. Tool name, arguments passed, result returned. This is often the most important field for autonomous agents — the tool calls are how the agent affects the world.
For each tool call: tool ID, parameters, response, success/failure.
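A minimal sketch of capturing those four items, assuming tools are plain Python callables. The `call_tool_logged` wrapper and the `issue_refund` tool are hypothetical names invented for this example, not part of any agent framework.

```python
from datetime import datetime, timezone
from typing import Any, Callable


def call_tool_logged(tool_id: str, fn: Callable[..., Any], **params: Any) -> dict:
    """Run a tool and return an audit entry with ID, parameters, response, and status."""
    entry = {
        "tool_id": tool_id,
        "parameters": params,
        "started_at": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
    }
    try:
        entry["response"] = fn(**params)
        entry["success"] = True
    except Exception as exc:  # record the failure instead of losing it
        entry["response"] = repr(exc)
        entry["success"] = False
    return entry


# Hypothetical tool, for illustration only.
def issue_refund(order_id: str, amount: float) -> dict:
    if amount <= 0:
        raise ValueError("amount must be positive")
    return {"order_id": order_id, "refunded": amount}


ok = call_tool_logged("issue_refund", issue_refund, order_id="1182", amount=25.0)
failed = call_tool_logged("issue_refund", issue_refund, order_id="1182", amount=-1.0)
```

The wrapper swallows the exception to keep the example short. In a real agent you would likely write the entry and then re-raise, so the failure is both recorded and handled.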
4. Model identity
Which model and which version produced the output. Specifically: provider, model name, model version, configuration parameters (temperature, max tokens, system prompt version).
This matters because behavior changes when models change. Without model identity, drift analysis becomes impossible.
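As a sketch, model identity can be a small immutable structure attached to every record. The field names follow the list above. The `prompt_version` helper is an assumption: it derives a version tag by hashing the system prompt, which is one option when prompts are not already versioned elsewhere. All values shown are placeholders.

```python
import hashlib
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModelIdentity:
    provider: str
    model_name: str
    model_version: str
    temperature: float
    max_tokens: int
    system_prompt_version: str


def prompt_version(system_prompt: str) -> str:
    """Derive a short, stable tag from the system prompt text (hypothetical scheme)."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:12]


identity = ModelIdentity(
    provider="example-provider",  # placeholder values throughout
    model_name="example-model",
    model_version="2024-01-01",
    temperature=0.2,
    max_tokens=1024,
    system_prompt_version=prompt_version("You are a refunds assistant."),
)
identity_fields = asdict(identity)  # plain dict, ready to embed in an audit record
```

With this in every record, a drift question such as "did behavior change when the prompt changed?" becomes a group-by on `system_prompt_version` rather than guesswork.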
5. Decision metadata
The structured information about how the decision was made. Confidence score (if the model produces one), alternatives considered (if you ran ranking), guardrails that fired (and what they did), retry counts.
This is what differentiates an audit log from a transaction log. The audit log captures the decision quality, not just the decision.
6. Identity
Who asked. Which user, which customer, which internal process triggered this agent invocation. With enough specificity to reconstruct the chain of authority.
For B2B contexts: customer org, user role, session ID. For internal use: employee ID, function, business context.
7. Timestamp
UTC, with millisecond precision, in a queryable format. Easy to overlook; expensive when wrong.
For multi-step agents: timestamp each step, not just the overall invocation.
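Put together, the seven fields could be expressed as one structured record. This is a sketch under assumptions rather than a standard schema: the `AuditRecord` class, the inner key names, and the extra `record_id` field are all invented for illustration, and the sample values are placeholders.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def utc_now_ms() -> str:
    """Current time in UTC as ISO 8601 with millisecond precision."""
    return datetime.now(timezone.utc).isoformat(timespec="milliseconds")


@dataclass
class AuditRecord:
    input: dict        # 1. what the agent saw
    output: dict       # 2. what the agent said or did
    tool_calls: list   # 3. external actions with arguments and results
    model: dict        # 4. provider, name, version, configuration
    decision: dict     # 5. confidence, alternatives, guardrails, retries
    identity: dict     # 6. who triggered the invocation
    timestamp: str = field(default_factory=utc_now_ms)  # 7. when
    record_id: str = field(default_factory=lambda: str(uuid.uuid4()))  # assumed extra

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


rec = AuditRecord(
    input={"prompt": "Refund order 1182?", "context_ids": ["doc-7"]},
    output={"text": "Refund approved."},
    tool_calls=[{"tool_id": "issue_refund", "success": True}],
    model={"provider": "example-provider", "model_name": "example-model"},
    decision={"confidence": None, "guardrails_fired": [], "retries": 0},
    identity={"customer_org": "acme", "user_role": "agent", "session_id": "s-42"},
)
line = rec.to_json()  # one JSON object per line is easy to ship and parse
```

For a multi-step agent, one such record per step (sharing a session ID in `identity`) keeps each step's output and timestamp separately addressable.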
What to skip
Full token logs. Recording every token the model produces is expensive and rarely needed. The output text is enough.
Every retrieval call. RAG agents make many retrievals per query. Logging each one separately produces volume without insight. Log the retrieval set as one record, not each retrieval as separate records.
Debug-level traces. These are engineering tools, not audit records. Different system; don’t conflate.
Sensitive data unredacted. PII, credentials, regulated data should be redacted or tokenized before logging. The audit log is now itself a security surface.
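For the retrieval point above, logging the set as one record might look like the following sketch. The `summarize_retrievals` function and the shape of each hit (`doc_id`, `score`, `text`) are assumptions for illustration. Only document IDs and scores are kept, and the retrieved text is left out of the log.

```python
def summarize_retrievals(query: str, hits: list) -> dict:
    """Collapse all retrievals for one query into a single audit entry."""
    return {
        "query": query,
        "retrieval_count": len(hits),
        "documents": [
            {"doc_id": h["doc_id"], "score": round(h["score"], 4)} for h in hits
        ],
    }


# Hypothetical retriever output.
hits = [
    {"doc_id": "policy-12", "score": 0.91234, "text": "..."},
    {"doc_id": "faq-3", "score": 0.80001, "text": "..."},
]
entry = summarize_retrievals("refund policy for damaged items", hits)
```

Keeping IDs rather than text assumes the underlying documents are versioned or retained elsewhere. If they can change or disappear, a content hash per document is a reasonable addition.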
Storage and retention
Three considerations.
1. Storage cost vs. retention period. Audit logs typically need 1–7 year retention depending on regulation and industry. Storage cost adds up. Use cold storage tiers for older records.
2. Queryability. The logs are useful only if you can answer questions from them. “What did agent X do for customer Y on date Z?” should be a 30-second query, not a multi-day investigation.
3. Tamper-resistance. For high-stakes use cases (regulated industries, legal hold), the logs need to be append-only with tamper detection. Most general-purpose databases don’t provide this by default; specialized audit log platforms or blockchain-style hash chains do.
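Queryability and tamper detection can be sketched together in a few dozen lines. The example below uses SQLite purely for illustration. The table layout, the `append_record`, `verify_chain`, and `records_for` helpers, and the sample rows are all assumptions, and a production system would sit on a store chosen with security and legal input.

```python
import hashlib
import json
import sqlite3

GENESIS = "0" * 64  # hash used as the predecessor of the first row

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE audit_records (
        seq         INTEGER PRIMARY KEY AUTOINCREMENT,
        agent_id    TEXT NOT NULL,
        customer_id TEXT NOT NULL,
        ts_utc      TEXT NOT NULL,  -- ISO 8601 UTC, so string order is time order
        payload     TEXT NOT NULL,  -- canonical JSON of the full record
        prev_hash   TEXT NOT NULL,
        hash        TEXT NOT NULL
    )
    """
)
conn.execute("CREATE INDEX idx_lookup ON audit_records (agent_id, customer_id, ts_utc)")


def _digest(prev_hash: str, payload: str) -> str:
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_record(agent_id: str, customer_id: str, ts_utc: str, record: dict) -> None:
    """Append a row whose hash covers its payload and the previous row's hash."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    row = conn.execute("SELECT hash FROM audit_records ORDER BY seq DESC LIMIT 1").fetchone()
    prev_hash = row[0] if row else GENESIS
    conn.execute(
        "INSERT INTO audit_records (agent_id, customer_id, ts_utc, payload, prev_hash, hash) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (agent_id, customer_id, ts_utc, payload, prev_hash, _digest(prev_hash, payload)),
    )


def verify_chain() -> bool:
    """Recompute every hash in order; an edited row breaks the chain."""
    prev_hash = GENESIS
    for payload, stored_prev, stored_hash in conn.execute(
        "SELECT payload, prev_hash, hash FROM audit_records ORDER BY seq"
    ):
        if stored_prev != prev_hash or stored_hash != _digest(prev_hash, payload):
            return False
        prev_hash = stored_hash
    return True


def records_for(agent_id: str, customer_id: str, day: str, next_day: str) -> list:
    """Answer 'what did agent X do for customer Y on date Z?' using the index."""
    cur = conn.execute(
        "SELECT payload FROM audit_records "
        "WHERE agent_id = ? AND customer_id = ? AND ts_utc >= ? AND ts_utc < ? "
        "ORDER BY ts_utc",
        (agent_id, customer_id, day, next_day),
    )
    return [json.loads(p) for (p,) in cur]


append_record("agent-x", "cust-y", "2024-05-01T09:15:00.000+00:00", {"output": "Refund approved."})
append_record("agent-x", "cust-z", "2024-05-01T11:30:00.000+00:00", {"output": "Escalated."})
append_record("agent-x", "cust-y", "2024-05-02T10:00:00.000+00:00", {"output": "Refund denied."})

found = records_for("agent-x", "cust-y", "2024-05-01", "2024-05-02")
intact_before = verify_chain()

# Simulate tampering: rewrite a stored payload without updating the hashes.
conn.execute("UPDATE audit_records SET payload = ? WHERE seq = 1", ('{"output":"Refund denied."}',))
intact_after = verify_chain()
```

A hash chain detects edits but does not prevent them. Someone with full write access could recompute every hash, and dropping rows from the tail goes unnoticed, so the latest hash is usually anchored somewhere outside the database (for example, in a separate system or a periodic signed export).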
What to do this quarter
- Audit your current logs. Pick an agent in production. Try to answer “what happened in transaction X?” Time how long it takes. If more than 5 minutes, the audit trail is broken.
- Define the seven fields for your environment. Write the spec for what a complete audit record looks like.
- Implement on one agent first. Don’t try to retrofit everything at once. Pick the highest-stakes agent; get it right.
- Plan storage and retention. Talk to legal, talk to security, set retention policies that match obligations.
FAQ
Do we need to log every AI interaction, including ones from internal tools?
For decisions affecting customers or regulated processes: yes. For pure productivity use (an employee using ChatGPT to draft a doc): typically no, beyond the policy and shadow-AI controls covered in Shadow AI in Your Organization.
What about the model provider’s logs?
Treat them as supplementary, not as your audit trail. You don’t control the provider’s retention, format, or access. Your audit trail is the records you control.
How does GDPR / CCPA affect audit logs?
Customer rights to access and erasure apply to audit data that contains personal information. Build in support for both from day one. The “right to erasure” is particularly tricky if you’re using append-only storage.
Are audit logs admissible as legal evidence?
With proper documentation of the logging process, retention, and tamper-resistance: yes. Without those, the records may be challenged. Talk to counsel about what evidentiary standard you need.
What’s the cost of comprehensive AI audit logging?
For a typical mid-to-large enterprise: $0.10–$0.50 per agent invocation in storage and infrastructure cost. Add operational overhead for the audit team. Total annualized cost: $200K–$2M depending on scale. Cheap compared to the cost of not having it when you need it.
Working with JAIN on AI audit trails? We help executive teams design the records system that satisfies regulators, customers, and counsel. Book a 30-minute call.