Add this skill
npx mdskills install sernote/audit-prompt-caching
audit-prompt-caching is a portable Codex/agent skill for finding why LLM cache reuse fails across the request path: prompt/prefix caches, provider cache telemetry, cache-aware routing, agent tool stability, Bedrock checkpoints, OpenRouter routing drift, provider migration risk, and vLLM/SGLang KV reuse.
LLM cache reuse usually fails silently. A timestamp in the system prompt, shuffled tool schemas, a changed first user message, an OpenRouter fallback, or a new vLLM replica can turn repeated 20k-token requests into cold prefill again.
That failure is expensive because it often looks like a generic "LLM cost went up" or "agents got slower" incident. This skill gives agents a cache-specific audit path: inspect prefix stability, provider semantics, cache telemetry, routing locality, KV pressure, and whether caching is even the right lever.
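The silent failure described above can be made concrete with a small sketch. It hashes the parts of a request that a prefix cache would need to match and shows how a leading timestamp or a shuffled tool list changes that hash. The function name, prompt text, and tool names are hypothetical, and real providers match on tokens rather than on this exact hash, so treat it as an illustration only.

```python
import hashlib


def prefix_fingerprint(system_prompt: str, tool_names: list[str]) -> str:
    """Hash the request parts that a prefix cache would need to see unchanged."""
    material = "\n".join([system_prompt, *tool_names])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()[:16]


# Hypothetical prompt and tools, for illustration only.
POLICY = "You are a support agent. Follow the refund policy."
TOOLS = ["lookup_order", "issue_refund"]

# Stable layout: identical bytes on every request.
stable_a = prefix_fingerprint(POLICY, TOOLS)
stable_b = prefix_fingerprint(POLICY, TOOLS)

# Drifting layout: a timestamp leads the system prompt.
drift_a = prefix_fingerprint("Now: 2024-01-01T10:00:00Z\n" + POLICY, TOOLS)
drift_b = prefix_fingerprint("Now: 2024-01-01T10:00:05Z\n" + POLICY, TOOLS)

# Same tools, different order.
shuffled = prefix_fingerprint(POLICY, list(reversed(TOOLS)))

print("stable layout matches:", stable_a == stable_b)    # True
print("timestamped layout matches:", drift_a == drift_b)  # False
print("shuffled tools match:", stable_a == shuffled)      # False
```

Logging a fingerprint like this next to the provider's cache fields is one way to tell prefix drift apart from routing or eviction problems.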
Run the fixture audit locally:
git clone --depth 1 https://github.com/sernote/audit-prompt-caching.git
cd audit-prompt-caching
python3 audit-prompt-caching/scripts/analyze_usage_logs.py \
fixtures/openai/repeated_prefix_usage.jsonl
Render a report from the same fixture:
python3 audit-prompt-caching/scripts/render_audit_report.py \
--usage-log fixtures/openai/repeated_prefix_usage.jsonl \
--provider openai \
--engine "Responses API" \
--finding "fixtures/openai/repeated_prefix_usage.jsonl:1 | low | openai | cold request has zero cached tokens | first request pays full prefill | warm repeated prefix before measuring steady state | confirm warm cached_tokens increase"
Install as a Codex skill from GitHub:
tmp="$(mktemp -d)" && \
git clone --depth 1 https://github.com/sernote/audit-prompt-caching.git "$tmp" && \
mkdir -p ~/.codex/skills && \
rm -rf ~/.codex/skills/audit-prompt-caching && \
cp -R "$tmp/audit-prompt-caching" ~/.codex/skills/audit-prompt-caching && \
rm -rf "$tmp"
Then start a new Codex session and ask:
Use $audit-prompt-caching to audit this OpenAI app. cached_tokens stays at 0 even though the system prompt is 8k tokens.
+------------------------------------------------------------+
| LLM CACHE AUDIT |
+------------------------------------------------------------+
| Provider/API: openai / Responses API |
| Cache hit ratio: 59.62% |
| Output share: 7.17% |
| Main blocker: cold request has zero cached tokens |
| Cache impact: first request pays full prefill |
| Fix: warm repeated prefix before measuring steady state |
| Validate: confirm cached-token fields and TTFT improve |
+------------------------------------------------------------+
The bundled OpenAI fixture is synthetic and safe to share, but it is still executable evidence:
| Signal | Value |
|---|---|
| Records reviewed | 3 |
| Input tokens | 15,600 |
| Cached tokens | 9,300 |
| Cache hit ratio | 59.62% |
| Output share | 7.17% |
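The hit ratio in the table is consistent with cached tokens divided by input tokens. The sketch below reproduces it from the two totals; the function name is hypothetical and the bundled analyze_usage_logs.py may compute the figure differently.

```python
def cache_hit_ratio(input_tokens: int, cached_tokens: int) -> float:
    """Share of input tokens that were served from cache (assumed definition)."""
    if input_tokens <= 0:
        return 0.0
    return cached_tokens / input_tokens


# Totals taken from the fixture table above.
ratio = cache_hit_ratio(input_tokens=15_600, cached_tokens=9_300)
print(f"Cache hit ratio: {ratio:.2%}")  # Cache hit ratio: 59.62%
```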
Example ROI model for 1,000 requests with 9k static input tokens, 300 dynamic input tokens, 2k output tokens, 71% cache hit rate, and explicit sample prices:
Total cost: $34.60 -> $23.10
Total savings: 33.24%
Input savings: 61.84%
These are fixture numbers, not a production guarantee. Always validate with your provider usage fields and billing export.
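The ROI figures above can be reproduced with a simple cost model. This is a sketch under stated assumptions, not the logic of estimate_cache_roi.py: it assumes the hit rate applies only to the static tokens, ignores any cache-write surcharge, and uses sample prices of $2.00 input, $0.20 cached input, and $8.00 output per million tokens (the same values passed to estimate_cache_roi.py elsewhere in this README, which happen to reproduce the quoted totals).

```python
def estimate_roi(static_tokens, dynamic_tokens, output_tokens, requests,
                 hit_rate, input_price, cached_price, output_price):
    """Compare total cost with and without caching. Prices are per million tokens."""
    mtok = 1_000_000
    output_cost = requests * output_tokens * output_price / mtok

    # Baseline: every input token is billed at the full input price.
    base_input = requests * (static_tokens + dynamic_tokens) * input_price / mtok

    # Cached: only the static prefix can hit, and only hit_rate of the time.
    cached_static = static_tokens * hit_rate
    uncached = (static_tokens - cached_static) + dynamic_tokens
    cached_input = requests * (cached_static * cached_price
                               + uncached * input_price) / mtok

    base_total = base_input + output_cost
    cached_total = cached_input + output_cost
    return {
        "base_total": base_total,
        "cached_total": cached_total,
        "total_savings_pct": 100 * (base_total - cached_total) / base_total,
        "input_savings_pct": 100 * (base_input - cached_input) / base_input,
    }


roi = estimate_roi(static_tokens=9000, dynamic_tokens=300, output_tokens=2000,
                   requests=1000, hit_rate=0.71,
                   input_price=2.0, cached_price=0.2, output_price=8.0)
print(f"Total cost: ${roi['base_total']:.2f} -> ${roi['cached_total']:.2f}")
print(f"Total savings: {roi['total_savings_pct']:.2f}%")
print(f"Input savings: {roi['input_savings_pct']:.2f}%")
```

Because output tokens are never discounted, total savings (33.24%) are much smaller than input savings (61.84%) in this example.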
flowchart LR
A["stable tools / schemas"] --> B["stable system / developer instructions"]
B --> C["few-shot examples / static docs"]
C --> D["append-only conversation anchor"]
D --> E["late dynamic user data"]
A --> H["prefix + tool + schema hash"]
H --> I["provider cache read/write fields"]
I --> J["TTFT / cost / route metrics"]
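The flowchart's ordering can be sketched as a small data structure: static segments first, dynamic data last, and a hash over the static part only. The class, field names, and sample values are hypothetical and do not mirror any provider's request schema; the point is that late user data leaves the hash untouched while a tool change does not.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class RequestLayout:
    tools: list[dict]          # stable tools / schemas
    instructions: str          # stable system / developer instructions
    static_docs: list[str]     # few-shot examples / static docs
    conversation: list[str] = field(default_factory=list)  # append-only turns
    user_data: str = ""        # late dynamic user data

    def ordered_segments(self) -> list[str]:
        """Return segments in cache-friendly order: most stable first."""
        tools_json = json.dumps(self.tools, sort_keys=True, separators=(",", ":"))
        return [tools_json, self.instructions, *self.static_docs,
                *self.conversation, self.user_data]

    def static_hash(self) -> str:
        """Hash only tools, instructions, and static docs."""
        static = self.ordered_segments()[: 2 + len(self.static_docs)]
        return hashlib.sha256("\x1e".join(static).encode("utf-8")).hexdigest()[:16]


tools = [{"name": "search", "description": "Search the knowledge base"}]
first = RequestLayout(tools, "Answer concisely.", ["Example Q/A pair"],
                      user_data="Where is order 1?")
second = RequestLayout(tools, "Answer concisely.", ["Example Q/A pair"],
                       conversation=["Where is order 1?", "It shipped."],
                       user_data="And order 2?")
changed = RequestLayout([{"name": "search", "description": "Search docs"}],
                        "Answer concisely.", ["Example Q/A pair"])

print(first.static_hash() == second.static_hash())   # True: only late data changed
print(first.static_hash() == changed.static_hash())  # False: a tool schema changed
```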
This project is a static audit skill plus dependency-free local scripts. It complements runtime observability and gateway tools rather than replacing them.
| Project | Primary job | Static cache-path audit | Portable agent skill | Stdlib-only local scripts |
|---|---|---|---|---|
| audit-prompt-caching | Cross-provider prompt/prefix/KV cache audit | yes | yes | yes |
| ussumant/cache-audit | Claude Code cache-rules skill | Claude-focused | Claude Code-focused | single skill |
| Helicone | LLM observability and gateway | runtime-oriented | no | no |
| Langfuse | LLM observability, evals, prompt management | runtime-oriented | no | no |
| LiteLLM | LLM gateway/proxy | runtime/gateway-oriented | no | no |
The skill includes small dependency-free helpers for repeatable audits:
python3 audit-prompt-caching/scripts/extract_llm_calls.py .
python3 audit-prompt-caching/scripts/layout_linter.py fixtures/layout/good_openai_request.json
python3 audit-prompt-caching/scripts/prefix_stability_check.py before.json after.json
python3 audit-prompt-caching/scripts/analyze_usage_logs.py usage.jsonl
python3 audit-prompt-caching/scripts/analyze_usage_logs.py --jsonl-normalized usage.jsonl
python3 audit-prompt-caching/scripts/estimate_cache_roi.py \
--static-tokens 9000 \
--dynamic-tokens 300 \
--output-tokens 2000 \
--requests 100 \
--hit-rate 0.8 \
--input-price-per-mtok 2.0 \
--cached-input-price-per-mtok 0.2 \
--output-price-per-mtok 8.0
python3 audit-prompt-caching/scripts/render_audit_report.py \
--usage-log fixtures/openai/repeated_prefix_usage.jsonl \
--provider openai \
--engine "Responses API" \
--finding "fixtures/openai/repeated_prefix_usage.jsonl:1 | low | openai | cold request has zero cached tokens | first request pays full prefill | warm repeated prefix before measuring steady state | confirm warm cached_tokens increase"
python3 audit-prompt-caching/scripts/validate_skill_package.py audit-prompt-caching
python3 audit-prompt-caching/scripts/run_trigger_eval.py audit-prompt-caching
prefix_stability_check.py compares raw bytes by default so JSON key-order drift is visible. Use --canonical-json only when sorted-key normalization is intentional.
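The difference between raw-byte and canonical comparison can be illustrated with two payloads that carry the same data in a different key order. This sketch is not the implementation of prefix_stability_check.py; the function names and payloads are hypothetical.

```python
import hashlib
import json


def raw_digest(payload: bytes) -> str:
    """Hash the bytes exactly as they would be sent."""
    return hashlib.sha256(payload).hexdigest()


def canonical_digest(payload: bytes) -> str:
    """Hash after sorted-key normalization, which hides key-order drift."""
    obj = json.loads(payload)
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same content, different key order (hypothetical request bodies).
before = b'{"model": "m", "tools": [{"name": "search", "strict": true}]}'
after = b'{"tools": [{"strict": true, "name": "search"}], "model": "m"}'

print("raw bytes equal:", raw_digest(before) == raw_digest(after))              # False
print("canonical equal:", canonical_digest(before) == canonical_digest(after))  # True
```

A canonical match with a raw mismatch points at serialization order rather than content, which is exactly the drift a raw-byte default is meant to surface.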
Provider usage metadata and billing exports remain authoritative; these scripts are audit aids.
Use these as pressure scenarios, not generic smoke tests.
OpenAI-compatible wrapper ambiguity:
Use $audit-prompt-caching to review this app. It imports the OpenAI SDK, but base_url points to https://openrouter.ai/api/v1. We added prompt_cache_key, provider.order, and openrouter/auto; cache_write_tokens appears, but cached_tokens stays zero. Decide whether this is an OpenAI issue or a router/cache-locality issue.
Claude automatic caching writes every request:
Use $audit-prompt-caching to audit our Claude layout. We added top-level cache_control to an 18k-token policy prompt, then append timestamp and user question as the final content block. usage.cache_creation_input_tokens increments every request, but cache_read_input_tokens stays zero.
Bedrock Converse cross-region cachePoint:
Use $audit-prompt-caching to review this Bedrock Converse request. cachePoint is placed after a user-specific intro, tools differ by route, CacheWriteInputTokens is high, CacheReadInputTokens is near zero, and some traffic uses cross-region inference.
MCP tool registry drift:
Use $audit-prompt-caching to audit our coding agent. The MCP tool registry is queried every step, tool order changes with plugin load timing, read-only mode removes write tools, and compaction rewrites the first user turn. Costs rose even though each step sends fewer tools.
vLLM/SGLang multi-replica KV:
Use $audit-prompt-caching to inspect this self-hosted deployment. vLLM/SGLang replicas sit behind a generic gateway, p99 prompt length is 12k, max_model_len is 128k, prefix hashes look stable, but TTFT spikes after scaling and prefix-cache metrics vary by replica.
High cached tokens, low savings:
Use $audit-prompt-caching to explain why this workload still costs too much. cached_tokens is high and TTFT improved, but responses average 4k output tokens, tool calls add seconds, TPM errors did not improve, and finance wants to know whether prompt caching is the wrong lever.
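For the vLLM/SGLang multi-replica scenario above, one common remedy is to route by prefix so repeated prefixes land on the replica that already holds their KV blocks. The following is a minimal sketch using rendezvous hashing; the replica names and prefix hashes are hypothetical, and it is not a feature of vLLM, SGLang, or this skill.

```python
import hashlib


def _score(prefix_hash: str, replica: str) -> int:
    """Deterministic score for a (prefix, replica) pair."""
    digest = hashlib.sha256(f"{prefix_hash}|{replica}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_replica(prefix_hash: str, replicas: list[str]) -> str:
    """Rendezvous hashing: the replica with the highest score wins."""
    return max(replicas, key=lambda replica: _score(prefix_hash, replica))


REPLICAS = ["replica-a", "replica-b", "replica-c"]
PREFIXES = [f"prefix-{i}" for i in range(20)]

before_scale = {p: pick_replica(p, REPLICAS) for p in PREFIXES}
after_scale = {p: pick_replica(p, REPLICAS + ["replica-d"]) for p in PREFIXES}

# After scaling out, a prefix either stays put or moves to the new replica,
# so only the moved prefixes pay a cold prefill.
moved = [p for p in PREFIXES if before_scale[p] != after_scale[p]]
print(f"{len(moved)} of {len(PREFIXES)} prefixes moved after adding a replica")
```

A generic round-robin gateway, by contrast, spreads each prefix across every replica, which matches the symptom of prefix-cache metrics that vary by replica.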
audit-prompt-caching/
SKILL.md
agents/openai.yaml
references/
openai.md
openrouter.md
azure-openai.md
anthropic.md
bedrock.md
agent-tools.md
sglang.md
vllm.md
deepseek.md
economics.md
gemini.md
mechanics.md
predeploy-checklist.md
report-template.md
qwen.md
yandexgpt.md
zai.md
use-cases.md
scripts/
analyze_usage_logs.py
estimate_cache_roi.py
extract_llm_calls.py
layout_linter.py
prefix_stability_check.py
render_audit_report.py
validate_skill_package.py
run_trigger_eval.py
evals/
evals.json
trigger_eval.json
fixtures/
openai/
anthropic/
bedrock/
openrouter/
vllm/
expected/
Validate the skill package with the bundled validator:
python3 audit-prompt-caching/scripts/validate_skill_package.py audit-prompt-caching
python3 audit-prompt-caching/scripts/run_trigger_eval.py audit-prompt-caching
The repository also includes JSON eval prompts:
audit-prompt-caching/evals/evals.json: behavioral audit scenarios.
audit-prompt-caching/evals/trigger_eval.json: should-trigger and should-not-trigger queries.
Run the local script/package tests:
python3 -m unittest tests/test_prompt_cache_scripts.py
These evals are a starting point. A full proof cycle should still compare baseline agent behavior against behavior with the skill enabled.
CI runs the unittest suite, package validator, trigger eval, Python syntax compile, whitespace check, and generated-bytecode guard. Keep new scripts stdlib-only and add fixture-backed tests for behavior changes.
Provider cache behavior changes over time. The skill treats bundled provider references as heuristics and instructs the agent to verify official docs before making exact claims about pricing, TTL, model support, field names, cache-control semantics, or routing hints.
MIT. See LICENSE.
Install LLM Cache Audit Skill with a single command:
npx mdskills install sernote/audit-prompt-caching
This downloads the skill files into your project, and your AI agent picks them up automatically.
LLM Cache Audit Skill works with Claude Code, Claude Desktop, Cursor, VS Code Copilot, Windsurf, Continue Dev, Codex, Gemini CLI, Amp, Roo Code, Goose, Opencode, Trae, Qodo, and Command Code. Skills use the open SKILL.md format, which is compatible with any AI coding agent that reads markdown instructions.