# LLM Cache Audit Skill

`audit-prompt-caching` is a portable Codex/agent skill for finding why LLM cache reuse fails across the request path: prompt/prefix caches, provider cache telemetry, cache-aware routing, agent tool stability, Bedrock checkpoints, OpenRouter routing drift, provider migration risk, and vLLM/SGLang KV reuse.

## Why This Exists

LLM cache reuse usually fails silently. A timestamp in the system prompt, shuffled tool schemas, a changed first user message, an OpenRouter fallback, or a new vLLM replica can turn repeated 20k-token requests into cold prefill again.

That failure is expensive because it often looks like a generic "LLM cost went up" or "agents got slower" incident. This skill gives agents a cache-specific audit path: inspect prefix stability, provider semantics, cache telemetry, routing locality, KV pressure, and whether caching is even the right lever.

## Quick Start

Run the fixture audit locally:

```bash
git clone --depth 1 https://github.com/sernote/audit-prompt-caching.git
cd audit-prompt-caching
python3 audit-prompt-caching/scripts/analyze_usage_logs.py \
  fixtures/openai/repeated_prefix_usage.jsonl
```

Render a report from the same fixture:

```bash
python3 audit-prompt-caching/scripts/render_audit_report.py \
  --usage-log fixtures/openai/repeated_prefix_usage.jsonl \
  --provider openai \
  --engine "Responses API" \
  --finding "fixtures/openai/repeated_prefix_usage.jsonl:1 | low | openai | cold request has zero cached tokens | first request pays full prefill | warm repeated prefix before measuring steady state | confirm warm cached_tokens increase"
```

Install as a Codex skill from GitHub:

```bash
tmp="$(mktemp -d)" && \
git clone --depth 1 https://github.com/sernote/audit-prompt-caching.git "$tmp" && \
mkdir -p ~/.codex/skills && \
rm -rf ~/.codex/skills/audit-prompt-caching && \
cp -R "$tmp/audit-prompt-caching" ~/.codex/skills/audit-prompt-caching && \
rm -rf "$tmp"
```

Then start a new Codex session and ask:

```text
Use $audit-prompt-caching to audit this OpenAI app. cached_tokens stays at 0 even though the system prompt is 8k tokens.
```

## Audit Hero Shot

```text
+------------------------------------------------------------+
| LLM CACHE AUDIT                                            |
+------------------------------------------------------------+
| Provider/API: openai / Responses API                       |
| Cache hit ratio: 59.62%                                    |
| Output share: 7.17%                                        |
| Main blocker: cold request has zero cached tokens          |
| Cache impact: first request pays full prefill              |
| Fix: warm repeated prefix before measuring steady state    |
| Validate: confirm cached-token fields and TTFT improve     |
+------------------------------------------------------------+
```

## Fixture Signal

The bundled OpenAI fixture is synthetic and safe to share, but it is still executable evidence:

| Signal | Value |
|---|---:|
| Records reviewed | 3 |
| Input tokens | 15,600 |
| Cached tokens | 9,300 |
| Cache hit ratio | 59.62% |
| Output share | 7.17% |

Example ROI model for 1,000 requests with 9k static input tokens, 300 dynamic input tokens, 2k output tokens, 71% cache hit rate, and explicit sample prices:

```text
Total cost: $34.60 -> $23.10
Total savings: 33.24%
Input savings: 61.84%
```

These are fixture numbers, not a production guarantee. Always validate with your provider usage fields and billing export.

## Cache Flow

```mermaid
flowchart LR
    A["stable tools / schemas"] --> B["stable system / developer instructions"]
    B --> C["few-shot examples / static docs"]
    C --> D["append-only conversation anchor"]
    D --> E["late dynamic user data"]
    A --> H["prefix + tool + schema hash"]
    H --> I["provider cache read/write fields"]
    I --> J["TTFT / cost / route metrics"]
```

## Positioning

This project is a static audit skill plus dependency-free local scripts. It complements runtime observability and gateway tools rather than replacing them.

| Project | Primary job | Static cache-path audit | Portable agent skill | Stdlib-only local scripts |
|---|---|---:|---:|---:|
| `audit-prompt-caching` | Cross-provider prompt/prefix/KV cache audit | yes | yes | yes |
| [ussumant/cache-audit](https://github.com/ussumant/cache-audit) | Claude Code cache-rules skill | Claude-focused | Claude Code-focused | single skill |
| [Helicone](https://github.com/Helicone/helicone) | LLM observability and gateway | runtime-oriented | no | no |
| [Langfuse](https://github.com/langfuse/langfuse) | LLM observability, evals, prompt management | runtime-oriented | no | no |
| [LiteLLM](https://github.com/BerriAI/litellm) | LLM gateway/proxy | runtime/gateway-oriented | no | no |

## Who It Is For

- AI engineers debugging prompt-cache misses or long TTFT.
- Backend engineers building LLM request paths.
- Agent developers working with tools, MCP, compaction, or coding assistants.
- Platform/SRE engineers running vLLM, SGLang, or multi-replica inference.
- Teams comparing providers or estimating effective LLM cost.

## What It Audits

- Prompt-cache applicability before recommending changes.
- Stable prompt prefix layout.
- Volatile data in system prompts and early messages.
- Non-deterministic tool/schema serialization.
- Dynamic tool sets inside agent loops.
- History truncation, compaction, and summarization.
- Cache-aware routing for managed and self-hosted inference.
- OpenRouter sticky routing, provider fallback, and cache read/write fields.
- Amazon Bedrock cache checkpoints and read/write fields.
- Prefill vs decode latency and output-token cost share.
- KV-cache budget, eviction, and deployment config.
- Provider-specific usage fields and docs freshness.
- ROI assumptions across static, dynamic, and output tokens.
- CI/smoke-test readiness for stable prefix drift.

## Bundled Scripts

The skill includes small dependency-free helpers for repeatable audits:

```bash
python3 audit-prompt-caching/scripts/extract_llm_calls.py .
python3 audit-prompt-caching/scripts/layout_linter.py fixtures/layout/good_openai_request.json
python3 audit-prompt-caching/scripts/prefix_stability_check.py before.json after.json
python3 audit-prompt-caching/scripts/analyze_usage_logs.py usage.jsonl
python3 audit-prompt-caching/scripts/analyze_usage_logs.py --jsonl-normalized usage.jsonl
python3 audit-prompt-caching/scripts/estimate_cache_roi.py \
  --static-tokens 9000 \
  --dynamic-tokens 300 \
  --output-tokens 2000 \
  --requests 100 \
  --hit-rate 0.8 \
  --input-price-per-mtok 2.0 \
  --cached-input-price-per-mtok 0.2 \
  --output-price-per-mtok 8.0
python3 audit-prompt-caching/scripts/render_audit_report.py \
  --usage-log fixtures/openai/repeated_prefix_usage.jsonl \
  --provider openai \
  --engine "Responses API" \
  --finding "fixtures/openai/repeated_prefix_usage.jsonl:1 | low | openai | cold request has zero cached tokens | first request pays full prefill | warm repeated prefix before measuring steady state | confirm warm cached_tokens increase"
python3 audit-prompt-caching/scripts/validate_skill_package.py audit-prompt-caching
python3 audit-prompt-caching/scripts/run_trigger_eval.py audit-prompt-caching
```

`prefix_stability_check.py` compares raw bytes by default so JSON key-order drift is visible. Use `--canonical-json` only when sorted-key normalization is intentional.

Provider usage metadata and billing exports remain authoritative; these scripts are audit aids.

## Example Prompts

Use these as pressure scenarios, not generic smoke tests.

OpenAI-compatible wrapper ambiguity:

```text
Use $audit-prompt-caching to review this app. It imports the OpenAI SDK, but base_url points to https://openrouter.ai/api/v1. We added prompt_cache_key, provider.order, and openrouter/auto; cache_write_tokens appears, but cached_tokens stays zero. Decide whether this is an OpenAI issue or a router/cache-locality issue.
```

Claude automatic caching writes every request:

```text
Use $audit-prompt-caching to audit our Claude layout. We added top-level cache_control to an 18k-token policy prompt, then append timestamp and user question as the final content block. usage.cache_creation_input_tokens increments every request, but cache_read_input_tokens stays zero.
```

Bedrock Converse cross-region cachePoint:

```text
Use $audit-prompt-caching to review this Bedrock Converse request. cachePoint is placed after a user-specific intro, tools differ by route, CacheWriteInputTokens is high, CacheReadInputTokens is near zero, and some traffic uses cross-region inference.
```

MCP tool registry drift:

```text
Use $audit-prompt-caching to audit our coding agent. The MCP tool registry is queried every step, tool order changes with plugin load timing, read-only mode removes write tools, and compaction rewrites the first user turn. Costs rose even though each step sends fewer tools.
```

vLLM/SGLang multi-replica KV:

```text
Use $audit-prompt-caching to inspect this self-hosted deployment. vLLM/SGLang replicas sit behind a generic gateway, p99 prompt length is 12k, max_model_len is 128k, prefix hashes look stable, but TTFT spikes after scaling and prefix-cache metrics vary by replica.
```

High cached tokens, low savings:

```text
Use $audit-prompt-caching to explain why this workload still costs too much. cached_tokens is high and TTFT improved, but responses average 4k output tokens, tool calls add seconds, TPM errors did not improve, and finance wants to know whether prompt caching is the wrong lever.
```

## Structure

```text
audit-prompt-caching/
  SKILL.md
  agents/openai.yaml
  references/
    openai.md
    openrouter.md
    azure-openai.md
    anthropic.md
    bedrock.md
    agent-tools.md
    sglang.md
    vllm.md
    deepseek.md
    economics.md
    gemini.md
    mechanics.md
    predeploy-checklist.md
    report-template.md
    qwen.md
    yandexgpt.md
    zai.md
    use-cases.md
  scripts/
    analyze_usage_logs.py
    estimate_cache_roi.py
    extract_llm_calls.py
    layout_linter.py
    prefix_stability_check.py
    render_audit_report.py
    validate_skill_package.py
    run_trigger_eval.py
  evals/
    evals.json
    trigger_eval.json
fixtures/
  openai/
  anthropic/
  bedrock/
  openrouter/
  vllm/
  expected/
```

## Validation

Validate the skill package with the bundled validator:

```bash
python3 audit-prompt-caching/scripts/validate_skill_package.py audit-prompt-caching
python3 audit-prompt-caching/scripts/run_trigger_eval.py audit-prompt-caching
```

The repository also includes JSON eval prompts:

- `audit-prompt-caching/evals/evals.json`: behavioral audit scenarios.
- `audit-prompt-caching/evals/trigger_eval.json`: should-trigger and should-not-trigger queries.

Run the local script/package tests:

```bash
python3 -m unittest tests/test_prompt_cache_scripts.py
```

These evals are a starting point. A full proof cycle should still compare baseline agent behavior against behavior with the skill enabled.

## Project Quality Gates

CI runs the unittest suite, package validator, trigger eval, Python syntax compile, whitespace check, and generated-bytecode guard. Keep new scripts stdlib-only and add fixture-backed tests for behavior changes.

## Freshness Policy

Provider cache behavior changes. The skill treats bundled provider references as heuristics and instructs the agent to verify official docs before exact claims about pricing, TTL, model support, field names, cache-control semantics, or routing hints.

## License

MIT. See `LICENSE`.
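
## Worked Example: ROI Arithmetic

The ROI example under Fixture Signal does not state its sample prices. The sketch below is one way to reproduce the arithmetic, assuming the per-million-token prices used in the `estimate_cache_roi.py` command under Bundled Scripts (2.0 input, 0.2 cached input, 8.0 output) and no cache-write surcharge. Under those assumptions it matches the quoted figures. `Workload`, `Prices`, and `estimate` are hypothetical names for illustration and are not taken from the bundled script.

```python
from dataclasses import dataclass

MTOK = 1_000_000  # prices are quoted per million tokens


@dataclass(frozen=True)
class Workload:
    requests: int
    static_tokens: int   # cacheable prefix tokens per request
    dynamic_tokens: int  # per-request input that is never cached
    output_tokens: int
    hit_rate: float      # fraction of static tokens billed at the cached rate


@dataclass(frozen=True)
class Prices:
    input_per_mtok: float
    cached_input_per_mtok: float
    output_per_mtok: float


def estimate(w: Workload, p: Prices) -> dict:
    """Compare cost with no caching against cost at the given hit rate."""
    input_total = w.requests * (w.static_tokens + w.dynamic_tokens)
    cached = w.requests * w.static_tokens * w.hit_rate
    uncached = input_total - cached

    # Output tokens are billed the same with or without caching.
    output_cost = w.requests * w.output_tokens * p.output_per_mtok / MTOK
    input_before = input_total * p.input_per_mtok / MTOK
    input_after = (
        uncached * p.input_per_mtok + cached * p.cached_input_per_mtok
    ) / MTOK

    total_before = input_before + output_cost
    total_after = input_after + output_cost
    return {
        "total_before": total_before,
        "total_after": total_after,
        "total_savings_pct": 100 * (total_before - total_after) / total_before,
        "input_savings_pct": 100 * (input_before - input_after) / input_before,
    }


# Workload from the Fixture Signal example; prices are an assumption (see above).
result = estimate(
    Workload(requests=1000, static_tokens=9000, dynamic_tokens=300,
             output_tokens=2000, hit_rate=0.71),
    Prices(input_per_mtok=2.0, cached_input_per_mtok=0.2, output_per_mtok=8.0),
)
print(f"Total cost: ${result['total_before']:.2f} -> ${result['total_after']:.2f}")
print(f"Total savings: {result['total_savings_pct']:.2f}%")
print(f"Input savings: {result['input_savings_pct']:.2f}%")
```

Output cost ($16.00 of the $34.60 baseline here) is untouched by caching, which is why total savings are roughly half of input savings for this workload. The fixture's 59.62% cache hit ratio comes from the same kind of division: 9,300 cached tokens out of 15,600 input tokens.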
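
## Worked Example: Raw Bytes vs Canonical JSON

The note under Bundled Scripts says `prefix_stability_check.py` compares raw bytes by default so that JSON key-order drift stays visible. The sketch below illustrates that distinction only. It is not the script's implementation: `raw_digest` and `canonical_digest` are hypothetical helpers, and the tool schema is invented for the example.

```python
import hashlib
import json


def raw_digest(payload: bytes) -> str:
    """Hash the request bytes exactly as they would be sent."""
    return hashlib.sha256(payload).hexdigest()


def canonical_digest(payload: bytes) -> str:
    """Hash after re-serializing with sorted keys and fixed separators."""
    obj = json.loads(payload)
    canonical = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two serializations of the same hypothetical tool schema; only key order differs.
before = b'{"name": "search", "parameters": {"type": "object", "required": ["q"]}}'
after = b'{"parameters": {"required": ["q"], "type": "object"}, "name": "search"}'

raw_changed = raw_digest(before) != raw_digest(after)
canonical_changed = canonical_digest(before) != canonical_digest(after)

print(f"raw bytes changed: {raw_changed}")
print(f"canonical JSON changed: {canonical_changed}")
```

The raw comparison reports a change while the canonical comparison does not. If the serialized bytes are what reaches the provider or engine, the canonical view can hide exactly the drift an audit is looking for, which is consistent with the README's advice to use `--canonical-json` only when normalization is intentional.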