How do I install LLM Cache Audit Skill?

Install LLM Cache Audit Skill with a single command: npx mdskills install sernote/audit-prompt-caching. This downloads the skill files into your project and your AI agent picks them up automatically.

What platforms support LLM Cache Audit Skill?

LLM Cache Audit Skill works with Claude Code, Claude Desktop, Cursor, Vscode Copilot, Windsurf, Continue Dev, Codex, Gemini Cli, Amp, Roo Code, Goose, Opencode, Trae, Qodo, Command Code. Skills use the open SKILL.md format which is compatible with any AI coding agent that reads markdown instructions.

← Back to skills

LLM Cache Audit Skill

Name: LLM Cache Audit Skill: AI Agent Skill
Brand: sernote
Availability: InStock
Author: sernote

ProductivityIntermediate

audit-prompt-caching is a portable Codex/agent skill for finding why LLM cache reuse fails across the request path: prompt/prefix caches, provider cache telemetry, cache-aware routing, agent tool stability, Bedrock checkpoints, OpenRouter routing drift, provider migration risk, and vLLM/SGLang KV reuse. LLM cache reuse usually fails silently. A timestamp in the system prompt, shuffled tool schemas

by @sernoteUpdated 4/28/2026

Add this skill

npx mdskills install sernote/audit-prompt-caching

Fork & Edit

SKILL.md

Edit in Browser

1# LLM Cache Audit Skill
2 
3[![CI](https://github.com/sernote/audit-prompt-caching/actions/workflows/ci.yml/badge.svg)](https://github.com/sernote/audit-prompt-caching/actions/workflows/ci.yml)
4[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
5![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue)
6![Stdlib only](https://img.shields.io/badge/scripts-stdlib--only-green)
7![Codex skill](https://img.shields.io/badge/Codex-skill-compatible-black)
8 
9`audit-prompt-caching` is a portable Codex/agent skill for finding why LLM cache reuse fails across the request path: prompt/prefix caches, provider cache telemetry, cache-aware routing, agent tool stability, Bedrock checkpoints, OpenRouter routing drift, provider migration risk, and vLLM/SGLang KV reuse.
10 
11## Why This Exists
12 
13LLM cache reuse usually fails silently. A timestamp in the system prompt, shuffled tool schemas, a changed first user message, an OpenRouter fallback, or a new vLLM replica can turn repeated 20k-token requests into cold prefill again.
14 
15That failure is expensive because it often looks like a generic "LLM cost went up" or "agents got slower" incident. This skill gives agents a cache-specific audit path: inspect prefix stability, provider semantics, cache telemetry, routing locality, KV pressure, and whether caching is even the right lever.
16 
17## Quick Start
18 
19Run the fixture audit locally:
20 
21```bash
22git clone --depth 1 https://github.com/sernote/audit-prompt-caching.git
23cd audit-prompt-caching
24python3 audit-prompt-caching/scripts/analyze_usage_logs.py \
25  fixtures/openai/repeated_prefix_usage.jsonl
26```
27 
28Render a report from the same fixture:
29 
30```bash
31python3 audit-prompt-caching/scripts/render_audit_report.py \
32  --usage-log fixtures/openai/repeated_prefix_usage.jsonl \
33  --provider openai \
34  --engine "Responses API" \
35  --finding "fixtures/openai/repeated_prefix_usage.jsonl:1 | low | openai | cold request has zero cached tokens | first request pays full prefill | warm repeated prefix before measuring steady state | confirm warm cached_tokens increase"
36```
37 
38Install as a Codex skill from GitHub:
39 
40```bash
41tmp="$(mktemp -d)" && \
42git clone --depth 1 https://github.com/sernote/audit-prompt-caching.git "$tmp" && \
43mkdir -p ~/.codex/skills && \
44rm -rf ~/.codex/skills/audit-prompt-caching && \
45cp -R "$tmp/audit-prompt-caching" ~/.codex/skills/audit-prompt-caching && \
46rm -rf "$tmp"
47```
48 
49Then start a new Codex session and ask:
50 
51```text
52Use $audit-prompt-caching to audit this OpenAI app. cached_tokens stays at 0 even though the system prompt is 8k tokens.
53```
54 
55## Audit Hero Shot
56 
57```text
58+------------------------------------------------------------+
59| LLM CACHE AUDIT                                            |
60+------------------------------------------------------------+
61| Provider/API: openai / Responses API                       |
62| Cache hit ratio: 59.62%                                    |
63| Output share: 7.17%                                        |
64| Main blocker: cold request has zero cached tokens           |
65| Cache impact: first request pays full prefill               |
66| Fix: warm repeated prefix before measuring steady state     |
67| Validate: confirm cached-token fields and TTFT improve      |
68+------------------------------------------------------------+
69```
70 
71## Fixture Signal
72 
73The bundled OpenAI fixture is synthetic and safe to share, but it is still executable evidence:
74 
75| Signal | Value |
76|---|---:|
77| Records reviewed | 3 |
78| Input tokens | 15,600 |
79| Cached tokens | 9,300 |
80| Cache hit ratio | 59.62% |
81| Output share | 7.17% |
82 
83Example ROI model for 1,000 requests with 9k static input tokens, 300 dynamic input tokens, 2k output tokens, 71% cache hit rate, and explicit sample prices:
84 
85```text
86Total cost: $34.60 -> $23.10
87Total savings: 33.24%
88Input savings: 61.84%
89```
90 
91These are fixture numbers, not a production guarantee. Always validate with your provider usage fields and billing export.
92 
93## Cache Flow
94 
95```mermaid
96flowchart LR
97  A["stable tools / schemas"] --> B["stable system / developer instructions"]
98  B --> C["few-shot examples / static docs"]
99  C --> D["append-only conversation anchor"]
100  D --> E["late dynamic user data"]
101  A --> H["prefix + tool + schema hash"]
102  H --> I["provider cache read/write fields"]
103  I --> J["TTFT / cost / route metrics"]
104```
105 
106## Positioning
107 
108This project is a static audit skill plus dependency-free local scripts. It complements runtime observability and gateway tools rather than replacing them.
109 
110| Project | Primary job | Static cache-path audit | Portable agent skill | Stdlib-only local scripts |
111|---|---|---:|---:|---:|
112| `audit-prompt-caching` | Cross-provider prompt/prefix/KV cache audit | yes | yes | yes |
113| [ussumant/cache-audit](https://github.com/ussumant/cache-audit) | Claude Code cache-rules skill | Claude-focused | Claude Code-focused | single skill |
114| [Helicone](https://github.com/Helicone/helicone) | LLM observability and gateway | runtime-oriented | no | no |
115| [Langfuse](https://github.com/langfuse/langfuse) | LLM observability, evals, prompt management | runtime-oriented | no | no |
116| [LiteLLM](https://github.com/BerriAI/litellm) | LLM gateway/proxy | runtime/gateway-oriented | no | no |
117 
118## Who It Is For
119 
120- AI engineers debugging prompt-cache misses or long TTFT.
121- Backend engineers building LLM request paths.
122- Agent developers working with tools, MCP, compaction, or coding assistants.
123- Platform/SRE engineers running vLLM, SGLang, or multi-replica inference.
124- Teams comparing providers or estimating effective LLM cost.
125 
126## What It Audits
127 
128- Prompt-cache applicability before recommending changes.
129- Stable prompt prefix layout.
130- Volatile data in system prompts and early messages.
131- Non-deterministic tool/schema serialization.
132- Dynamic tool sets inside agent loops.
133- History truncation, compaction, and summarization.
134- Cache-aware routing for managed and self-hosted inference.
135- OpenRouter sticky routing, provider fallback, and cache read/write fields.
136- Amazon Bedrock cache checkpoints and read/write fields.
137- Prefill vs decode latency and output-token cost share.
138- KV-cache budget, eviction, and deployment config.
139- Provider-specific usage fields and docs freshness.
140- ROI assumptions across static, dynamic, and output tokens.
141- CI/smoke-test readiness for stable prefix drift.
142 
143## Bundled Scripts
144 
145The skill includes small dependency-free helpers for repeatable audits:
146 
147```bash
148python3 audit-prompt-caching/scripts/extract_llm_calls.py .
149python3 audit-prompt-caching/scripts/layout_linter.py fixtures/layout/good_openai_request.json
150python3 audit-prompt-caching/scripts/prefix_stability_check.py before.json after.json
151python3 audit-prompt-caching/scripts/analyze_usage_logs.py usage.jsonl
152python3 audit-prompt-caching/scripts/analyze_usage_logs.py --jsonl-normalized usage.jsonl
153python3 audit-prompt-caching/scripts/estimate_cache_roi.py \
154  --static-tokens 9000 \
155  --dynamic-tokens 300 \
156  --output-tokens 2000 \
157  --requests 100 \
158  --hit-rate 0.8 \
159  --input-price-per-mtok 2.0 \
160  --cached-input-price-per-mtok 0.2 \
161  --output-price-per-mtok 8.0
162python3 audit-prompt-caching/scripts/render_audit_report.py \
163  --usage-log fixtures/openai/repeated_prefix_usage.jsonl \
164  --provider openai \
165  --engine "Responses API" \
166  --finding "fixtures/openai/repeated_prefix_usage.jsonl:1 | low | openai | cold request has zero cached tokens | first request pays full prefill | warm repeated prefix before measuring steady state | confirm warm cached_tokens increase"
167python3 audit-prompt-caching/scripts/validate_skill_package.py audit-prompt-caching
168python3 audit-prompt-caching/scripts/run_trigger_eval.py audit-prompt-caching
169```
170 
171`prefix_stability_check.py` compares raw bytes by default so JSON key-order drift is visible. Use `--canonical-json` only when sorted-key normalization is intentional.
172 
173Provider usage metadata and billing exports remain authoritative; these scripts are audit aids.
174 
175## Example Prompts
176 
177Use these as pressure scenarios, not generic smoke tests.
178 
179OpenAI-compatible wrapper ambiguity:
180 
181```text
182Use $audit-prompt-caching to review this app. It imports the OpenAI SDK, but base_url points to https://openrouter.ai/api/v1. We added prompt_cache_key, provider.order, and openrouter/auto; cache_write_tokens appears, but cached_tokens stays zero. Decide whether this is an OpenAI issue or a router/cache-locality issue.
183```
184 
185Claude automatic caching writes every request:
186 
187```text
188Use $audit-prompt-caching to audit our Claude layout. We added top-level cache_control to an 18k-token policy prompt, then append timestamp and user question as the final content block. usage.cache_creation_input_tokens increments every request, but cache_read_input_tokens stays zero.
189```
190 
191Bedrock Converse cross-region cachePoint:
192 
193```text
194Use $audit-prompt-caching to review this Bedrock Converse request. cachePoint is placed after a user-specific intro, tools differ by route, CacheWriteInputTokens is high, CacheReadInputTokens is near zero, and some traffic uses cross-region inference.
195```
196 
197MCP tool registry drift:
198 
199```text
200Use $audit-prompt-caching to audit our coding agent. The MCP tool registry is queried every step, tool order changes with plugin load timing, read-only mode removes write tools, and compaction rewrites the first user turn. Costs rose even though each step sends fewer tools.
201```
202 
203vLLM/SGLang multi-replica KV:
204 
205```text
206Use $audit-prompt-caching to inspect this self-hosted deployment. vLLM/SGLang replicas sit behind a generic gateway, p99 prompt length is 12k, max_model_len is 128k, prefix hashes look stable, but TTFT spikes after scaling and prefix-cache metrics vary by replica.
207```
208 
209High cached tokens, low savings:
210 
211```text
212Use $audit-prompt-caching to explain why this workload still costs too much. cached_tokens is high and TTFT improved, but responses average 4k output tokens, tool calls add seconds, TPM errors did not improve, and finance wants to know whether prompt caching is the wrong lever.
213```
214 
215## Structure
216 
217```text
218audit-prompt-caching/
219  SKILL.md
220  agents/openai.yaml
221  references/
222    openai.md
223    openrouter.md
224    azure-openai.md
225    anthropic.md
226    bedrock.md
227    agent-tools.md
228    sglang.md
229    vllm.md
230    deepseek.md
231    economics.md
232    gemini.md
233    mechanics.md
234    predeploy-checklist.md
235    report-template.md
236    qwen.md
237    yandexgpt.md
238    zai.md
239    use-cases.md
240  scripts/
241    analyze_usage_logs.py
242    estimate_cache_roi.py
243    extract_llm_calls.py
244    layout_linter.py
245    prefix_stability_check.py
246    render_audit_report.py
247    validate_skill_package.py
248    run_trigger_eval.py
249  evals/
250    evals.json
251    trigger_eval.json
252fixtures/
253  openai/
254  anthropic/
255  bedrock/
256  openrouter/
257  vllm/
258  expected/
259```
260 
261## Validation
262 
263Validate the skill package with the bundled validator:
264 
265```bash
266python3 audit-prompt-caching/scripts/validate_skill_package.py audit-prompt-caching
267python3 audit-prompt-caching/scripts/run_trigger_eval.py audit-prompt-caching
268```
269 
270The repository also includes JSON eval prompts:
271 
272- `audit-prompt-caching/evals/evals.json`: behavioral audit scenarios.
273- `audit-prompt-caching/evals/trigger_eval.json`: should-trigger and should-not-trigger queries.
274 
275Run the local script/package tests:
276 
277```bash
278python3 -m unittest tests/test_prompt_cache_scripts.py
279```
280 
281These evals are a starting point. A full proof cycle should still compare baseline agent behavior against behavior with the skill enabled.
282 
283## Project Quality Gates
284 
285CI runs the unittest suite, package validator, trigger eval, Python syntax compile, whitespace check, and generated-bytecode guard. Keep new scripts stdlib-only and add fixture-backed tests for behavior changes.
286 
287## Freshness Policy
288 
289Provider cache behavior changes. The skill treats bundled provider references as heuristics and instructs the agent to verify official docs before exact claims about pricing, TTL, model support, field names, cache-control semantics, or routing hints.
290 
291## License
292 
293MIT. See `LICENSE`.
294

Full transparency — inspect the skill content before installing.

New to skill.md files?

See what a SKILL.md file is, how to install one, and how it differs from AGENTS.md or cursorrules.

Read the guide →