# Prompt Caching Quick Reference Cheat Sheet

By Anablock (AI Insights & Innovations)
## Quick Start

### Enable Caching (TypeScript)

```typescript
const response = await client.messages.create({
  model: 'claude-3-5-sonnet-20241022',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: 'Your system prompt...',
      cache_control: { type: 'ephemeral' } // <-- Add this
    }
  ],
  messages: [...]
});
```
### Enable Caching (Python)

```python
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "Your system prompt...",
            "cache_control": {"type": "ephemeral"}  # <-- Add this
        }
    ],
    messages=[...]
)
```
## Key Metrics

| Metric | Value |
|--------|-------|
| Cache Lifetime | 5 minutes by default, refreshed on each cache hit (a 1-hour TTL is available at a higher write cost) |
| Minimum Tokens | 1,024 tokens |
| Max Breakpoints | 4 per request |
| Speed Improvement | Up to 85% faster for long prompts |
| Cost Reduction | ~90% on cached tokens |
## Pricing (Claude 3.5 Sonnet)

| Type | Cost per 1M tokens |
|------|--------------------|
| Regular Input | $3.00 |
| Cache Write | $3.75 (+25%) |
| Cache Read | $0.30 (-90%) |
| Output | $15.00 |
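To see how these rates combine, here is a back-of-the-envelope sketch. The rates come from the table above; the workload numbers are hypothetical:

```typescript
// Rates from the pricing table above (USD per million tokens).
const REGULAR_INPUT = 3.00;
const CACHE_WRITE = 3.75;
const CACHE_READ = 0.30;

// Hypothetical workload: a 10,000-token context reused across 100 requests.
const contextTokens = 10_000;
const requests = 100;

// Without caching: every request pays the full input price for the context.
const withoutCache = (contextTokens / 1_000_000) * REGULAR_INPUT * requests;

// With caching: one cache write, then cache reads for the remaining requests.
const withCache =
  (contextTokens / 1_000_000) * CACHE_WRITE +
  (contextTokens / 1_000_000) * CACHE_READ * (requests - 1);

console.log(`Without caching: $${withoutCache.toFixed(2)}`); // $3.00
console.log(`With caching:    $${withCache.toFixed(2)}`);    // ~$0.33
```

The write premium pays for itself on the first cache hit: one write plus one read ($0.0405 here) already costs less than two regular passes ($0.06).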
## When to Use Caching

✅ Good Use Cases:
- Same document, multiple questions
- Consistent system prompts
- Repeated tool definitions
- Multi-turn conversations
- Batch processing with shared context

❌ Poor Use Cases:
- One-off requests
- Constantly changing content
- Very short prompts (<1,024 tokens)
- Requests spaced further apart than the cache TTL
## Common Patterns

### Pattern 1: Cache System Prompt + Tools

```typescript
const tools = [...]; // Your tools

// Add cache_control to the last tool; everything up to and
// including that tool falls under the breakpoint.
const toolsWithCache = tools.map((tool, idx) =>
  idx === tools.length - 1
    ? { ...tool, cache_control: { type: 'ephemeral' } }
    : tool
);

const response = await client.messages.create({
  model: 'claude-3-5-sonnet-20241022',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: systemPrompt,
      cache_control: { type: 'ephemeral' }
    }
  ],
  tools: toolsWithCache,
  messages: [...]
});
```
```python
def add_cache_to_tools(tools):
    # Copy before mutating so the caller's tool list is untouched.
    tools_clone = tools.copy()
    last_tool = tools_clone[-1].copy()
    last_tool["cache_control"] = {"type": "ephemeral"}
    tools_clone[-1] = last_tool
    return tools_clone

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    tools=add_cache_to_tools(tools),
    messages=[...]
)
```
### Pattern 2: Cache Document in Messages

```typescript
const messages = [
  {
    role: 'user',
    content: [
      {
        type: 'text',
        text: `Document:\n\n${document}`,
        cache_control: { type: 'ephemeral' } // Cached
      },
      {
        type: 'text',
        text: question // Not cached - changes each request
      }
    ]
  }
];
```
```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": f"Document:\n\n{document}",
                "cache_control": {"type": "ephemeral"}  # Cached
            },
            {
                "type": "text",
                "text": question  # Not cached
            }
        ]
    }
]
```
### Pattern 3: Multi-Level Caching

```typescript
// Cache hierarchy: Tools → System → Document → Question
const response = await client.messages.create({
  model: 'claude-3-5-sonnet-20241022',
  max_tokens: 1024,
  tools: addCacheToLastTool(tools), // Breakpoint 1
  system: [
    {
      type: 'text',
      text: systemPrompt,
      cache_control: { type: 'ephemeral' } // Breakpoint 2
    }
  ],
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'text',
          text: document,
          cache_control: { type: 'ephemeral' } // Breakpoint 3
        },
        {
          type: 'text',
          text: question // Not cached
        }
      ]
    }
  ]
});
```
## Reading Cache Metrics

### Response Usage Object

```json
{
  "usage": {
    "input_tokens": 750,
    "cache_creation_input_tokens": 1772, // First request
    "cache_read_input_tokens": 1772,     // Subsequent requests
    "output_tokens": 350
  }
}
```
### What Each Means

| Field | Meaning |
|-------|---------|
| `input_tokens` | Fresh tokens processed |
| `cache_creation_input_tokens` | Tokens written to cache (first request) |
| `cache_read_input_tokens` | Tokens read from cache (subsequent requests) |
| `output_tokens` | Response length |
### Calculate Cache Hit Rate

```typescript
const cacheHitRate =
  (usage.cache_read_input_tokens /
    (usage.input_tokens + usage.cache_read_input_tokens)) * 100;
console.log(`Cache hit rate: ${cacheHitRate.toFixed(1)}%`);
```
## Common Pitfalls

### ❌ Pitfall 1: Content Changed Slightly

```typescript
// Request 1
const doc = "Analyze this document.";

// Request 2
const doc = "Analyze this document!"; // Added "!" - cache miss!
```

Fix: Cache matching is exact, so keep cached content byte-for-byte identical. The sketch below shows one way to catch accidental drift.
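A minimal drift-detection sketch; the `checkCacheStability` helper and its warning behavior are illustrative, not part of the SDK:

```typescript
import { createHash } from 'node:crypto';

// Fingerprint the exact string you mark with cache_control.
const fingerprint = (text: string) =>
  createHash('sha256').update(text).digest('hex');

let lastFingerprint: string | undefined;

function checkCacheStability(cachedPrefix: string): void {
  const fp = fingerprint(cachedPrefix);
  if (lastFingerprint && fp !== lastFingerprint) {
    console.warn('Cached content changed between requests - expect a cache miss');
  }
  lastFingerprint = fp;
}
```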
### ❌ Pitfall 2: Using Shorthand Format

```typescript
// Wrong - no place for cache_control
messages: [
  { role: 'user', content: 'Hello' }
]

// Correct - longhand format
messages: [
  {
    role: 'user',
    content: [
      {
        type: 'text',
        text: 'Hello',
        cache_control: { type: 'ephemeral' }
      }
    ]
  }
]
```
### ❌ Pitfall 3: Below Token Threshold

```typescript
// Too short - won't cache (only a handful of tokens)
system: "You are helpful."

// Better - combine elements to exceed 1,024 tokens
system: [detailed_instructions] + [examples] + [guidelines]
```
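If you are unsure whether a prompt clears the threshold, count tokens before relying on caching. A sketch using the SDK's token-counting endpoint; `systemBlocks` is a placeholder for your own system prompt array, and the 1,024 minimum is the Claude 3.5 Sonnet figure:

```typescript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

// systemBlocks: the same array you would pass as `system`.
async function willPromptCache(
  systemBlocks: Anthropic.TextBlockParam[]
): Promise<boolean> {
  const count = await client.messages.countTokens({
    model: 'claude-3-5-sonnet-20241022',
    system: systemBlocks,
    messages: [{ role: 'user', content: 'ping' }],
  });
  // 1,024 tokens is the minimum cacheable prefix for this model.
  return count.input_tokens >= 1024;
}
```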
### ❌ Pitfall 4: Reordering Tools

```typescript
// Request 1
tools = [toolA, toolB, toolC]

// Request 2
tools = [toolB, toolA, toolC] // Different order - cache miss!
```

Fix: Maintain consistent tool order; one way is to sort deterministically, as sketched below.
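A defensive-ordering sketch (illustrative, not required by the API): sort tools by name before every request so their order cannot drift between call sites.

```typescript
// Sort a copy by name so every request sends tools in the same order.
// Works with any tool shape that has a `name` field.
function stableToolOrder<T extends { name: string }>(tools: T[]): T[] {
  return [...tools].sort((a, b) => a.name.localeCompare(b.name));
}

const response = await client.messages.create({
  model: 'claude-3-5-sonnet-20241022',
  max_tokens: 1024,
  tools: addCacheToLastTool(stableToolOrder(tools)),
  messages: [...],
});
```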
## Debugging Cache Issues

### Check if Cache is Working

```typescript
import Anthropic from '@anthropic-ai/sdk';

function analyzeCacheUsage(usage: Anthropic.Usage): void {
  if (usage.cache_creation_input_tokens) {
    console.log('✅ Cache created:', usage.cache_creation_input_tokens, 'tokens');
  }
  if (usage.cache_read_input_tokens) {
    console.log('✅ Cache hit:', usage.cache_read_input_tokens, 'tokens');
  }
  if (!usage.cache_creation_input_tokens && !usage.cache_read_input_tokens) {
    console.log('⚠️ No caching occurred - check:');
    console.log('  - Content is >1,024 tokens');
    console.log('  - cache_control is set');
    console.log('  - Using longhand format');
  }
}
```
```python
def analyze_cache_usage(usage):
    if usage.cache_creation_input_tokens:
        print(f"✅ Cache created: {usage.cache_creation_input_tokens} tokens")
    if usage.cache_read_input_tokens:
        print(f"✅ Cache hit: {usage.cache_read_input_tokens} tokens")
    if not usage.cache_creation_input_tokens and not usage.cache_read_input_tokens:
        print("⚠️ No caching occurred - check:")
        print("  - Content is >1,024 tokens")
        print("  - cache_control is set")
        print("  - Using longhand format")
```
## Pro Tips

- Cache Warming: Make a dummy request to populate the cache before real traffic
- Refresh Before Expiry: The TTL resets on every cache hit, so keep traffic flowing within the TTL window to keep the cache warm (see the sketch below)
- Monitor Hit Rates: Track `cache_read_input_tokens` to measure effectiveness
- Use Defensive Copying: Clone tools/prompts before adding cache control
- Combine with Redis: Two-tier caching for maximum efficiency
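A minimal cache-warming sketch; the interval and the `warmCache` request shape are assumptions to illustrate the idea, so tune them to your own TTL and traffic:

```typescript
// Re-send the cached prefix periodically so the TTL keeps resetting.
// Assumes the default 5-minute TTL; adjust REFRESH_MS for a 1-hour cache.
const REFRESH_MS = 4 * 60 * 1000; // refresh with a 1-minute safety margin

async function warmCache(): Promise<void> {
  await client.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 1, // minimal output - we only want the prefix re-cached
    system: [
      { type: 'text', text: systemPrompt, cache_control: { type: 'ephemeral' } },
    ],
    messages: [{ role: 'user', content: 'ping' }],
  });
}

setInterval(() => warmCache().catch(console.error), REFRESH_MS);
```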
## Processing Order

Claude processes components in this order:

1. Tools (if provided)
2. System prompt (if provided)
3. Messages (in order)

Place breakpoints strategically based on what changes:

- Tools rarely change → cache first
- System prompt changes daily → cache second
- Messages change per request → cache selectively
## Quick Decision Tree

```
Do you send the same content repeatedly?
├── No → Don't use caching
└── Yes
    └── Is the content >1,024 tokens?
        ├── No → Combine elements or skip caching
        └── Yes
            └── Are requests within the cache TTL?
                ├── No → Consider cache warming
                └── Yes → ✅ Use caching!
```
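The same decision expressed as code. This is a sketch: the `shouldCache` helper and its parameters mirror the tree above, with the default 5-minute TTL as an assumption.

```typescript
// Mirrors the decision tree above.
function shouldCache(opts: {
  repeatedContent: boolean;
  promptTokens: number;
  requestIntervalMs: number;
  cacheTtlMs?: number; // defaults to the 5-minute TTL
}): boolean {
  const ttl = opts.cacheTtlMs ?? 5 * 60 * 1000;
  if (!opts.repeatedContent) return false;        // No → don't cache
  if (opts.promptTokens < 1024) return false;     // Too short → combine or skip
  if (opts.requestIntervalMs > ttl) return false; // Too far apart → warm first
  return true;                                    // ✅ Use caching!
}
```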
## Checklist for Production

- [ ] System prompt uses longhand format with `cache_control`
- [ ] Tools have cache control on the last tool
- [ ] Content exceeds the 1,024-token minimum
- [ ] Cached content is identical across requests
- [ ] Monitoring cache hit rates
- [ ] Handling cache expiry gracefully
- [ ] Testing with real usage patterns
- [ ] Documented caching strategy for the team
Print this cheat sheet and keep it handy while implementing prompt caching!