Prompt Caching Quick Reference Cheat Sheet

Author: Anablock (AI Insights & Innovations)

🚀 Quick Start

Enable Caching (TypeScript)

const response = await client.messages.create({
  model: 'claude-3-5-sonnet-20241022',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: 'Your system prompt...',
      cache_control: { type: 'ephemeral' }  // ← Add this
    }
  ],
  messages: [...]
});

Enable Caching (Python)

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "Your system prompt...",
            "cache_control": {"type": "ephemeral"}  # ← Add this
        }
    ],
    messages=[...]
)

📊 Key Metrics

| Metric | Value |
|--------|-------|
| Cache Lifetime | 5 minutes by default, refreshed on each use (a 1-hour TTL is available) |
| Minimum Tokens | 1,024 tokens |
| Max Breakpoints | 4 per request |
| Speed Improvement | Up to 85% faster |
| Cost Reduction | ~90% on cached tokens |


💰 Pricing (Claude 3.5 Sonnet)

| Type | Cost per 1M tokens |
|------|--------------------|
| Regular Input | $3.00 |
| Cache Write | $3.75 (+25%) |
| Cache Read | $0.30 (-90%) |
| Output | $15.00 |
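
A cache write costs 25% more than regular input, so caching breaks even after the second read of the same prefix. A quick break-even sketch in Python (prices from the table above; the 10,000-token prompt and 100-request workload are hypothetical numbers):

```python
# $ per 1M input tokens for Claude 3.5 Sonnet (from the pricing table above)
REGULAR_INPUT = 3.00
CACHE_WRITE = 3.75
CACHE_READ = 0.30

def cost_without_cache(prompt_tokens: int, requests: int) -> float:
    """Every request pays full price for the shared prefix."""
    return requests * prompt_tokens / 1_000_000 * REGULAR_INPUT

def cost_with_cache(prompt_tokens: int, requests: int) -> float:
    """The first request writes the cache; the rest read it."""
    write = prompt_tokens / 1_000_000 * CACHE_WRITE
    reads = (requests - 1) * prompt_tokens / 1_000_000 * CACHE_READ
    return write + reads

# Hypothetical workload: a 10,000-token shared prompt, 100 requests
print(cost_without_cache(10_000, 100))           # 3.0
print(round(cost_with_cache(10_000, 100), 4))    # 0.3345
```

At 100 requests the cached version costs roughly a tenth as much; with a single request it costs 25% more, which is why one-off requests appear under poor use cases below.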


🎯 When to Use Caching

✅ Good Use Cases:

  • Same document, multiple questions
  • Consistent system prompts
  • Repeated tool definitions
  • Multi-turn conversations
  • Batch processing with shared context

āŒ Poor Use Cases:

  • One-off requests
  • Constantly changing content
  • Very short prompts (<1024 tokens)
  • Requests >1 hour apart

🔧 Common Patterns

Pattern 1: Cache System Prompt + Tools

// TypeScript
const tools = [...];  // Your tools
const toolsWithCache = tools.map((tool, idx) => 
  idx === tools.length - 1 
    ? { ...tool, cache_control: { type: 'ephemeral' } }
    : tool
);

const response = await client.messages.create({
  system: [
    {
      type: 'text',
      text: systemPrompt,
      cache_control: { type: 'ephemeral' }
    }
  ],
  tools: toolsWithCache,
  messages: [...]
});

# Python
def add_cache_to_tools(tools):
    tools_clone = tools.copy()
    last_tool = tools_clone[-1].copy()
    last_tool["cache_control"] = {"type": "ephemeral"}
    tools_clone[-1] = last_tool
    return tools_clone

response = client.messages.create(
    system=[
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    tools=add_cache_to_tools(tools),
    messages=[...]
)

Pattern 2: Cache Document in Messages

// TypeScript
const messages = [
  {
    role: 'user',
    content: [
      {
        type: 'text',
        text: `Document:\n\n${document}`,
        cache_control: { type: 'ephemeral' }  // Cached
      },
      {
        type: 'text',
        text: question  // Not cached - changes each request
      }
    ]
  }
];

# Python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": f"Document:\n\n{document}",
                "cache_control": {"type": "ephemeral"}  # Cached
            },
            {
                "type": "text",
                "text": question  # Not cached
            }
        ]
    }
]

Pattern 3: Multi-Level Caching

// Cache hierarchy: Tools → System → Document → Question
const response = await client.messages.create({
  tools: addCacheToLastTool(tools),  // Breakpoint 1
  system: [
    {
      type: 'text',
      text: systemPrompt,
      cache_control: { type: 'ephemeral' }  // Breakpoint 2
    }
  ],
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'text',
          text: document,
          cache_control: { type: 'ephemeral' }  // Breakpoint 3
        },
        {
          type: 'text',
          text: question  // Not cached
        }
      ]
    }
  ]
});

📈 Reading Cache Metrics

Response Usage Object

{
  "usage": {
    "input_tokens": 750,
    "cache_creation_input_tokens": 1772,  // First request
    "cache_read_input_tokens": 1772,      // Subsequent requests
    "output_tokens": 350
  }
}

What Each Means

| Field | Meaning |
|-------|---------|
| input_tokens | Fresh tokens processed |
| cache_creation_input_tokens | Tokens written to cache (first request) |
| cache_read_input_tokens | Tokens read from cache (subsequent requests) |
| output_tokens | Response length |

Calculate Cache Hit Rate

const cacheHitRate = 
  (usage.cache_read_input_tokens / 
   (usage.input_tokens + usage.cache_read_input_tokens)) * 100;

console.log(`Cache hit rate: ${cacheHitRate.toFixed(1)}%`);
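
The same usage fields can also be turned into a dollar estimate per request. A minimal sketch, assuming the Claude 3.5 Sonnet prices from the pricing table above; the `usage` dict here is a stand-in for the SDK's usage object:

```python
# $ per 1M tokens by usage field (Claude 3.5 Sonnet)
PRICES = {
    "input_tokens": 3.00,
    "cache_creation_input_tokens": 3.75,
    "cache_read_input_tokens": 0.30,
    "output_tokens": 15.00,
}

def request_cost(usage: dict) -> float:
    """Sum each token category times its per-million-token rate."""
    return sum(usage.get(field, 0) / 1_000_000 * rate for field, rate in PRICES.items())

# Numbers from the cache-hit example above
usage = {"input_tokens": 750, "cache_read_input_tokens": 1772, "output_tokens": 350}
print(f"${request_cost(usage):.4f}")  # $0.0080
```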

⚠️ Common Pitfalls

āŒ Pitfall 1: Content Changed Slightly

// Request 1
const doc = "Analyze this document.";

// Request 2
const doc = "Analyze this document!";  // Added "!" - cache miss!

Fix: Keep cached content identical.
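
One way to catch accidental drift is to fingerprint the exact text you mark for caching and compare fingerprints across requests. A small sketch; `content_fingerprint` is a made-up helper name:

```python
import hashlib

def content_fingerprint(text: str) -> str:
    """Hash the exact bytes sent to the API; any change at all means a cache miss."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]

doc_v1 = "Analyze this document."
doc_v2 = "Analyze this document!"  # one character changed
print(content_fingerprint(doc_v1) == content_fingerprint(doc_v2))  # False
```

Log the fingerprint alongside cache_read_input_tokens: if the fingerprint is stable but reads drop to zero, something else in the prefix (tool order, system prompt) changed.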

āŒ Pitfall 2: Using Shorthand Format

// Wrong - no place for cache_control
messages: [
  { role: 'user', content: 'Hello' }
]

// Correct - longhand format
messages: [
  {
    role: 'user',
    content: [
      {
        type: 'text',
        text: 'Hello',
        cache_control: { type: 'ephemeral' }
      }
    ]
  }
]

āŒ Pitfall 3: Below Token Threshold

// Too short - won't cache (far below the 1,024-token minimum)
system: "You are helpful."  // only a handful of tokens

// Better - combine elements to exceed 1024 tokens
system: [detailed_instructions] + [examples] + [guidelines]

āŒ Pitfall 4: Reordering Tools

// Request 1
tools = [toolA, toolB, toolC]

// Request 2
tools = [toolB, toolA, toolC]  // Different order - cache miss!

Fix: Maintain consistent tool order.
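
If tools are assembled dynamically, sorting them by name before every request guarantees a stable, byte-identical order. A sketch; `stable_tools` is a made-up helper:

```python
def stable_tools(tools: list[dict]) -> list[dict]:
    """Return tools in a deterministic order so the cached prefix never shifts."""
    return sorted(tools, key=lambda tool: tool["name"])

tool_a = {"name": "toolA"}
tool_b = {"name": "toolB"}
# Both assembly orders produce the same request prefix
print(stable_tools([tool_b, tool_a]) == stable_tools([tool_a, tool_b]))  # True
```

Sort first, then apply cache_control to the last tool, so the breakpoint also lands in the same place every time.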


🔍 Debugging Cache Issues

Check if Cache is Working

// TypeScript
function analyzeCacheUsage(usage: Anthropic.Usage): void {
  if (usage.cache_creation_input_tokens) {
    console.log('✓ Cache created:', usage.cache_creation_input_tokens, 'tokens');
  }

  if (usage.cache_read_input_tokens) {
    console.log('✓ Cache hit:', usage.cache_read_input_tokens, 'tokens');
  }

  if (!usage.cache_creation_input_tokens && !usage.cache_read_input_tokens) {
    console.log('⚠️ No caching occurred - check:');
    console.log('  - Content is >1024 tokens');
    console.log('  - cache_control is set');
    console.log('  - Using longhand format');
  }
}

# Python
def analyze_cache_usage(usage):
    if usage.cache_creation_input_tokens:
        print(f"✓ Cache created: {usage.cache_creation_input_tokens} tokens")

    if usage.cache_read_input_tokens:
        print(f"✓ Cache hit: {usage.cache_read_input_tokens} tokens")

    if not usage.cache_creation_input_tokens and not usage.cache_read_input_tokens:
        print("⚠️ No caching occurred - check:")
        print("  - Content is >1024 tokens")
        print("  - cache_control is set")
        print("  - Using longhand format")

💡 Pro Tips

  1. Cache Warming: Make a dummy request to populate cache before real traffic
  2. Refresh Before Expiry: Schedule cache refresh every 50 minutes
  3. Monitor Hit Rates: Track cache_read_input_tokens to measure effectiveness
  4. Use Defensive Copying: Clone tools/prompts before adding cache control
  5. Combine with Redis: Two-tier caching for maximum efficiency
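
Tips 1 and 2 can be combined into a keep-warm loop that re-sends the cached prefix on a timer shorter than the cache lifetime. A sketch assuming a 1-hour TTL; `send_cached_prefix` is a placeholder for your own minimal request that reuses the cached prefix:

```python
import threading

REFRESH_MINUTES = 50  # comfortably inside a 1-hour cache lifetime

def keep_cache_warm(send_cached_prefix, stop_event: threading.Event) -> None:
    """Re-issue a minimal request with the cached prefix until stopped."""
    while not stop_event.is_set():
        send_cached_prefix()  # any request that reuses the prefix refreshes the TTL
        stop_event.wait(REFRESH_MINUTES * 60)  # returns early if stop_event is set
```

Run it in a background thread (`threading.Thread(target=keep_cache_warm, args=(ping, stop))`) and set the event on shutdown.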

📚 Processing Order

Claude processes components in this order:

  1. Tools (if provided)
  2. System prompt (if provided)
  3. Messages (in order)

Place breakpoints strategically based on what changes:

  • Tools rarely change → cache first
  • System prompt changes daily → cache second
  • Messages change per request → cache selectively

🎓 Quick Decision Tree

Do you send the same content repeatedly?
├─ No → Don't use caching
└─ Yes
   ├─ Is content >1024 tokens?
   │  ├─ No → Combine elements or skip caching
   │  └─ Yes
   │     ├─ Requests within 1 hour?
   │     │  ├─ No → Consider cache warming
   │     │  └─ Yes → ✅ Use caching!
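
The tree above can be encoded as a small helper for code review or runtime checks; the function and argument names here are my own:

```python
def should_cache(repeated: bool, tokens: int, minutes_between: float) -> str:
    """Walk the decision tree above and return a recommendation."""
    if not repeated:
        return "skip caching"
    if tokens < 1024:
        return "combine elements or skip caching"
    if minutes_between > 60:
        return "consider cache warming"
    return "use caching"

print(should_cache(repeated=True, tokens=5000, minutes_between=10))  # use caching
```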

📋 Checklist for Production

  • [ ] System prompt uses longhand format with cache_control
  • [ ] Tools have cache control on last tool
  • [ ] Content exceeds 1,024 token minimum
  • [ ] Cached content is identical across requests
  • [ ] Monitoring cache hit rates
  • [ ] Handling cache expiry gracefully
  • [ ] Testing with real usage patterns
  • [ ] Documented caching strategy for team

Print this cheat sheet and keep it handy while implementing prompt caching!