Processing PDFs with Claude: From Simple Extraction to Intelligent Document Analysis

By Anablock (AI Insights & Innovations)

Introduction

What if you could send a 50-page contract, research paper, or financial report to an AI and get instant analysis, summaries, or answers to specific questions? Claude's PDF processing capabilities make this possible—and it's surprisingly straightforward to implement.

Unlike traditional PDF parsing tools that struggle with complex layouts, embedded images, or tables, Claude can understand the entire document holistically: text, images, charts, tables, and their relationships. Whether you're building a document Q&A system, automating contract review, or extracting structured data from reports, Claude's PDF capabilities provide a powerful foundation.

In this article, we'll explore how to process PDFs with Claude, the technical details you need to know, and practical patterns for common document processing tasks.

PDF Processing Basics

How It Works

Claude's PDF processing works similarly to image processing, but with a few key differences. Instead of sending an image block, you send a document block with the PDF file encoded as base64.

Here's the basic structure:

import base64

# Read and encode the PDF
with open("document.pdf", "rb") as f:
    file_bytes = base64.standard_b64encode(f.read()).decode("utf-8")

# Create message with document block
messages = []

add_user_message(
    messages,
    [
        {
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": file_bytes,
            },
        },
        {
            "type": "text",
            "text": "Summarize this document in one sentence."
        },
    ],
)

# Send to Claude
response = chat(messages)
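The add_user_message and chat functions used throughout this article are small helpers, not SDK calls. A minimal sketch of the message-building half (the chat half wraps whatever Anthropic client you use and is omitted here):

```python
def add_user_message(messages, content):
    """Append a user turn; content may be a string or a list of content blocks."""
    messages.append({"role": "user", "content": content})

# Quick check
messages = []
add_user_message(messages, [{"type": "text", "text": "Hello"}])
print(messages[0]["role"])  # prints "user"
```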

Key Differences from Image Processing

If you're already familiar with Claude's vision capabilities, adapting your code for PDFs is straightforward. Here are the changes:

| Element | Image Processing | PDF Processing |
|---------|------------------|----------------|
| File extension | .png, .jpg, .webp | .pdf |
| Variable name | image_bytes | file_bytes (convention) |
| Block type | "image" | "document" |
| Media type | "image/png", "image/jpeg" | "application/pdf" |

Technical Constraints

Before diving into implementation, understand these limitations:

File Size:

  • Max 32 MB per PDF (significantly larger than the 5MB limit for images)
  • Max 100 documents across all messages in a single request

Page Limits:

  • PDFs are converted to images internally (1 page = 1 image)
  • Each page counts toward the 100-image limit per request
  • For a 50-page PDF, you can include up to 2 PDFs per request (50 pages × 2 = 100 images)

Token Calculation: Each page is converted to an image and consumes tokens based on dimensions:

tokens_per_page ≈ (width_px × height_px) / 750

For a standard letter-size page at 150 DPI:

  • 1275×1650px ≈ 2,805 tokens per page
  • 10-page document ≈ 28,000 tokens
  • 50-page document ≈ 140,000 tokens
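The formula is easy to sanity-check in code, using the letter-size-at-150-DPI numbers above:

```python
def page_tokens(width_px, height_px):
    """Approximate token cost of one PDF page rendered at the given pixel size."""
    return (width_px * height_px) / 750

per_page = page_tokens(1275, 1650)   # letter size at 150 DPI
print(round(per_page))               # 2805 tokens per page
print(round(per_page * 50))          # 50-page document: ~140,250 tokens
```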

Supported Format:

  • Only PDF files are supported (no Word docs, PowerPoint, etc.)
  • Convert other formats to PDF first if needed
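If your inputs arrive as Word or PowerPoint files, one common approach is converting them with LibreOffice's headless mode before upload. A sketch (assumes soffice is installed and on your PATH):

```python
import subprocess
from pathlib import Path

def build_convert_cmd(path, outdir="."):
    """LibreOffice headless command to convert a document to PDF."""
    return ["soffice", "--headless", "--convert-to", "pdf", "--outdir", outdir, str(path)]

def convert_to_pdf(path, outdir="."):
    """Run the conversion and return the expected output path."""
    subprocess.run(build_convert_cmd(path, outdir), check=True)
    return str(Path(outdir) / (Path(path).stem + ".pdf"))
```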

What Claude Can Extract from PDFs

Claude's PDF processing goes far beyond simple text extraction. It can understand and analyze:

1. Text Content

  • Body text, headings, footnotes, captions
  • Maintains understanding of document flow and context
  • Handles multi-column layouts

2. Images and Charts

  • Embedded images, diagrams, and photographs
  • Charts and graphs (bar, line, pie, scatter, etc.)
  • Can describe visual content and extract data from visualizations

3. Tables

  • Structured data in tables
  • Relationships between columns and rows
  • Can convert tables to JSON, CSV, or other formats

4. Document Structure

  • Sections and subsections
  • Lists and bullet points
  • Headers and footers
  • Page numbers and references

5. Complex Layouts

  • Multi-column text
  • Sidebars and callout boxes
  • Mixed text and image layouts

This makes Claude a one-stop solution for document intelligence—no need to chain together OCR, table extraction, and NLP tools.

Citations: Making Claude's Sources Transparent

Why Citations Matter

When Claude answers questions based on documents you provide, users might assume it's just drawing from its training data. But what if Claude could show exactly where it found specific information? That's where citations come in—a powerful feature that lets Claude reference specific parts of your source documents and show users exactly where each piece of information comes from.

Imagine asking Claude about how Earth's atmosphere formed and getting a detailed answer. Without citations, users have no way to verify the information or understand that Claude is actually referencing a specific document you provided. Citations solve this transparency problem by creating a clear trail from Claude's response back to your source material.

Enabling Citations

To enable citations, you need to modify your document message structure. Add two new fields to your document block:

{
    "type": "document",
    "source": {
        "type": "base64",
        "media_type": "application/pdf",
        "data": file_bytes,
    },
    "title": "earth.pdf",              # New: readable document name
    "citations": {"enabled": True}      # New: enable citation tracking
}

The title field gives your document a readable name, while citations: {"enabled": True} tells Claude to track where it finds information.

Understanding Citation Structure

When citations are enabled, Claude's response becomes more structured. Instead of simple text, you get data that includes citation information for each claim:

{
  "content": [
    {
      "type": "text",
      "text": "Earth's atmosphere is composed of 78% nitrogen and 21% oxygen",
      "citations": [
        {
          "type": "page_location",
          "cited_text": "The atmosphere of Earth is composed of nitrogen (78%), oxygen (21%), argon (0.9%), and trace amounts of other gases.",
          "document_index": 0,
          "document_title": "earth.pdf",
          "start_page_number": 3,
          "end_page_number": 3
        }
      ]
    }
  ]
}

Each citation contains several key pieces of information:

  • cited_text - The exact text from your document that supports Claude's statement
  • document_index - Which document Claude is referencing (useful when you provide multiple documents)
  • document_title - The title you assigned to the document
  • start_page_number - Where the cited text begins
  • end_page_number - Where the cited text ends

Implementation with Citations

Here's how to implement document Q&A with citations enabled:

import base64
import os

def ask_with_citations(pdf_path, question):
    with open(pdf_path, "rb") as f:
        file_bytes = base64.standard_b64encode(f.read()).decode("utf-8")
    
    messages = []
    add_user_message(messages, [
        {
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": file_bytes,
            },
            "title": os.path.basename(pdf_path),
            "citations": {"enabled": True}
        },
        {
            "type": "text",
            "text": f"""Answer this question based on the document:
            
            {question}
            
            Provide a clear, direct answer with specific details."""
        },
    ])
    
    # Get response
    response = chat(messages)
    
    # Parse and format response with citations
    result = {
        "answer": "",
        "citations": []
    }
    
    for block in response.content:
        if block.type == "text":
            result["answer"] += block.text
            
            if hasattr(block, 'citations') and block.citations:
                for i, citation in enumerate(block.citations, 1):
                    result["citations"].append({
                        "id": i,
                        "text": citation.cited_text,
                        "document": citation.document_title,
                        "pages": f"{citation.start_page_number}-{citation.end_page_number}"
                    })
    
    return result

# Example usage
result = ask_with_citations(
    "earth.pdf",
    "How did Earth's atmosphere form?"
)

print(f"Answer: {result['answer']}")
print(f"\nCitations:")
for citation in result['citations']:
    print(f"  [{citation['id']}] {citation['document']}, pages {citation['pages']}")
    print(f"      \"{citation['text'][:100]}...\"")

Sample output:

Answer: Earth's atmosphere formed through volcanic outgassing approximately 4.5 billion years ago, releasing water vapor, carbon dioxide, and nitrogen, which were later transformed by photosynthetic organisms that produced oxygen.

Citations:
  [1] earth.pdf, pages 5-5
      "Volcanic outgassing released water vapor, carbon dioxide, nitrogen, and other gases from Earth's interior..."
  [2] earth.pdf, pages 6-6
      "The appearance of photosynthetic organisms around 2.5 billion years ago began producing oxygen..."

Parsing Citation Responses

Extract and display citations from Claude's response:

def parse_citations(response):
    """Extract text and citations from response"""
    results = []
    
    for block in response.content:
        if block.type == "text":
            text_segment = {
                "text": block.text,
                "citations": []
            }
            
            # Extract citations if present
            if hasattr(block, 'citations') and block.citations:
                for citation in block.citations:
                    text_segment["citations"].append({
                        "cited_text": citation.cited_text,
                        "document": citation.document_title,
                        "pages": f"{citation.start_page_number}-{citation.end_page_number}"
                    })
            
            results.append(text_segment)
    
    return results
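Once parsed, the segments can be rendered as plain text with numbered markers and a source list. A sketch over the dict shape parse_citations produces:

```python
def render_with_markers(segments):
    """Append [n] markers to each cited segment and list the sources."""
    body, sources, n = [], [], 0
    for seg in segments:
        text = seg["text"]
        for c in seg["citations"]:
            n += 1
            text += f" [{n}]"
            sources.append(f"[{n}] {c['document']}, pages {c['pages']}")
        body.append(text)
    out = "".join(body)
    if sources:
        out += "\n\nSources:\n" + "\n".join(sources)
    return out

segments = [{"text": "Earth is 4.54 billion years old.",
             "citations": [{"document": "earth.pdf", "pages": "2-2"}]}]
print(render_with_markers(segments))
```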

Building User Interfaces with Citations

The real power of citations comes from building user interfaces that make this information accessible. You can create interactive elements where users can hover over citation markers to see exactly where information came from.

Example UI pattern:

<div class="answer">
  <p>
    Earth's atmosphere is composed of 78% nitrogen and 21% oxygen
    <sup class="citation" data-citation-id="1">[1]</sup>
  </p>
</div>

<div class="citation-tooltip" id="citation-1">
  <strong>Source:</strong> earth.pdf, page 3<br>
  <strong>Cited text:</strong> "The atmosphere of Earth is composed of nitrogen (78%), oxygen (21%), argon (0.9%), and trace amounts of other gases."
</div>

This creates a transparent experience where users can:

✅ See that Claude's answers are grounded in actual source material
✅ Verify the information by checking the original document
✅ Understand the context around each cited piece of information
✅ Navigate directly to the source page in the PDF

Citations with Plain Text

Citations aren't limited to PDF documents. You can also use them with plain text sources. When working with text, modify your document structure like this:

{
    "type": "document",
    "source": {
        "type": "text",
        "media_type": "text/plain",
        "data": article_text,
    },
    "title": "earth_article",
    "citations": {"enabled": True}
}

With plain text sources, instead of page numbers, you'll get character positions that pinpoint exactly where in the text Claude found each piece of information:

{
  "type": "char_location",
  "cited_text": "Earth formed approximately 4.54 billion years ago",
  "document_index": 0,
  "document_title": "earth_article",
  "start_char_index": 1247,
  "end_char_index": 1296
}
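Character offsets make verification trivial: slicing the source string with a citation's start/end offsets recovers the exact cited span (the string and offsets below are illustrative, with the end index treated as exclusive):

```python
def cited_span(text, start, end):
    """Return the exact substring a citation points at (end exclusive)."""
    return text[start:end]

article = "The planet is old. Earth formed approximately 4.54 billion years ago."
print(cited_span(article, 19, 69))  # → Earth formed approximately 4.54 billion years ago.
```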

When to Use Citations

Citations are particularly valuable when:

  • Users need to verify information for accuracy or compliance
  • Working with authoritative documents that users should be able to reference
  • Transparency is critical for your application (legal, medical, financial domains)
  • Users might want to explore the broader context around specific facts
  • Building trust in AI-generated responses is important
  • Regulatory requirements mandate source attribution

Best Practices for Citations

1. Always provide meaningful document titles:

# Bad
"title": "doc1.pdf"

# Good
"title": "Q4_2023_Financial_Report.pdf"

2. Display citations in a user-friendly format:

  • Use superscript numbers [1], [2] in the text
  • Provide hover tooltips with cited text
  • Link to the specific page in the PDF viewer
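A small helper can generate the superscript markup from an answer segment (illustrative markup only; the class and attribute names are an example, not a requirement):

```python
import html

def citation_sup(text, citation_id):
    """Wrap answer text in a paragraph with a superscript citation marker."""
    return (f"<p>{html.escape(text)} "
            f'<sup class="citation" data-citation-id="{citation_id}">[{citation_id}]</sup></p>')

print(citation_sup("Earth's atmosphere is 78% nitrogen", 1))
```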

3. Handle missing citations gracefully:

if not result['citations']:
    logger.info("No citations found - Claude may be using general knowledge")
    # Optionally warn user that answer isn't grounded in provided document

4. Validate citation integrity:

def validate_citations(response, expected_document_title):
    """Ensure citations reference the correct document"""
    for block in response.content:
        if hasattr(block, 'citations'):
            for citation in block.citations:
                if citation.document_title != expected_document_title:
                    logger.warning(
                        f"Unexpected citation source: {citation.document_title}"
                    )

5. Store citations for audit trails:

from datetime import datetime

def log_citation_audit(user_id, question, answer, citations):
    """Log Q&A with citations for compliance/audit purposes"""
    audit_record = {
        "timestamp": datetime.now().isoformat(),
        "user_id": user_id,
        "question": question,
        "answer": answer,
        "citations": citations,
        "source_documents": [c["document"] for c in citations]
    }
    
    # Store in audit log database
    save_audit_record(audit_record)

Common Use Cases and Patterns

1. Document Summarization

Task: Generate executive summaries of long documents

Pattern:

def summarize_pdf(pdf_path, length="one paragraph"):
    with open(pdf_path, "rb") as f:
        file_bytes = base64.standard_b64encode(f.read()).decode("utf-8")
    
    messages = []
    add_user_message(messages, [
        {
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": file_bytes,
            },
        },
        {
            "type": "text",
            "text": f"""Provide a {length} summary of this document covering:
            
            1. Main topic and purpose
            2. Key findings or arguments
            3. Important conclusions or recommendations
            
            Focus on the most critical information a decision-maker needs to know."""
        },
    ])
    
    return chat(messages)

Example:

Input: 50-page research paper on climate change
Output: "This paper analyzes global temperature trends from 1950-2020, finding that average temperatures have risen 1.2°C with accelerating rates in the past two decades, and recommends immediate policy interventions to limit warming to 1.5°C by 2050."

2. Question Answering

Task: Answer specific questions about document content

Pattern:

def answer_question_from_pdf(pdf_path, question):
    with open(pdf_path, "rb") as f:
        file_bytes = base64.standard_b64encode(f.read()).decode("utf-8")
    
    messages = []
    add_user_message(messages, [
        {
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": file_bytes,
            },
        },
        {
            "type": "text",
            "text": f"""Answer this question based on the document:

            {question}
            
            Provide a direct answer with specific references to relevant sections or page numbers if mentioned in the document."""
        },
    ])
    
    return chat(messages)

Example:

Question: "What is the recommended dosage for patients over 65?"
Answer: "For patients over 65, the recommended dosage is 50mg once daily, as stated in Section 4.2 (Dosage and Administration). The document notes this is half the standard adult dose due to reduced renal clearance in elderly patients."

3. Data Extraction

Task: Extract structured data from tables, forms, or reports

Pattern:

def extract_invoice_data(pdf_path):
    with open(pdf_path, "rb") as f:
        file_bytes = base64.standard_b64encode(f.read()).decode("utf-8")
    
    messages = []
    add_user_message(messages, [
        {
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": file_bytes,
            },
        },
        {
            "type": "text",
            "text": """Extract the following information from this invoice:

            1. Invoice number and date
            2. Vendor name and address
            3. Customer name and address
            4. Line items (description, quantity, unit price, total)
            5. Subtotal, tax, and total amount
            6. Payment terms and due date
            
            Format your response as JSON with these exact keys:
            {
              "invoice_number": "",
              "invoice_date": "",
              "vendor": {"name": "", "address": ""},
              "customer": {"name": "", "address": ""},
              "line_items": [{"description": "", "quantity": 0, "unit_price": 0, "total": 0}],
              "subtotal": 0,
              "tax": 0,
              "total": 0,
              "payment_terms": "",
              "due_date": ""
            }
            
            If any field is not present, use null."""
        },
    ])
    
    return chat(messages)

4. Contract Analysis

Task: Review contracts for key terms, risks, and obligations

Pattern:

def analyze_contract(pdf_path):
    with open(pdf_path, "rb") as f:
        file_bytes = base64.standard_b64encode(f.read()).decode("utf-8")
    
    messages = []
    add_user_message(messages, [
        {
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": file_bytes,
            },
        },
        {
            "type": "text",
            "text": """Analyze this contract and provide a structured review:

            1. PARTIES: Identify all parties to the agreement
            
            2. KEY TERMS:
               - Contract duration and renewal terms
               - Payment terms and amounts
               - Deliverables and deadlines
               - Termination conditions
            
            3. OBLIGATIONS:
               - List obligations for each party
               - Note any performance guarantees or SLAs
            
            4. RISK FACTORS:
               - Identify liability clauses
               - Note indemnification requirements
               - Flag any unusual or concerning terms
               - Highlight missing standard protections
            
            5. RECOMMENDATIONS:
               - Suggest areas for negotiation
               - Note any red flags requiring legal review
            
            Format each section clearly with bullet points."""
        },
    ])
    
    return chat(messages)

5. Compliance Checking

Task: Verify documents meet regulatory or policy requirements

Pattern:

def check_compliance(pdf_path, requirements):
    with open(pdf_path, "rb") as f:
        file_bytes = base64.standard_b64encode(f.read()).decode("utf-8")
    
    messages = []
    add_user_message(messages, [
        {
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": file_bytes,
            },
        },
        {
            "type": "text",
            "text": f"""Check if this document meets the following requirements:

            {requirements}
            
            For each requirement:
            1. State whether it is MET, PARTIALLY MET, or NOT MET
            2. Provide specific evidence from the document
            3. If not met, explain what is missing
            
            Provide a final compliance score (percentage of requirements met)."""
        },
    ])
    
    return chat(messages)

6. Document Comparison

Task: Compare two versions of a document or two related documents

Pattern:

def compare_documents(pdf_path_1, pdf_path_2):
    with open(pdf_path_1, "rb") as f:
        file_bytes_1 = base64.standard_b64encode(f.read()).decode("utf-8")
    
    with open(pdf_path_2, "rb") as f:
        file_bytes_2 = base64.standard_b64encode(f.read()).decode("utf-8")
    
    messages = []
    add_user_message(messages, [
        {
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": file_bytes_1,
            },
        },
        {
            "type": "text",
            "text": "This is the first document (original version)."
        },
        {
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": file_bytes_2,
            },
        },
        {
            "type": "text",
            "text": """This is the second document (revised version).

            Compare these two documents and identify:
            
            1. ADDITIONS: New content in the second document
            2. DELETIONS: Content removed from the first document
            3. MODIFICATIONS: Content that changed between versions
            4. IMPACT ASSESSMENT: Which changes are substantive vs. cosmetic
            
            Focus on meaningful differences, not minor formatting changes."""
        },
    ])
    
    return chat(messages)

Real-World Example: Research Paper Analysis

Let's walk through a complete example: analyzing a Wikipedia article about Earth saved as a PDF.

The Task

Process a multi-page PDF containing a comprehensive article about Earth and extract key information.

Implementation

import base64

def analyze_earth_article(pdf_path):
    # Read and encode PDF
    with open(pdf_path, "rb") as f:
        file_bytes = base64.standard_b64encode(f.read()).decode("utf-8")
    
    # Create message
    messages = []
    add_user_message(messages, [
        {
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": file_bytes,
            },
        },
        {
            "type": "text",
            "text": """Analyze this article about Earth and provide:

            1. A one-sentence summary
            2. The main topics covered (list)
            3. Any notable statistics or data points
            4. Key images or diagrams mentioned
            
            Keep your response concise and factual."""
        },
    ])
    
    # Get response
    response = chat(messages)
    return response

Sample Output

1. SUMMARY:
This article provides a comprehensive overview of Earth, covering its formation, structure, atmosphere, climate, and role as humanity's home planet.

2. MAIN TOPICS:
- Formation and geological history (4.5 billion years ago)
- Internal structure (core, mantle, crust)
- Atmosphere composition and layers
- Hydrosphere (oceans, water cycle)
- Biosphere and biodiversity
- Climate systems and patterns
- Human impact and environmental challenges

3. NOTABLE STATISTICS:
- Earth's radius: 6,371 km
- Surface area: 510 million km²
- 71% covered by water
- Atmosphere: 78% nitrogen, 21% oxygen
- Age: 4.54 billion years
- Only known planet with life

4. KEY VISUALS:
- Cross-section diagram showing Earth's layers
- Map of tectonic plates
- Atmospheric composition chart
- Water cycle illustration

This demonstrates Claude's ability to understand complex document structure, extract specific information, and present it in a structured format.

Best Practices for Production Systems

1. Optimize File Size

PDFs can be large. Compress them before sending to stay under the 32 MB limit and reduce upload latency:

from PyPDF2 import PdfReader, PdfWriter
import os

def compress_pdf(input_path, output_path):
    """Compress a PDF by recompressing its content streams"""
    reader = PdfReader(input_path)
    writer = PdfWriter()
    
    for page in reader.pages:
        # Recompress the page's content streams (lossless; does not re-encode images)
        page.compress_content_streams()
        writer.add_page(page)
    
    with open(output_path, "wb") as f:
        writer.write(f)
    
    # Report the size reduction
    original_size = os.path.getsize(input_path) / (1024 * 1024)  # MB
    compressed_size = os.path.getsize(output_path) / (1024 * 1024)  # MB
    
    print(f"Original: {original_size:.2f}MB → Compressed: {compressed_size:.2f}MB")
    print(f"Reduction: {((original_size - compressed_size) / original_size * 100):.1f}%")

2. Handle Large Documents

For PDFs exceeding 100 pages, split them into chunks:

from PyPDF2 import PdfReader, PdfWriter
import base64
import io

def split_pdf(input_path, pages_per_chunk=50):
    """Split large PDF into smaller chunks"""
    reader = PdfReader(input_path)
    total_pages = len(reader.pages)
    
    chunks = []
    for i in range(0, total_pages, pages_per_chunk):
        writer = PdfWriter()
        
        # Add pages to chunk
        for page_num in range(i, min(i + pages_per_chunk, total_pages)):
            writer.add_page(reader.pages[page_num])
        
        # Save chunk to bytes
        chunk_bytes = io.BytesIO()
        writer.write(chunk_bytes)
        chunk_bytes.seek(0)
        
        chunks.append({
            "pages": f"{i+1}-{min(i + pages_per_chunk, total_pages)}",
            "data": base64.standard_b64encode(chunk_bytes.read()).decode("utf-8")
        })
    
    return chunks

# Process each chunk separately
def process_large_pdf(pdf_path, question):
    chunks = split_pdf(pdf_path, pages_per_chunk=50)
    
    results = []
    for chunk in chunks:
        messages = []
        add_user_message(messages, [
            {
                "type": "document",
                "source": {
                    "type": "base64",
                    "media_type": "application/pdf",
                    "data": chunk["data"],
                },
            },
            {
                "type": "text",
                "text": f"Pages {chunk['pages']}: {question}"
            },
        ])
        
        results.append(chat(messages))
    
    # Combine results
    return combine_chunk_results(results)
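The combine_chunk_results step is application-specific; the simplest version just concatenates each chunk's answer text. A sketch, assuming responses shaped like the SDK's, with the answer in content[0].text:

```python
from types import SimpleNamespace

def combine_chunk_results(responses):
    """Naive merge: join the text of each chunk's first content block."""
    return "\n\n".join(r.content[0].text for r in responses)

# Quick check with stand-in response objects
fake = SimpleNamespace(content=[SimpleNamespace(text="Answer for pages 1-50")])
print(combine_chunk_results([fake, fake]))
```

For question answering you may instead want a second Claude call that synthesizes the per-chunk answers into one.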

3. Cache Processed Documents

If you're querying the same document multiple times, cache the encoded PDF:

import hashlib
from functools import lru_cache

pdf_cache = {}  # simple in-memory result cache

@lru_cache(maxsize=100)
def get_pdf_hash(pdf_path):
    """Generate hash of PDF for cache key"""
    with open(pdf_path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def process_pdf_cached(pdf_path, query):
    cache_key = f"{get_pdf_hash(pdf_path)}_{hash(query)}"
    
    # Check cache
    if cache_key in pdf_cache:
        return pdf_cache[cache_key]
    
    # Process PDF
    result = process_pdf(pdf_path, query)
    
    # Store in cache
    pdf_cache[cache_key] = result
    return result

4. Validate PDF Files

Check PDFs before sending to avoid errors:

from PyPDF2 import PdfReader
import os

def validate_pdf(pdf_path):
    """Validate PDF before processing"""
    errors = []
    
    # Check file exists
    if not os.path.exists(pdf_path):
        errors.append(f"File not found: {pdf_path}")
        return False, errors
    
    # Check file size (max 32MB)
    file_size = os.path.getsize(pdf_path) / (1024 * 1024)  # MB
    if file_size > 32:
        errors.append(f"File too large: {file_size:.2f}MB (max 32MB)")
    
    # Check if valid PDF
    try:
        reader = PdfReader(pdf_path)
        page_count = len(reader.pages)
        
        # Check page count (max 100 pages per request)
        if page_count > 100:
            errors.append(f"Too many pages: {page_count} (max 100 per request)")
        
        # Check if encrypted
        if reader.is_encrypted:
            errors.append("PDF is encrypted - decrypt before processing")
        
    except Exception as e:
        errors.append(f"Invalid PDF: {str(e)}")
    
    return len(errors) == 0, errors

# Use in your processing pipeline
def safe_process_pdf(pdf_path, query):
    valid, errors = validate_pdf(pdf_path)
    
    if not valid:
        return {"error": "PDF validation failed", "details": errors}
    
    return process_pdf(pdf_path, query)

5. Monitor Token Usage

Track token consumption to optimize costs:

from PyPDF2 import PdfReader

def estimate_pdf_tokens(pdf_path):
    """Estimate token cost for processing a PDF"""
    reader = PdfReader(pdf_path)
    page_count = len(reader.pages)
    
    # Rough estimate: 2,800 tokens per page (standard letter size at 150 DPI)
    tokens_per_page = 2800
    estimated_tokens = page_count * tokens_per_page
    
    return {
        "pages": page_count,
        "estimated_tokens": estimated_tokens,
        "estimated_cost_usd": estimated_tokens * 0.000003  # Example rate
    }

# Log before processing
def process_pdf_with_logging(pdf_path, query):
    estimate = estimate_pdf_tokens(pdf_path)
    
    logger.info(f"Processing PDF: {pdf_path}")
    logger.info(f"Pages: {estimate['pages']}")
    logger.info(f"Estimated tokens: {estimate['estimated_tokens']:,}")
    logger.info(f"Estimated cost: ${estimate['estimated_cost_usd']:.4f}")
    
    # Alert if unusually expensive
    if estimate['estimated_tokens'] > 100000:
        logger.warning(f"High token usage: {estimate['estimated_tokens']:,} tokens")
    
    return process_pdf(pdf_path, query)

Common Pitfalls and Solutions

Pitfall 1: Sending Encrypted PDFs

Problem: Claude cannot process password-protected or encrypted PDFs.

Solution: Decrypt PDFs before processing:

from PyPDF2 import PdfReader, PdfWriter

def decrypt_pdf(input_path, output_path, password):
    reader = PdfReader(input_path)
    
    if reader.is_encrypted:
        reader.decrypt(password)
    
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    
    with open(output_path, "wb") as f:
        writer.write(f)

Pitfall 2: Expecting Perfect OCR on Scanned Documents

Problem: Scanned PDFs (images of text) may have lower accuracy than native text PDFs.

Solution: For critical applications, pre-process scanned PDFs with dedicated OCR:

import io

import pytesseract
from pdf2image import convert_from_path
from PyPDF2 import PdfReader, PdfWriter

def ocr_scanned_pdf(input_path, output_path):
    """Convert scanned PDF to searchable PDF"""
    # Convert PDF pages to images
    images = convert_from_path(input_path)
    
    # OCR each page into a one-page searchable PDF, then merge
    writer = PdfWriter()
    for image in images:
        page_pdf = pytesseract.image_to_pdf_or_hocr(image, extension='pdf')
        writer.append(PdfReader(io.BytesIO(page_pdf)))
    
    # Save searchable PDF
    with open(output_path, "wb") as f:
        writer.write(f)

Pitfall 3: Not Handling Multi-Document Requests

Problem: Sending multiple large PDFs in one request exceeds the 100-page limit.

Solution: Calculate total pages before sending:

def process_multiple_pdfs(pdf_paths, query):
    # Calculate total pages
    total_pages = sum(len(PdfReader(path).pages) for path in pdf_paths)
    
    if total_pages > 100:
        return {
            "error": "Too many pages",
            "total_pages": total_pages,
            "max_pages": 100,
            "suggestion": "Process PDFs separately or reduce page count"
        }
    
    # Process all PDFs together
    # ...

Pitfall 4: Vague Queries

Problem: Asking "What's in this document?" produces generic summaries.

Solution: Be specific about what you need:

# Bad
query = "Summarize this document"

# Good
query = """Summarize this contract focusing on:
1. Payment terms and amounts
2. Termination conditions
3. Liability limitations
4. Any unusual or non-standard clauses"""

Pitfall 5: Not Validating Structured Outputs

Problem: Expecting JSON but getting prose, or malformed JSON.

Solution: Validate and retry if needed:

import json

def extract_structured_data(pdf_path, schema):
    response = process_pdf(pdf_path, f"Extract data as JSON: {schema}")
    
    try:
        data = json.loads(response)
        # Validate against schema
        validate_schema(data, schema)
        return data
    except json.JSONDecodeError:
        # Retry with more explicit instructions
        response = process_pdf(
            pdf_path,
            f"Extract data as valid JSON only (no markdown, no explanation): {schema}"
        )
        return json.loads(response)
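Even with explicit instructions, models sometimes wrap JSON in markdown fences; stripping them defensively before parsing can avoid a retry. A small helper sketch:

```python
import json
import re

def parse_json_response(text):
    """Strip optional ```json fences, then parse."""
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", text.strip())
    return json.loads(cleaned)

print(parse_json_response('```json\n{"total": 42}\n```'))  # {'total': 42}
```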

Conclusion

Claude's PDF processing capabilities transform document analysis from a multi-tool pipeline into a single, intelligent API call. Whether you're summarizing reports, extracting data, reviewing contracts, or answering questions, Claude can understand the full context of your documents—text, images, tables, and structure.

By implementing citations, you transform Claude from a "black box" that provides answers into a transparent research assistant that shows its work. This builds user trust and enables them to dive deeper into your source materials when needed.

Key Takeaways

  • Simple implementation - Nearly identical to image processing with document blocks
  • Comprehensive understanding - Handles text, images, tables, and complex layouts
  • Flexible queries - From simple summaries to complex structured extraction
  • Production-ready - With proper validation, caching, and error handling
  • Cost-effective - ~2,800 tokens per page for standard documents
  • Transparent with citations - Show users exactly where information comes from

Best Practices Recap

  1. Validate PDFs before processing (size, encryption, page count)
  2. Compress large files to stay under size limits and cut upload latency
  3. Split documents over 100 pages into chunks
  4. Cache results for repeated queries on the same document
  5. Be specific in your queries for better results
  6. Validate outputs especially for structured data extraction
  7. Monitor costs by tracking token usage per document
  8. Enable citations when transparency and verifiability matter

When to Use PDF Processing

Use Claude's PDF processing when:

  • You need to understand document context, not just extract text
  • Documents contain mixed content (text + images + tables)
  • You want to ask questions or perform analysis, not just parse
  • You need flexibility in output format (summaries, JSON, Q&A, etc.)
  • Transparency and source attribution are important

Use traditional tools when:

  • You only need raw text extraction (use PyPDF2, pdfplumber)
  • You're processing thousands of documents in batch (cost considerations)
  • You need pixel-perfect layout preservation
  • Documents are extremely large (500+ pages)

Claude's PDF processing shines when you need intelligence, not just extraction. It understands context, relationships, and meaning—turning static documents into queryable knowledge with verifiable sources.


Ready to process PDFs with Claude? Start with a simple document, ask a specific question, and iterate based on results. Enable citations to build trust and transparency. The same prompt engineering principles that work for text and images apply to documents—be specific, provide structure, and test thoroughly.