Processing PDFs with Claude: From Simple Extraction to Intelligent Document Analysis
By Anablock, AI Insights & Innovations
Introduction
What if you could send a 50-page contract, research paper, or financial report to an AI and get instant analysis, summaries, or answers to specific questions? Claude's PDF processing capabilities make this possible—and it's surprisingly straightforward to implement.
Unlike traditional PDF parsing tools that struggle with complex layouts, embedded images, or tables, Claude can understand the entire document holistically: text, images, charts, tables, and their relationships. Whether you're building a document Q&A system, automating contract review, or extracting structured data from reports, Claude's PDF capabilities provide a powerful foundation.
In this article, we'll explore how to process PDFs with Claude, the technical details you need to know, and practical patterns for common document processing tasks.
PDF Processing Basics
How It Works
Claude's PDF processing works similarly to image processing, but with a few key differences. Instead of sending an image block, you send a document block with the PDF file encoded as base64.
Here's the basic structure:
import base64

# Read and encode the PDF
with open("document.pdf", "rb") as f:
    file_bytes = base64.standard_b64encode(f.read()).decode("utf-8")

# Create message with a document block
# (add_user_message and chat are thin helper wrappers around the Messages API)
messages = []
add_user_message(
    messages,
    [
        {
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": file_bytes,
            },
        },
        {
            "type": "text",
            "text": "Summarize this document in one sentence.",
        },
    ],
)

# Send to Claude
response = chat(messages)
Key Differences from Image Processing
If you're already familiar with Claude's vision capabilities, adapting your code for PDFs is straightforward. Here are the changes:
| Element | Image Processing | PDF Processing |
|---------|------------------|----------------|
| File extension | .png, .jpg, .webp | .pdf |
| Variable name | image_bytes | file_bytes (convention) |
| Block type | "image" | "document" |
| Media type | "image/png", "image/jpeg" | "application/pdf" |
Technical Constraints
Before diving into implementation, understand these limitations:
File Size:
- Max 32 MB per PDF (significantly larger than the 5MB limit for images)
- Max 100 documents across all messages in a single request
Page Limits:
- PDFs are converted to images internally (1 page = 1 image)
- Each page counts toward the 100-image limit per request
- For a 50-page PDF, you can include up to 2 PDFs per request (50 pages × 2 = 100 images)
Token Calculation: Each page is converted to an image and consumes tokens based on dimensions:
tokens_per_page ≈ (width_px × height_px) / 750
For a standard letter-size page at 150 DPI:
- 1275×1650px ≈ 2,805 tokens per page
- 10-page document ≈ 28,000 tokens
- 50-page document ≈ 140,000 tokens
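The arithmetic above is easy to automate. Here is a small sketch of the rule of thumb, using the 1275×1650px letter-size-at-150-DPI figures from the estimates above (treat the result as an estimate, not an exact token count):

```python
def estimate_page_tokens(width_px: int, height_px: int) -> int:
    """Rough per-page token estimate: (width_px x height_px) / 750."""
    return (width_px * height_px) // 750

def estimate_document_tokens(pages: int, width_px: int = 1275, height_px: int = 1650) -> int:
    """Estimate tokens for a whole PDF, defaulting to letter size at 150 DPI."""
    return pages * estimate_page_tokens(width_px, height_px)

print(estimate_page_tokens(1275, 1650))   # → 2805 tokens per page
print(estimate_document_tokens(10))       # → 28050 tokens
```

Actual usage depends on how each page is rendered internally, so check the `usage` field of real responses before relying on these numbers for billing.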
Supported Format:
- Only PDF files are supported (no Word docs, PowerPoint, etc.)
- Convert other formats to PDF first if needed
What Claude Can Extract from PDFs
Claude's PDF processing goes far beyond simple text extraction. It can understand and analyze:
1. Text Content
- Body text, headings, footnotes, captions
- Maintains understanding of document flow and context
- Handles multi-column layouts
2. Images and Charts
- Embedded images, diagrams, and photographs
- Charts and graphs (bar, line, pie, scatter, etc.)
- Can describe visual content and extract data from visualizations
3. Tables
- Structured data in tables
- Relationships between columns and rows
- Can convert tables to JSON, CSV, or other formats
4. Document Structure
- Sections and subsections
- Lists and bullet points
- Headers and footers
- Page numbers and references
5. Complex Layouts
- Multi-column text
- Sidebars and callout boxes
- Mixed text and image layouts
This makes Claude a one-stop solution for document intelligence—no need to chain together OCR, table extraction, and NLP tools.
Citations: Making Claude's Sources Transparent
Why Citations Matter
When Claude answers questions based on documents you provide, users might assume it's just drawing from its training data. But what if Claude could show exactly where it found specific information? That's where citations come in—a powerful feature that lets Claude reference specific parts of your source documents and show users exactly where each piece of information comes from.
Imagine asking Claude about how Earth's atmosphere formed and getting a detailed answer. Without citations, users have no way to verify the information or understand that Claude is actually referencing a specific document you provided. Citations solve this transparency problem by creating a clear trail from Claude's response back to your source material.
Enabling Citations
To enable citations, you need to modify your document message structure. Add two new fields to your document block:
{
    "type": "document",
    "source": {
        "type": "base64",
        "media_type": "application/pdf",
        "data": file_bytes,
    },
    "title": "earth.pdf",            # New: readable document name
    "citations": {"enabled": True},  # New: enable citation tracking
}
The title field gives your document a readable name, while citations: {"enabled": True} tells Claude to track where it finds information.
Understanding Citation Structure
When citations are enabled, Claude's response becomes more structured. Instead of simple text, you get data that includes citation information for each claim:
{
    "content": [
        {
            "type": "text",
            "text": "Earth's atmosphere is composed of 78% nitrogen and 21% oxygen",
            "citations": [
                {
                    "type": "document_citation",
                    "cited_text": "The atmosphere of Earth is composed of nitrogen (78%), oxygen (21%), argon (0.9%), and trace amounts of other gases.",
                    "document_index": 0,
                    "document_title": "earth.pdf",
                    "start_page_number": 3,
                    "end_page_number": 3
                }
            ]
        }
    ]
}
Each citation contains several key pieces of information:
- cited_text - The exact text from your document that supports Claude's statement
- document_index - Which document Claude is referencing (useful when you provide multiple documents)
- document_title - The title you assigned to the document
- start_page_number - Where the cited text begins
- end_page_number - Where the cited text ends
Implementation with Citations
Here's how to implement document Q&A with citations enabled:
import base64
import os

def ask_with_citations(pdf_path, question):
    with open(pdf_path, "rb") as f:
        file_bytes = base64.standard_b64encode(f.read()).decode("utf-8")

    messages = []
    add_user_message(messages, [
        {
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": file_bytes,
            },
            "title": os.path.basename(pdf_path),
            "citations": {"enabled": True},
        },
        {
            "type": "text",
            "text": f"""Answer this question based on the document:

{question}

Provide a clear, direct answer with specific details.""",
        },
    ])

    # Get response
    response = chat(messages)

    # Parse and format response with citations
    result = {"answer": "", "citations": []}
    for block in response.content:
        if block.type == "text":
            result["answer"] += block.text
            if hasattr(block, "citations") and block.citations:
                for i, citation in enumerate(block.citations, 1):
                    result["citations"].append({
                        "id": i,
                        "text": citation.cited_text,
                        "document": citation.document_title,
                        "pages": f"{citation.start_page_number}-{citation.end_page_number}",
                    })
    return result

# Example usage
result = ask_with_citations(
    "earth.pdf",
    "How did Earth's atmosphere form?"
)

print(f"Answer: {result['answer']}")
print("\nCitations:")
for citation in result["citations"]:
    print(f"  [{citation['id']}] {citation['document']}, pages {citation['pages']}")
    print(f"      \"{citation['text'][:100]}...\"")
Sample output:
Answer: Earth's atmosphere formed through volcanic outgassing approximately 4.5 billion years ago, releasing water vapor, carbon dioxide, and nitrogen, which were later transformed by photosynthetic organisms that produced oxygen.
Citations:
[1] earth.pdf, pages 5-5
"Volcanic outgassing released water vapor, carbon dioxide, nitrogen, and other gases from Earth's interior..."
[2] earth.pdf, pages 6-6
"The appearance of photosynthetic organisms around 2.5 billion years ago began producing oxygen..."
Parsing Citation Responses
Extract and display citations from Claude's response:
def parse_citations(response):
    """Extract text and citations from response"""
    results = []
    for block in response.content:
        if block.type == "text":
            text_segment = {
                "text": block.text,
                "citations": []
            }
            # Extract citations if present
            if hasattr(block, 'citations') and block.citations:
                for citation in block.citations:
                    text_segment["citations"].append({
                        "cited_text": citation.cited_text,
                        "document": citation.document_title,
                        "pages": f"{citation.start_page_number}-{citation.end_page_number}"
                    })
            results.append(text_segment)
    return results
Building User Interfaces with Citations
The real power of citations comes from building user interfaces that make this information accessible. You can create interactive elements where users can hover over citation markers to see exactly where information came from.
Example UI pattern:
<div class="answer">
<p>
Earth's atmosphere is composed of 78% nitrogen and 21% oxygen
<sup class="citation" data-citation-id="1">[1]</sup>
</p>
</div>
<div class="citation-tooltip" id="citation-1">
<strong>Source:</strong> earth.pdf, page 3<br>
<strong>Cited text:</strong> "The atmosphere of Earth is composed of nitrogen (78%), oxygen (21%), argon (0.9%), and trace amounts of other gases."
</div>
This creates a transparent experience where users can:
✅ See that Claude's answers are grounded in actual source material
✅ Verify the information by checking the original document
✅ Understand the context around each cited piece of information
✅ Navigate directly to the source page in the PDF
Citations with Plain Text
Citations aren't limited to PDF documents. You can also use them with plain text sources. When working with text, modify your document structure like this:
{
    "type": "document",
    "source": {
        "type": "text",
        "media_type": "text/plain",
        "data": article_text,
    },
    "title": "earth_article",
    "citations": {"enabled": True}
}
With plain text sources, instead of page numbers, you'll get character positions that pinpoint exactly where in the text Claude found each piece of information:
{
    "type": "document_citation",
    "cited_text": "Earth formed approximately 4.54 billion years ago",
    "document_index": 0,
    "document_title": "earth_article",
    "start_char": 1247,
    "end_char": 1295
}
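Character offsets make plain-text citations easy to verify programmatically: the cited text should equal the slice of the source at those positions. A minimal sketch, using the field names from the sample above (the article text and offsets here are a toy example, not real API output):

```python
# Check a plain-text citation against its source document.
def citation_matches_source(source: str, citation: dict) -> bool:
    return source[citation["start_char"]:citation["end_char"]] == citation["cited_text"]

article_text = "The Moon is Earth's only natural satellite."
citation = {
    "cited_text": "Earth's only natural satellite",
    "start_char": 12,
    "end_char": 42,
}

print(citation_matches_source(article_text, citation))  # → True
```

A check like this is a cheap guard against off-by-one bugs in your own citation-rendering code.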
When to Use Citations
Citations are particularly valuable when:
✅ Users need to verify information for accuracy or compliance
✅ Working with authoritative documents that users should be able to reference
✅ Transparency is critical for your application (legal, medical, financial domains)
✅ Users might want to explore the broader context around specific facts
✅ Building trust in AI-generated responses is important
✅ Regulatory requirements mandate source attribution
Best Practices for Citations
1. Always provide meaningful document titles:
# Bad
"title": "doc1.pdf"
# Good
"title": "Q4_2023_Financial_Report.pdf"
2. Display citations in a user-friendly format:
- Use superscript numbers [1], [2] in the text
- Provide hover tooltips with cited text
- Link to the specific page in the PDF viewer
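A minimal sketch of that display format, assuming the `{"answer": ..., "citations": [...]}` dictionary shape built by the parsing code earlier (the sample data below is invented for illustration; real UIs would place markers per sentence rather than at the end of the answer):

```python
# Render an answer plus its citations as text with [n] markers and a source list.
def render_with_citations(result: dict) -> str:
    markers = "".join(f"[{c['id']}]" for c in result["citations"])
    lines = [result["answer"] + (f" {markers}" if markers else ""), ""]
    for c in result["citations"]:
        lines.append(f"[{c['id']}] {c['document']}, pages {c['pages']}: \"{c['text']}\"")
    return "\n".join(lines)

sample = {
    "answer": "Earth is 4.54 billion years old.",
    "citations": [
        {"id": 1, "document": "earth.pdf", "pages": "2-2",
         "text": "Earth formed 4.54 billion years ago"},
    ],
}
print(render_with_citations(sample))
```

Placing markers next to the specific sentence they support requires mapping each response text block to its own citations, which the block-by-block structure of the API response makes straightforward.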
3. Handle missing citations gracefully:
if not result['citations']:
    logger.info("No citations found - Claude may be using general knowledge")
    # Optionally warn the user that the answer isn't grounded in the provided document
4. Validate citation integrity:
def validate_citations(response, expected_document_title):
    """Ensure citations reference the correct document"""
    for block in response.content:
        if hasattr(block, 'citations'):
            for citation in block.citations:
                if citation.document_title != expected_document_title:
                    logger.warning(
                        f"Unexpected citation source: {citation.document_title}"
                    )
5. Store citations for audit trails:
from datetime import datetime

def log_citation_audit(user_id, question, answer, citations):
    """Log Q&A with citations for compliance/audit purposes"""
    audit_record = {
        "timestamp": datetime.now().isoformat(),
        "user_id": user_id,
        "question": question,
        "answer": answer,
        "citations": citations,
        "source_documents": [c["document"] for c in citations],
    }
    # Store in audit log database
    save_audit_record(audit_record)
Common Use Cases and Patterns
1. Document Summarization
Task: Generate executive summaries of long documents
Pattern:
def summarize_pdf(pdf_path, length="one paragraph"):
    with open(pdf_path, "rb") as f:
        file_bytes = base64.standard_b64encode(f.read()).decode("utf-8")

    messages = []
    add_user_message(messages, [
        {
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": file_bytes,
            },
        },
        {
            "type": "text",
            "text": f"""Provide a {length} summary of this document covering:
1. Main topic and purpose
2. Key findings or arguments
3. Important conclusions or recommendations

Focus on the most critical information a decision-maker needs to know.""",
        },
    ])
    return chat(messages)
Example:
Input: 50-page research paper on climate change
Output: "This paper analyzes global temperature trends from 1950-2020, finding that average temperatures have risen 1.2°C with accelerating rates in the past two decades, and recommends immediate policy interventions to limit warming to 1.5°C by 2050."
2. Question Answering
Task: Answer specific questions about document content
Pattern:
def answer_question_from_pdf(pdf_path, question):
    with open(pdf_path, "rb") as f:
        file_bytes = base64.standard_b64encode(f.read()).decode("utf-8")

    messages = []
    add_user_message(messages, [
        {
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": file_bytes,
            },
        },
        {
            "type": "text",
            "text": f"""Answer this question based on the document:

{question}

Provide a direct answer with specific references to relevant sections or page numbers if mentioned in the document.""",
        },
    ])
    return chat(messages)
Example:
Question: "What is the recommended dosage for patients over 65?"
Answer: "For patients over 65, the recommended dosage is 50mg once daily, as stated in Section 4.2 (Dosage and Administration). The document notes this is half the standard adult dose due to reduced renal clearance in elderly patients."
3. Data Extraction
Task: Extract structured data from tables, forms, or reports
Pattern:
def extract_invoice_data(pdf_path):
    with open(pdf_path, "rb") as f:
        file_bytes = base64.standard_b64encode(f.read()).decode("utf-8")

    messages = []
    add_user_message(messages, [
        {
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": file_bytes,
            },
        },
        {
            "type": "text",
            "text": """Extract the following information from this invoice:
1. Invoice number and date
2. Vendor name and address
3. Customer name and address
4. Line items (description, quantity, unit price, total)
5. Subtotal, tax, and total amount
6. Payment terms and due date

Format your response as JSON with these exact keys:
{
    "invoice_number": "",
    "invoice_date": "",
    "vendor": {"name": "", "address": ""},
    "customer": {"name": "", "address": ""},
    "line_items": [{"description": "", "quantity": 0, "unit_price": 0, "total": 0}],
    "subtotal": 0,
    "tax": 0,
    "total": 0,
    "payment_terms": "",
    "due_date": ""
}

If any field is not present, use null.""",
        },
    ])
    return chat(messages)
4. Contract Analysis
Task: Review contracts for key terms, risks, and obligations
Pattern:
def analyze_contract(pdf_path):
    with open(pdf_path, "rb") as f:
        file_bytes = base64.standard_b64encode(f.read()).decode("utf-8")

    messages = []
    add_user_message(messages, [
        {
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": file_bytes,
            },
        },
        {
            "type": "text",
            "text": """Analyze this contract and provide a structured review:

1. PARTIES: Identify all parties to the agreement

2. KEY TERMS:
   - Contract duration and renewal terms
   - Payment terms and amounts
   - Deliverables and deadlines
   - Termination conditions

3. OBLIGATIONS:
   - List obligations for each party
   - Note any performance guarantees or SLAs

4. RISK FACTORS:
   - Identify liability clauses
   - Note indemnification requirements
   - Flag any unusual or concerning terms
   - Highlight missing standard protections

5. RECOMMENDATIONS:
   - Suggest areas for negotiation
   - Note any red flags requiring legal review

Format each section clearly with bullet points.""",
        },
    ])
    return chat(messages)
5. Compliance Checking
Task: Verify documents meet regulatory or policy requirements
Pattern:
def check_compliance(pdf_path, requirements):
    with open(pdf_path, "rb") as f:
        file_bytes = base64.standard_b64encode(f.read()).decode("utf-8")

    messages = []
    add_user_message(messages, [
        {
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": file_bytes,
            },
        },
        {
            "type": "text",
            "text": f"""Check if this document meets the following requirements:

{requirements}

For each requirement:
1. State whether it is MET, PARTIALLY MET, or NOT MET
2. Provide specific evidence from the document
3. If not met, explain what is missing

Provide a final compliance score (percentage of requirements met).""",
        },
    ])
    return chat(messages)
6. Document Comparison
Task: Compare two versions of a document or two related documents
Pattern:
def compare_documents(pdf_path_1, pdf_path_2):
    with open(pdf_path_1, "rb") as f:
        file_bytes_1 = base64.standard_b64encode(f.read()).decode("utf-8")
    with open(pdf_path_2, "rb") as f:
        file_bytes_2 = base64.standard_b64encode(f.read()).decode("utf-8")

    messages = []
    add_user_message(messages, [
        {
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": file_bytes_1,
            },
        },
        {
            "type": "text",
            "text": "This is the first document (original version).",
        },
        {
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": file_bytes_2,
            },
        },
        {
            "type": "text",
            "text": """This is the second document (revised version).

Compare these two documents and identify:
1. ADDITIONS: New content in the second document
2. DELETIONS: Content removed from the first document
3. MODIFICATIONS: Content that changed between versions
4. IMPACT ASSESSMENT: Which changes are substantive vs. cosmetic

Focus on meaningful differences, not minor formatting changes.""",
        },
    ])
    return chat(messages)
Real-World Example: Research Paper Analysis
Let's walk through a complete example: analyzing a Wikipedia article about Earth saved as a PDF.
The Task
Process a multi-page PDF containing a comprehensive article about Earth and extract key information.
Implementation
import base64

def analyze_earth_article(pdf_path):
    # Read and encode PDF
    with open(pdf_path, "rb") as f:
        file_bytes = base64.standard_b64encode(f.read()).decode("utf-8")

    # Create message
    messages = []
    add_user_message(messages, [
        {
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": file_bytes,
            },
        },
        {
            "type": "text",
            "text": """Analyze this article about Earth and provide:
1. A one-sentence summary
2. The main topics covered (list)
3. Any notable statistics or data points
4. Key images or diagrams mentioned

Keep your response concise and factual.""",
        },
    ])

    # Get response
    response = chat(messages)
    return response
Sample Output
1. SUMMARY:
This article provides a comprehensive overview of Earth, covering its formation, structure, atmosphere, climate, and role as humanity's home planet.
2. MAIN TOPICS:
- Formation and geological history (4.5 billion years ago)
- Internal structure (core, mantle, crust)
- Atmosphere composition and layers
- Hydrosphere (oceans, water cycle)
- Biosphere and biodiversity
- Climate systems and patterns
- Human impact and environmental challenges
3. NOTABLE STATISTICS:
- Earth's radius: 6,371 km
- Surface area: 510 million km²
- 71% covered by water
- Atmosphere: 78% nitrogen, 21% oxygen
- Age: 4.54 billion years
- Only known planet with life
4. KEY VISUALS:
- Cross-section diagram showing Earth's layers
- Map of tectonic plates
- Atmospheric composition chart
- Water cycle illustration
This demonstrates Claude's ability to understand complex document structure, extract specific information, and present it in a structured format.
Best Practices for Production Systems
1. Optimize File Size
PDFs can be large. Compress them before sending to reduce token costs and latency:
import os
from PyPDF2 import PdfReader, PdfWriter

def compress_pdf(input_path, output_path):
    """Compress a PDF by deflating its content streams (lossless)"""
    reader = PdfReader(input_path)
    writer = PdfWriter()
    for page in reader.pages:
        # Losslessly compress the page's content streams
        # (note: this does not re-encode embedded images)
        page.compress_content_streams()
        writer.add_page(page)
    with open(output_path, "wb") as f:
        writer.write(f)
    # Check size reduction
    original_size = os.path.getsize(input_path) / (1024 * 1024)  # MB
    compressed_size = os.path.getsize(output_path) / (1024 * 1024)  # MB
    print(f"Original: {original_size:.2f}MB → Compressed: {compressed_size:.2f}MB")
    print(f"Reduction: {((original_size - compressed_size) / original_size * 100):.1f}%")
2. Handle Large Documents
For PDFs exceeding 100 pages, split them into chunks:
import base64
import io
from PyPDF2 import PdfReader, PdfWriter

def split_pdf(input_path, pages_per_chunk=50):
    """Split large PDF into smaller chunks"""
    reader = PdfReader(input_path)
    total_pages = len(reader.pages)
    chunks = []
    for i in range(0, total_pages, pages_per_chunk):
        writer = PdfWriter()
        # Add pages to chunk
        for page_num in range(i, min(i + pages_per_chunk, total_pages)):
            writer.add_page(reader.pages[page_num])
        # Save chunk to bytes
        chunk_bytes = io.BytesIO()
        writer.write(chunk_bytes)
        chunk_bytes.seek(0)
        chunks.append({
            "pages": f"{i + 1}-{min(i + pages_per_chunk, total_pages)}",
            "data": base64.standard_b64encode(chunk_bytes.read()).decode("utf-8"),
        })
    return chunks

# Process each chunk separately
def process_large_pdf(pdf_path, question):
    chunks = split_pdf(pdf_path, pages_per_chunk=50)
    results = []
    for chunk in chunks:
        messages = []
        add_user_message(messages, [
            {
                "type": "document",
                "source": {
                    "type": "base64",
                    "media_type": "application/pdf",
                    "data": chunk["data"],
                },
            },
            {
                "type": "text",
                "text": f"Pages {chunk['pages']}: {question}",
            },
        ])
        results.append(chat(messages))
    # Combine results
    return combine_chunk_results(results)
3. Cache Processed Documents
If you're querying the same document multiple times, cache the encoded PDF:
import hashlib
from functools import lru_cache

pdf_cache = {}  # simple in-memory cache; swap for Redis or similar in production

@lru_cache(maxsize=100)
def get_pdf_hash(pdf_path):
    """Generate hash of PDF for cache key"""
    with open(pdf_path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def process_pdf_cached(pdf_path, query):
    cache_key = f"{get_pdf_hash(pdf_path)}_{hash(query)}"
    # Check cache
    if cache_key in pdf_cache:
        return pdf_cache[cache_key]
    # Process PDF
    result = process_pdf(pdf_path, query)
    # Store in cache
    pdf_cache[cache_key] = result
    return result
4. Validate PDF Files
Check PDFs before sending to avoid errors:
import os
from PyPDF2 import PdfReader

def validate_pdf(pdf_path):
    """Validate PDF before processing"""
    errors = []

    # Check file exists
    if not os.path.exists(pdf_path):
        errors.append(f"File not found: {pdf_path}")
        return False, errors

    # Check file size (max 32MB)
    file_size = os.path.getsize(pdf_path) / (1024 * 1024)  # MB
    if file_size > 32:
        errors.append(f"File too large: {file_size:.2f}MB (max 32MB)")

    # Check if valid PDF
    try:
        reader = PdfReader(pdf_path)
        # Check if encrypted (before touching pages, which raises on encrypted files)
        if reader.is_encrypted:
            errors.append("PDF is encrypted - decrypt before processing")
        else:
            # Check page count (max 100 pages per request)
            page_count = len(reader.pages)
            if page_count > 100:
                errors.append(f"Too many pages: {page_count} (max 100 per request)")
    except Exception as e:
        errors.append(f"Invalid PDF: {str(e)}")

    return len(errors) == 0, errors

# Use in your processing pipeline
def safe_process_pdf(pdf_path, query):
    valid, errors = validate_pdf(pdf_path)
    if not valid:
        return {"error": "PDF validation failed", "details": errors}
    return process_pdf(pdf_path, query)
5. Monitor Token Usage
Track token consumption to optimize costs:
from PyPDF2 import PdfReader

def estimate_pdf_tokens(pdf_path):
    """Estimate token cost for processing a PDF"""
    reader = PdfReader(pdf_path)
    page_count = len(reader.pages)
    # Rough estimate: 2,800 tokens per page (standard letter size at 150 DPI)
    tokens_per_page = 2800
    estimated_tokens = page_count * tokens_per_page
    return {
        "pages": page_count,
        "estimated_tokens": estimated_tokens,
        "estimated_cost_usd": estimated_tokens * 0.000003,  # Example rate
    }

# Log before processing
def process_pdf_with_logging(pdf_path, query):
    estimate = estimate_pdf_tokens(pdf_path)
    logger.info(f"Processing PDF: {pdf_path}")
    logger.info(f"Pages: {estimate['pages']}")
    logger.info(f"Estimated tokens: {estimate['estimated_tokens']:,}")
    logger.info(f"Estimated cost: ${estimate['estimated_cost_usd']:.4f}")
    # Alert if unusually expensive
    if estimate['estimated_tokens'] > 100000:
        logger.warning(f"High token usage: {estimate['estimated_tokens']:,} tokens")
    return process_pdf(pdf_path, query)
Common Pitfalls and Solutions
Pitfall 1: Sending Encrypted PDFs
Problem: Claude cannot process password-protected or encrypted PDFs.
Solution: Decrypt PDFs before processing:
from PyPDF2 import PdfReader, PdfWriter

def decrypt_pdf(input_path, output_path, password):
    reader = PdfReader(input_path)
    if reader.is_encrypted:
        reader.decrypt(password)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    with open(output_path, "wb") as f:
        writer.write(f)
Pitfall 2: Expecting Perfect OCR on Scanned Documents
Problem: Scanned PDFs (images of text) may have lower accuracy than native text PDFs.
Solution: For critical applications, pre-process scanned PDFs with dedicated OCR:
import pytesseract
from pdf2image import convert_from_path
from PyPDF2 import PdfWriter

def ocr_scanned_pdf(input_path, output_path):
    """Convert scanned PDF to searchable PDF"""
    # Convert PDF to images
    images = convert_from_path(input_path)
    # OCR each page
    writer = PdfWriter()
    for image in images:
        # Run OCR
        text = pytesseract.image_to_pdf_or_hocr(image, extension='pdf')
        # Add to output PDF
        # ... (implementation details)
    # Save searchable PDF
    with open(output_path, "wb") as f:
        writer.write(f)
Pitfall 3: Not Handling Multi-Document Requests
Problem: Sending multiple large PDFs in one request exceeds the 100-page limit.
Solution: Calculate total pages before sending:
from PyPDF2 import PdfReader

def process_multiple_pdfs(pdf_paths, query):
    # Calculate total pages
    total_pages = sum(len(PdfReader(path).pages) for path in pdf_paths)
    if total_pages > 100:
        return {
            "error": "Too many pages",
            "total_pages": total_pages,
            "max_pages": 100,
            "suggestion": "Process PDFs separately or reduce page count",
        }
    # Process all PDFs together
    # ...
Pitfall 4: Vague Queries
Problem: Asking "What's in this document?" produces generic summaries.
Solution: Be specific about what you need:
# Bad
query = "Summarize this document"
# Good
query = """Summarize this contract focusing on:
1. Payment terms and amounts
2. Termination conditions
3. Liability limitations
4. Any unusual or non-standard clauses"""
Pitfall 5: Not Validating Structured Outputs
Problem: Expecting JSON but getting prose, or malformed JSON.
Solution: Validate and retry if needed:
import json

def extract_structured_data(pdf_path, schema):
    response = process_pdf(pdf_path, f"Extract data as JSON: {schema}")
    try:
        data = json.loads(response)
        # Validate against schema
        validate_schema(data, schema)
        return data
    except json.JSONDecodeError:
        # Retry with more explicit instructions
        response = process_pdf(
            pdf_path,
            f"Extract data as valid JSON only (no markdown, no explanation): {schema}"
        )
        return json.loads(response)
Conclusion
Claude's PDF processing capabilities transform document analysis from a multi-tool pipeline into a single, intelligent API call. Whether you're summarizing reports, extracting data, reviewing contracts, or answering questions, Claude can understand the full context of your documents—text, images, tables, and structure.
By implementing citations, you transform Claude from a "black box" that provides answers into a transparent research assistant that shows its work. This builds user trust and enables them to dive deeper into your source materials when needed.
Key Takeaways
✅ Simple implementation - Nearly identical to image processing with document blocks
✅ Comprehensive understanding - Handles text, images, tables, and complex layouts
✅ Flexible queries - From simple summaries to complex structured extraction
✅ Production-ready - With proper validation, caching, and error handling
✅ Cost-effective - ~2,800 tokens per page for standard documents
✅ Transparent with citations - Show users exactly where information comes from
Best Practices Recap
- Validate PDFs before processing (size, encryption, page count)
- Compress large files to reduce token costs
- Split documents over 100 pages into chunks
- Cache results for repeated queries on the same document
- Be specific in your queries for better results
- Validate outputs especially for structured data extraction
- Monitor costs by tracking token usage per document
- Enable citations when transparency and verifiability matter
When to Use PDF Processing
Use Claude's PDF processing when:
- You need to understand document context, not just extract text
- Documents contain mixed content (text + images + tables)
- You want to ask questions or perform analysis, not just parse
- You need flexibility in output format (summaries, JSON, Q&A, etc.)
- Transparency and source attribution are important
Use traditional tools when:
- You only need raw text extraction (use PyPDF2, pdfplumber)
- You're processing thousands of documents in batch (cost considerations)
- You need pixel-perfect layout preservation
- Documents are extremely large (500+ pages)
Claude's PDF processing shines when you need intelligence, not just extraction. It understands context, relationships, and meaning—turning static documents into queryable knowledge with verifiable sources.
Further Reading
- Claude Document Processing Documentation
- Citations Documentation
- Prompt Engineering for Documents
- Vision Capabilities Guide (similar concepts apply)
- Token Counting and Optimization
Ready to process PDFs with Claude? Start with a simple document, ask a specific question, and iterate based on results. Enable citations to build trust and transparency. The same prompt engineering principles that work for text and images apply to documents—be specific, provide structure, and test thoroughly.