Building the Core Evaluation Pipeline
By Vuk Dukic, Founder and Senior Software Engineer
Now that we have our evaluation dataset ready, it's time to build the core evaluation pipeline. The workflow is straightforward: take each test case, merge it with our prompt template, send the result to Claude, and grade the output with a grader system.
Building the Core Functions
The evaluation pipeline consists of three main functions, each with a specific responsibility. Let's start with the simplest one—the function that handles individual prompts.
The run_prompt Function
This function takes a test case and merges it with our prompt template:
def run_prompt(test_case):
    """Merges the prompt and test case input, then returns the result"""
    prompt = f"""
Please solve the following task:
{test_case["task"]}
"""
    messages = []
    add_user_message(messages, prompt)
    output = chat(messages)
    return output
What's happening here:
- We take the task field from our test case
- We merge it into our prompt template
- We send it to Claude using our chat helper function
- We return Claude's raw output
Right now, we're keeping the prompt extremely simple. We're not including any formatting instructions, so Claude will likely return more verbose output than we need. We'll refine this later as we iterate on our prompt design.
The run_test_case Function
This function orchestrates running a single test case and grading the result:
def run_test_case(test_case):
    """Calls run_prompt, then grades the result"""
    output = run_prompt(test_case)

    # TODO - Grading
    score = 10

    return {
        "output": output,
        "test_case": test_case,
        "score": score
    }
What's happening here:
- We call run_prompt to get Claude's response
- We grade the output (currently hardcoded to 10)
- We return a structured result containing:
  - The output from Claude
  - The original test case
  - The score
For now, we're using a hardcoded score of 10. The grading logic is where we'll spend significant time in upcoming sections, but this placeholder lets us test the overall pipeline.
The run_eval Function
This function coordinates the entire evaluation process:
def run_eval(dataset):
    """Loads the dataset and calls run_test_case with each case"""
    results = []
    for test_case in dataset:
        result = run_test_case(test_case)
        results.append(result)
    return results
What's happening here:
- We iterate through every test case in our dataset
- For each one, we call run_test_case to get a result
- We collect all results into a single list
- We return the complete evaluation results
This function processes every test case in our dataset and collects all the results into a single list.
Running the Evaluation
To execute our evaluation pipeline, we load our dataset and run it through our functions:
with open("dataset.json", "r") as f:
    dataset = json.load(f)

results = run_eval(dataset)
Performance note: The first time you run this, expect it to take some time—even with Claude Haiku, it can take around 30 seconds to process a full dataset. We'll cover optimization techniques later.
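If you want to measure this yourself, you can wrap the run_eval call in a simple timer. This is just a sketch using the standard library, not part of the pipeline itself:

import time

start = time.perf_counter()
results = run_eval(dataset)
elapsed = time.perf_counter() - start
print(f"Evaluated {len(results)} test cases in {elapsed:.1f} seconds")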
Examining the Results
The evaluation returns a list of structured results. Printed as JSON, each object represents one test case result:
print(json.dumps(results, indent=2))
Example Output Structure
[
  {
    "output": "Here's a Python function that lists all S3 buckets in an AWS account:\n\n```python\nimport boto3\n\ndef list_s3_buckets():\n \"\"\"\n Lists all S3 buckets in the AWS account.\n \n Returns:\n list: A list of bucket names\n \"\"\"\n s3_client = boto3.client('s3')\n response = s3_client.list_buckets()\n \n buckets = [bucket['Name'] for bucket in response['Buckets']]\n return buckets\n\n# Example usage\nif __name__ == \"__main__\":\n buckets = list_s3_buckets()\n print(f\"Found {len(buckets)} buckets:\")\n for bucket in buckets:\n print(f\" - {bucket}\")\n```\n\nThis function uses the boto3 library to interact with AWS S3...",
    "test_case": {
      "task": "Write a Python function that lists all S3 buckets in an AWS account"
    },
    "score": 10
  },
  {
    "output": "Here's a JSON configuration for an AWS Lambda function with the specified settings:\n\n```json\n{\n \"FunctionName\": \"my-lambda-function\",\n \"Runtime\": \"python3.9\",\n \"Role\": \"arn:aws:iam::123456789012:role/lambda-execution-role\",\n \"Handler\": \"lambda_function.lambda_handler\",\n \"MemorySize\": 512,\n \"Timeout\": 30\n}\n```\n\nLet me explain each field...",
    "test_case": {
      "task": "Create a JSON configuration for an AWS Lambda function with 512MB memory and a 30-second timeout"
    },
    "score": 10
  },
  {
    "output": "Here's a regex pattern to validate AWS ARN format for S3 buckets:\n\n```regex\n^arn:aws:s3:::[a-z0-9][a-z0-9\\-]{1,61}[a-z0-9]$\n```\n\nThis pattern validates:\n- The ARN prefix `arn:aws:s3:::`\n- Bucket names that start and end with lowercase letters or numbers\n- Bucket names between 3-63 characters\n- Hyphens allowed in the middle...",
    "test_case": {
      "task": "Write a regex pattern to validate AWS ARN format for S3 buckets"
    },
    "score": 10
  }
]
What Each Result Contains
Each result contains three key pieces of information:
- output: The complete response from Claude
- test_case: The original test case that was processed
- score: The evaluation score (currently hardcoded to 10)
As you can see in the output, Claude generates quite verbose responses since we haven't provided specific formatting instructions yet. This is exactly the kind of issue we'll address as we refine our prompts.
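One quick way to see this in your own results is to print the length of each response next to its task. This is a throwaway check built on the result structure above, not part of the pipeline:

for result in results:
    task = result["test_case"]["task"]
    # Character count is a rough proxy for verbosity
    print(f"{len(result['output']):5d} chars | {task}")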
Analyzing the Current Output
Looking at the results, we can identify several patterns:
✅ What's Working
- Correctness: Claude is solving the tasks accurately
- Code quality: The Python, JSON, and regex outputs are syntactically correct
- Completeness: Each response includes working code
⚠️ What Needs Improvement
- Verbosity: Claude adds explanations, comments, and example usage
- Format inconsistency: Some responses include markdown code blocks, others don't
- Extra content: We only need the solution itself, but we're getting tutorial-style walkthroughs
This is where prompt iteration comes in. Our next step will be to refine the prompt to get cleaner, more focused output.
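To make that concrete, here's one possible refinement of run_prompt. Treat it as a sketch of the direction rather than the exact prompt we'll settle on, since we'll iterate on the wording in the next sections:

def run_prompt(test_case):
    """Merges the prompt and test case input, then returns the result"""
    prompt = f"""
Please solve the following task:
{test_case["task"]}

Respond with only the solution itself (code, JSON, or regex).
Do not include explanations, comments, or example usage.
"""
    messages = []
    add_user_message(messages, prompt)
    output = chat(messages)
    return output

We haven't applied this change yet; the full code at the end of this section still uses the simple prompt.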
What We've Accomplished
At this point, we've successfully built the core evaluation pipeline. We can:
✅ Load our dataset of test cases
✅ Process each one through Claude
✅ Collect structured results with outputs and metadata
✅ Examine the results to identify patterns
The major missing piece is the grading system—that hardcoded score of 10 needs to be replaced with actual evaluation logic.
The Foundation of AI Evaluation
This pipeline represents the foundation of most AI evaluation systems. While it may seem simple, you've just built the majority of what an eval pipeline actually does. The complexity comes in the details:
- Better prompts: Refining instructions to get the exact output format you need
- Sophisticated grading: Replacing hardcoded scores with objective evaluation criteria
- Performance optimizations: Batching requests, parallel processing, caching results (see the sketch below for one way to parallelize)
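As a taste of the parallel-processing idea, here's a minimal sketch using a thread pool around the run_test_case function defined above. A production version would also need rate-limit handling and retries:

from concurrent.futures import ThreadPoolExecutor

def run_eval_parallel(dataset, max_workers=4):
    """Runs test cases concurrently instead of one at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves the order of the input dataset
        results = list(executor.map(run_test_case, dataset))
    return results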
Next Steps
In the upcoming sections, we'll tackle the critical missing piece: graders.
We'll explore:
- Rule-based grading: Checking for specific patterns in the output
- LLM-as-judge grading: Using Claude to evaluate Claude's own outputs
- Hybrid approaches: Combining multiple grading strategies
- Scoring rubrics: Defining what makes a "10" vs. a "5"
Graders will transform our hardcoded scores into meaningful evaluations of Claude's performance, allowing us to objectively measure prompt improvements.
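To give a flavor of the rule-based approach before we build it properly, here's a hypothetical grader with a couple of simple checks. The function name and the rules themselves are placeholders, not the graders we'll actually implement:

def grade_by_rules(output):
    """Illustrative only: scores an output from 1 to 10 with simple checks."""
    score = 10
    # Penalize tutorial-style framing, since we want just the solution
    if "Here's" in output or "Let me explain" in output:
        score -= 3
    # Penalize very long responses as a rough proxy for verbosity
    if len(output) > 1500:
        score -= 3
    return max(score, 1)

In run_test_case, the hardcoded score = 10 would then become score = grade_by_rules(output).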
Full Code So Far
Here's the complete evaluation pipeline code:
import json
from anthropic import Anthropic

client = Anthropic(api_key="your-api-key")
model = "claude-3-haiku-20240307"


def add_user_message(messages, text):
    user_message = {"role": "user", "content": text}
    messages.append(user_message)


def add_assistant_message(messages, text):
    assistant_message = {"role": "assistant", "content": text}
    messages.append(assistant_message)


def chat(messages, system=None, temperature=1.0, stop_sequences=[]):
    params = {
        "model": model,
        "max_tokens": 1000,
        "messages": messages,
        "temperature": temperature
    }

    if system:
        params["system"] = system

    if stop_sequences:
        params["stop_sequences"] = stop_sequences

    response = client.messages.create(**params)
    return response.content[0].text


def run_prompt(test_case):
    """Merges the prompt and test case input, then returns the result"""
    prompt = f"""
Please solve the following task:
{test_case["task"]}
"""
    messages = []
    add_user_message(messages, prompt)
    output = chat(messages)
    return output


def run_test_case(test_case):
    """Calls run_prompt, then grades the result"""
    output = run_prompt(test_case)

    # TODO - Grading
    score = 10

    return {
        "output": output,
        "test_case": test_case,
        "score": score
    }


def run_eval(dataset):
    """Loads the dataset and calls run_test_case with each case"""
    results = []
    for test_case in dataset:
        result = run_test_case(test_case)
        results.append(result)
    return results


# Load dataset and run evaluation
with open("dataset.json", "r") as f:
    dataset = json.load(f)

results = run_eval(dataset)

# Display results
print(json.dumps(results, indent=2))

# Calculate average score (currently all 10s)
avg_score = sum(r["score"] for r in results) / len(results)
print(f"\n📊 Average Score: {avg_score}/10")
Key Takeaways
- Separation of concerns: Each function has a single, clear responsibility
- Structured output: Results are consistently formatted for easy analysis
- Iterative design: We start simple and add complexity where needed
- Foundation first: Build the pipeline before optimizing grading logic
- Real data reveals issues: Running the eval shows us exactly what needs improvement
Next, we'll dive into the critical topic of graders and replace those hardcoded scores with real evaluation logic.
Ready to build sophisticated grading systems? Continue to the next guide on implementing graders for prompt evaluation.