Building the Core Evaluation Pipeline
By Vuk Dukic, Founder and Senior Software Engineer
Now that we have our evaluation dataset ready, it's time to build the core evaluation pipeline. The workflow is straightforward: take each test case, merge it with our prompt template, send the result to Claude, and grade the output with a grader system.
Building the Core Functions
The evaluation pipeline consists of three main functions, each with a specific responsibility. Let's start with the simplest one—the function that handles individual prompts.
The run_prompt Function
This function takes a test case and merges it with our prompt template:
def run_prompt(test_case):
    """Merges the prompt and test case input, then returns the result"""
    prompt = f"""
Please solve the following task:
{test_case["task"]}
"""
    messages = []
    add_user_message(messages, prompt)
    output = chat(messages)
    return output
What's happening here:
- We take the task field from our test case
- We merge it into our prompt template
- We send it to Claude using our chat helper function
- We return Claude's raw output
Right now, we're keeping the prompt extremely simple. We're not including any formatting instructions, so Claude will likely return more verbose output than we need. We'll refine this later as we iterate on our prompt design.
The run_test_case Function
This function orchestrates running a single test case and grading the result:
def run_test_case(test_case):
    """Calls run_prompt, then grades the result"""
    output = run_prompt(test_case)

    # TODO - Grading
    score = 10

    return {
        "output": output,
        "test_case": test_case,
        "score": score
    }
What's happening here:
- We call run_prompt to get Claude's response
- We grade the output (currently hardcoded to 10)
- We return a structured result containing:
  - The output from Claude
  - The original test case
  - The score
For now, we're using a hardcoded score of 10. The grading logic is where we'll spend significant time in upcoming sections, but this placeholder lets us test the overall pipeline.
The run_eval Function
This function coordinates the entire evaluation process:
def run_eval(dataset):
    """Loads the dataset and calls run_test_case with each case"""
    results = []
    for test_case in dataset:
        result = run_test_case(test_case)
        results.append(result)
    return results
What's happening here:
- We iterate through every test case in our dataset
- For each one, we call run_test_case to get a result
- We collect all results into a single list
- We return the complete evaluation results
This function processes every test case in our dataset and collects all the results into a single list.
Running the Evaluation
To execute our evaluation pipeline, we load our dataset and run it through our functions:
with open("dataset.json", "r") as f:
    dataset = json.load(f)

results = run_eval(dataset)
Performance note: The first time you run this, expect it to take some time—even with Claude Haiku, it can take around 30 seconds to process a full dataset. We'll cover optimization techniques later.
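If you want to measure this yourself, you can wrap the run_eval call in a simple timer. This is just a sketch using the standard library, not part of the pipeline itself:

import time

start = time.perf_counter()
results = run_eval(dataset)
elapsed = time.perf_counter() - start
print(f"Evaluated {len(results)} test cases in {elapsed:.1f} seconds")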
Examining the Results
The evaluation returns a list of structured results. Printed as JSON, each object represents one test case result:
print(json.dumps(results, indent=2))
Example Output Structure
[
  {
    "output": "Here's a Python function that lists all S3 buckets in an AWS account:\n\n```python\nimport boto3\n\ndef list_s3_buckets():\n \"\"\"\n Lists all S3 buckets in the AWS account.\n \n Returns:\n list: A list of bucket names\n \"\"\"\n s3_client = boto3.client('s3')\n response = s3_client.list_buckets()\n \n buckets = [bucket['Name'] for bucket in response['Buckets']]\n return buckets\n\n# Example usage\nif __name__ == \"__main__\":\n buckets = list_s3_buckets()\n print(f\"Found {len(buckets)} buckets:\")\n for bucket in buckets:\n print(f\" - {bucket}\")\n```\n\nThis function uses the boto3 library to interact with AWS S3...",
    "test_case": {
      "task": "Write a Python function that lists all S3 buckets in an AWS account"
    },
    "score": 10
  },
  {
    "output": "Here's a JSON configuration for an AWS Lambda function with the specified settings:\n\n```json\n{\n \"FunctionName\": \"my-lambda-function\",\n \"Runtime\": \"python3.9\",\n \"Role\": \"arn:aws:iam::123456789012:role/lambda-execution-role\",\n \"Handler\": \"lambda_function.lambda_handler\",\n \"MemorySize\": 512,\n \"Timeout\": 30\n}\n```\n\nLet me explain each field...",
    "test_case": {
      "task": "Create a JSON configuration for an AWS Lambda function with 512MB memory and a 30-second timeout"
    },
    "score": 10
  },
  {
    "output": "Here's a regex pattern to validate AWS ARN format for S3 buckets:\n\n```regex\n^arn:aws:s3:::[a-z0-9][a-z0-9\\-]{1,61}[a-z0-9]$\n```\n\nThis pattern validates:\n- The ARN prefix `arn:aws:s3:::`\n- Bucket names that start and end with lowercase letters or numbers\n- Bucket names between 3-63 characters\n- Hyphens allowed in the middle...",
    "test_case": {
      "task": "Write a regex pattern to validate AWS ARN format for S3 buckets"
    },
    "score": 10
  }
]
What Each Result Contains
Each result contains three key pieces of information:
- output: The complete response from Claude
- test_case: The original test case that was processed
- score: The evaluation score (currently hardcoded to 10)
As you can see in the output, Claude generates quite verbose responses since we haven't provided specific formatting instructions yet. This is exactly the kind of issue we'll address as we refine our prompts.
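One quick way to see this in your own results is to print the length of each response next to its task. This is a throwaway check built on the result structure above, not part of the pipeline:

for result in results:
    task = result["test_case"]["task"]
    # Character count is a rough proxy for verbosity
    print(f"{len(result['output']):5d} chars | {task}")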
Analyzing the Current Output
Looking at the results, we can identify several patterns:
✅ What's Working
- Correctness: Claude is solving the tasks accurately
- Code quality: The Python, JSON, and regex outputs are syntactically correct
- Completeness: Each response includes working code
⚠️ What Needs Improvement
- Verbosity: Claude adds explanations, comments, and example usage
- Format inconsistency: Some responses include markdown code blocks, others don't
- Extra content: We only need the solution itself, but we're getting tutorial-style walkthroughs
This is where prompt iteration comes in. Our next step will be to refine the prompt to get cleaner, more focused output.
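To make that concrete, here's one possible refinement of run_prompt. Treat it as a sketch of the direction rather than the exact prompt we'll settle on, since we'll iterate on the wording in the next sections:

def run_prompt(test_case):
    """Merges the prompt and test case input, then returns the result"""
    prompt = f"""
Please solve the following task:
{test_case["task"]}

Respond with only the solution itself (code, JSON, or regex).
Do not include explanations, comments, or example usage.
"""
    messages = []
    add_user_message(messages, prompt)
    output = chat(messages)
    return output

We haven't applied this change yet; the full code at the end of this section still uses the simple prompt.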
What We've Accomplished
At this point, we've successfully built the core evaluation pipeline. We can:
✅ Load our dataset of test cases
✅ Process each one through Claude
✅ Collect structured results with outputs and metadata
✅ Examine the results to identify patterns
The major missing piece is the grading system—that hardcoded score of 10 needs to be replaced with actual evaluation logic.
The Foundation of AI Evaluation
This pipeline represents the foundation of most AI evaluation systems. While it may seem simple, you've just built the majority of what an eval pipeline actually does. The complexity comes in the details:
- Better prompts: Refining instructions to get the exact output format you need
- Sophisticated grading: Replacing hardcoded scores with objective evaluation criteria
- Performance optimizations: Batching requests, parallel processing, caching results (see the sketch below for one way to parallelize)
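As a taste of the parallel-processing idea, here's a minimal sketch using a thread pool around the run_test_case function defined above. A production version would also need rate-limit handling and retries:

from concurrent.futures import ThreadPoolExecutor

def run_eval_parallel(dataset, max_workers=4):
    """Runs test cases concurrently instead of one at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves the order of the input dataset
        results = list(executor.map(run_test_case, dataset))
    return results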
Next Steps
In the upcoming sections, we'll tackle the critical missing piece: graders.
We'll explore:
- Rule-based grading: Checking for specific patterns in the output
- LLM-as-judge grading: Using Claude to evaluate Claude's own outputs
- Hybrid approaches: Combining multiple grading strategies
- Scoring rubrics: Defining what makes a "10" vs. a "5"
Graders will transform our hardcoded scores into meaningful evaluations of Claude's performance, allowing us to objectively measure prompt improvements.
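To give a flavor of the rule-based approach before we build it properly, here's a hypothetical grader with a couple of simple checks. The function name and the rules themselves are placeholders, not the graders we'll actually implement:

def grade_by_rules(output):
    """Illustrative only: scores an output from 1 to 10 with simple checks."""
    score = 10
    # Penalize tutorial-style framing, since we want just the solution
    if "Here's" in output or "Let me explain" in output:
        score -= 3
    # Penalize very long responses as a rough proxy for verbosity
    if len(output) > 1500:
        score -= 3
    return max(score, 1)

In run_test_case, the hardcoded score = 10 would then become score = grade_by_rules(output).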
Full Code So Far
Here's the complete evaluation pipeline code:
import json
from anthropic import Anthropic

client = Anthropic(api_key="your-api-key")
model = "claude-3-haiku-20240307"


def add_user_message(messages, text):
    user_message = {"role": "user", "content": text}
    messages.append(user_message)


def add_assistant_message(messages, text):
    assistant_message = {"role": "assistant", "content": text}
    messages.append(assistant_message)


def chat(messages, system=None, temperature=1.0, stop_sequences=[]):
    params = {
        "model": model,
        "max_tokens": 1000,
        "messages": messages,
        "temperature": temperature
    }

    if system:
        params["system"] = system

    if stop_sequences:
        params["stop_sequences"] = stop_sequences

    response = client.messages.create(**params)
    return response.content[0].text


def run_prompt(test_case):
    """Merges the prompt and test case input, then returns the result"""
    prompt = f"""
Please solve the following task:
{test_case["task"]}
"""
    messages = []
    add_user_message(messages, prompt)
    output = chat(messages)
    return output


def run_test_case(test_case):
    """Calls run_prompt, then grades the result"""
    output = run_prompt(test_case)

    # TODO - Grading
    score = 10

    return {
        "output": output,
        "test_case": test_case,
        "score": score
    }


def run_eval(dataset):
    """Loads the dataset and calls run_test_case with each case"""
    results = []
    for test_case in dataset:
        result = run_test_case(test_case)
        results.append(result)
    return results


# Load dataset and run evaluation
with open("dataset.json", "r") as f:
    dataset = json.load(f)

results = run_eval(dataset)

# Display results
print(json.dumps(results, indent=2))

# Calculate average score (currently all 10s)
avg_score = sum(r["score"] for r in results) / len(results)
print(f"\n📊 Average Score: {avg_score}/10")
Key Takeaways
- Separation of concerns: Each function has a single, clear responsibility
- Structured output: Results are consistently formatted for easy analysis
- Iterative design: We start simple and add complexity where needed
- Foundation first: Build the pipeline before optimizing grading logic
- Real data reveals issues: Running the eval shows us exactly what needs improvement
Next, we'll dive into the critical topic of graders and replace those hardcoded scores with real evaluation logic.
Ready to build sophisticated grading systems? Continue to the next guide on implementing graders for prompt evaluation.