
Model-Based Grading for Prompt Evaluation

When building prompt evaluation workflows, grading systems provide objective signals about output quality. A grader takes model output and returns some kind of measurable feedback—typically a number between 1 and 10, where 10 represents high quality and 1 represents poor quality.

Grading is the critical component that transforms your evaluation pipeline from a data collection tool into a measurement system. Without grading, you can see what your prompt produces, but you can't objectively compare different versions or track improvements over time.


Types of Graders

There are three main approaches to grading model outputs:

1. Code Graders

Programmatically evaluate outputs using custom logic

2. Model Graders

Use another AI model to assess the quality

3. Human Graders

Have people manually review and score outputs

Each approach has distinct strengths and use cases. Let's explore them in detail.


Code Graders

Code graders let you implement any programmatic check you can imagine. They're fast, deterministic, and cost-free once implemented.

Common Use Cases

  • Checking output length: Ensure responses aren't too brief or verbose
  • Verifying output does/doesn't have certain words: Flag unwanted phrases or required terminology
  • Syntax validation: Check if JSON, Python, or regex is valid
  • Readability scores: Calculate Flesch-Kincaid or similar metrics
  • Format compliance: Verify outputs match expected structure

Example: Simple Code Grader

def grade_by_code(test_case, output):
    score = 10
    issues = []
    
    # Check 1: Output shouldn't be too short
    if len(output) < 20:
        score -= 3
        issues.append("Output too brief")
    
    # Check 2: Should not include explanatory text
    forbidden_phrases = ["Here's", "Let me explain", "This code"]
    for phrase in forbidden_phrases:
        if phrase in output:
            score -= 2
            issues.append(f"Contains explanatory phrase: '{phrase}'")
    
    # Check 3: Should contain code block markers
    if "```" not in output:
        score -= 1
        issues.append("Missing code block formatting")
    
    return {
        "score": max(1, score),
        "issues": issues
    }
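The other use cases in the list above follow the same pattern. A readability check, for instance, might lean on the third-party textstat package (an assumption here; any readability library would do):

import textstat  # third-party: pip install textstat

def grade_readability(output):
    """Sketch of a readability check: scores the output by its
    Flesch Reading Ease (higher means easier to read)."""
    ease = textstat.flesch_reading_ease(output)
    # Thresholds are illustrative; tune them to your audience
    score = 10 if ease >= 60 else 7 if ease >= 30 else 4
    return {"score": score, "flesch_reading_ease": ease}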

Advantages of Code Graders

Fast: No API calls, instant results
Deterministic: Same input always produces same score
Free: No additional costs
Transparent: Easy to understand why a score was assigned

Limitations

Rigid: Can't assess nuanced quality or semantic correctness
Brittle: May miss edge cases or produce false positives
Limited scope: Best for format/structure, not content quality


Model Graders

Model graders feed your original output into another API call for evaluation. This approach offers tremendous flexibility for assessing subjective or complex criteria.

Common Use Cases

  • Response quality: Overall coherence and usefulness
  • Instruction following: How well the output matches the request
  • Completeness: Whether all aspects of the task are addressed
  • Helpfulness: Practical value to the end user
  • Safety: Absence of harmful, biased, or inappropriate content
  • Accuracy: Factual correctness (when ground truth is available)

Why Model Graders Work

LLMs excel at nuanced evaluation tasks that would be difficult to encode in rules:

  • Understanding semantic similarity
  • Assessing tone and style
  • Evaluating logical coherence
  • Judging task completion
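For instance, deciding whether two answers mean the same thing is nearly impossible with string matching but is a single question for a model grader. A minimal sketch, using the same add_user_message and chat helpers as the grader implementation below:

def same_meaning(answer, reference):
    """Asks the model a yes/no semantic-equivalence question,
    a check that exact string comparison can't express."""
    messages = []
    add_user_message(
        messages,
        f"Do these two answers convey the same meaning? "
        f"Reply with only YES or NO.\n\nA: {answer}\n\nB: {reference}"
    )
    return chat(messages).strip().upper().startswith("YES")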

Advantages of Model Graders

Flexible: Can evaluate any criteria you can describe
Nuanced: Captures subtle quality differences
Scalable: Handles diverse output types
Consistent: More reliable than human graders at scale

Limitations

Costly: Requires additional API calls
Slower: Adds latency to evaluation pipeline
Variable: Can produce inconsistent scores across runs
Opaque: Harder to debug than code graders


Human Graders

Human graders provide the most flexibility but are time-consuming and tedious. They're the gold standard for subjective quality assessment.

Common Use Cases

  • General response quality: Overall impression and satisfaction
  • Comprehensiveness: Depth and breadth of coverage
  • Depth: Level of detail and insight
  • Conciseness: Efficiency of communication
  • Relevance: Alignment with user intent
  • Edge case validation: Unusual scenarios that automated graders miss

When to Use Human Graders

  • High-stakes applications: Medical, legal, or safety-critical domains
  • Calibration: Establishing ground truth for training automated graders
  • Spot-checking: Validating automated grader accuracy
  • Final validation: Before production deployment

Advantages

Most accurate: Humans understand context and nuance best
Catches edge cases: Identifies issues automated systems miss
Builds intuition: Helps you understand failure modes

Limitations

Expensive: Requires paid human time
Slow: Bottleneck for rapid iteration
Inconsistent: Inter-rater reliability varies
Not scalable: Can't evaluate thousands of outputs
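One way to keep a human pass affordable is to review only a random sample alongside the automated scores. A minimal spot-checking loop (a sketch; it assumes the results list produced by run_eval later in this post):

import random

def spot_check(results, sample_size=5):
    """Shows a random sample of graded outputs for human review
    and records an optional human score next to the model's."""
    for result in random.sample(results, min(sample_size, len(results))):
        print("TASK:", result["test_case"]["task"])
        print("OUTPUT:", result["output"])
        print("MODEL SCORE:", result["score"])
        answer = input("Your score (1-10), or Enter to skip: ")
        if answer:
            result["human_score"] = int(answer)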


Defining Evaluation Criteria

Before implementing any grader, you need clear evaluation criteria. Vague criteria lead to unreliable scores.

Example: Code Generation Prompt

For a prompt that generates AWS-specific code, you might define:

1. Format Compliance

  • Requirement: Should return only Python, JSON, or Regex without explanation
  • Grader type: Code grader
  • Scoring: -3 points for each explanatory sentence

2. Valid Syntax

  • Requirement: Produced code should have valid syntax
  • Grader type: Code grader (using ast.parse for Python, json.loads for JSON; see the sketch below)
  • Scoring: invalid syntax drops the score to the minimum; valid syntax costs nothing

3. Task Following

  • Requirement: Response should directly address the user's task with accurate code
  • Grader type: Model grader
  • Scoring: 1-10 scale based on correctness and completeness

The first two criteria work well with code graders because they're objective and rule-based. Task following is better suited for model graders due to the flexibility required to assess semantic correctness.
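As a concrete example, the syntax check in criterion 2 needs only the standard library. A minimal sketch (it assumes the output is bare code with any ``` fences already stripped):

import ast
import json
import re

def has_valid_syntax(output, language):
    """Returns True if the output parses cleanly in the given language.
    Sketch of criterion 2; assumes code fences are already stripped."""
    try:
        if language == "python":
            ast.parse(output)
        elif language == "json":
            json.loads(output)
        elif language == "regex":
            re.compile(output)
        return True
    except (SyntaxError, ValueError, re.error):
        return False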


Implementing a Model Grader

Here's how to build a robust model grader function:

import json

def grade_by_model(test_case, output):
    """
    Uses an LLM to evaluate the quality of generated output.
    Returns a structured evaluation with score, strengths, weaknesses,
    and reasoning. The add_user_message, add_assistant_message, and chat
    helpers are thin wrappers around the Anthropic SDK; their definitions
    appear in the full example at the end of this post.
    """
    
    # Create evaluation prompt
    eval_prompt = f"""
You are an expert code reviewer. Evaluate this AI-generated solution.

**Task**: {test_case["task"]}

**Solution**: 
{output}

Provide your evaluation as a structured JSON object with:
- "strengths": An array of 1-3 key strengths
- "weaknesses": An array of 1-3 key areas for improvement  
- "reasoning": A concise explanation of your assessment
- "score": A number between 1-10 (where 10 is perfect)

Evaluation criteria:
1. Correctness: Does the code solve the task accurately?
2. Format: Is it clean output without unnecessary explanation?
3. Best practices: Does it follow AWS/Python conventions?
"""
    
    messages = []
    add_user_message(messages, eval_prompt)
    add_assistant_message(messages, "```json")
    
    eval_text = chat(messages, stop_sequences=["```"])
    return json.loads(eval_text)

Key Design Decisions

1. Ask for Structured Output

Using JSON ensures consistent parsing and makes it easy to extract the score programmatically.

2. Request Strengths and Weaknesses

Without this context, models tend to default to middling scores around 6. Asking for specific strengths and weaknesses forces the model to justify its score.

3. Include Reasoning

The reasoning field helps you understand why a score was assigned, making it easier to debug unexpected results.

4. Use Prefilling and Stop Sequences

Starting the assistant message with "```json" and using "```" as a stop sequence ensures clean JSON output.
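Concretely, the conversation the helpers assemble looks roughly like this:

messages = [
    {"role": "user", "content": eval_prompt},
    # Prefill: the model continues from here, so its reply begins
    # directly inside the JSON block
    {"role": "assistant", "content": "```json"},
]
# stop_sequences=["```"] halts generation at the closing fence,
# leaving eval_text as a bare JSON object ready for json.loads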


Example Model Grader Output

{
  "strengths": [
    "Code is syntactically correct and uses boto3 properly",
    "Function includes proper error handling",
    "Follows Python naming conventions"
  ],
  "weaknesses": [
    "Includes unnecessary explanatory comments",
    "Example usage section wasn't requested",
    "Could be more concise"
  ],
  "reasoning": "The solution correctly solves the task and demonstrates good coding practices, but includes extra content beyond what was requested. The code itself is solid, but the format doesn't match the requirement for clean output only.",
  "score": 7
}

Integrating Grading into Your Workflow

Update your run_test_case function to call the grader:

def run_test_case(test_case):
    """Runs a single test case and grades the result"""
    output = run_prompt(test_case)
    
    # Grade the output using model grader
    model_grade = grade_by_model(test_case, output)
    score = model_grade["score"]
    reasoning = model_grade["reasoning"]
    strengths = model_grade["strengths"]
    weaknesses = model_grade["weaknesses"]
    
    return {
        "output": output, 
        "test_case": test_case, 
        "score": score,
        "reasoning": reasoning,
        "strengths": strengths,
        "weaknesses": weaknesses
    }

Calculating Average Scores

Finally, calculate an average score across all test cases to get a single metric for your prompt's performance:

from statistics import mean

def run_eval(dataset):
    """Runs evaluation on entire dataset and returns results with average score"""
    results = []
    
    for test_case in dataset:
        result = run_test_case(test_case)
        results.append(result)
    
    # Calculate average score
    average_score = mean([result["score"] for result in results])
    print(f"📊 Average score: {average_score:.2f}/10")
    
    # Show score distribution
    scores = [result["score"] for result in results]
    print(f"   Score range: {min(scores)} - {max(scores)}")
    print(f"   Total test cases: {len(results)}")
    
    return results

Example Output

📊 Average score: 7.33/10
   Score range: 6 - 9
   Total test cases: 3

This gives you an objective metric to track as you iterate on your prompt. While model graders can be somewhat variable, they provide a consistent baseline for measuring improvements.
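When the average dips, it helps to read the lowest-scoring cases first. A short snippet over the results list surfaces them:

# Surface the weakest outputs for inspection, lowest score first
worst = sorted(results, key=lambda r: r["score"])[:3]
for result in worst:
    print(result["score"], "-", result["reasoning"])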


Combining Code and Model Graders

For the most robust evaluation, combine both approaches:

def grade_combined(test_case, output):
    """Combines code-based and model-based grading"""
    
    # Code grader: Check format compliance
    code_grade = grade_by_code(test_case, output)
    
    # Model grader: Check task completion and quality
    model_grade = grade_by_model(test_case, output)
    
    # Weighted average (60% model, 40% code)
    final_score = (model_grade["score"] * 0.6) + (code_grade["score"] * 0.4)
    
    return {
        "score": final_score,
        "code_grade": code_grade,
        "model_grade": model_grade
    }

This hybrid approach leverages the strengths of both methods:

  • Code graders catch format violations quickly and cheaply
  • Model graders assess semantic quality and task completion
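Wiring the hybrid grader into the pipeline is a small change to run_test_case; one way it might look, reusing the functions defined above:

def run_test_case(test_case):
    """Same as before, but scored with the hybrid grader"""
    output = run_prompt(test_case)
    combined = grade_combined(test_case, output)
    
    return {
        "output": output,
        "test_case": test_case,
        "score": combined["score"],
        "code_grade": combined["code_grade"],
        "model_grade": combined["model_grade"]
    }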

Best Practices for Model Graders

1. Be Specific in Evaluation Criteria

Vague instructions like "evaluate quality" produce inconsistent results. Instead:

eval_prompt = """
Evaluate on these specific criteria:
1. Correctness (0-4 points): Does the code work?
2. Format (0-3 points): Is it clean without explanation?
3. Best practices (0-3 points): Does it follow conventions?

Sum the points for a total score out of 10.
"""

2. Use Temperature 0 for Consistency

Lower temperature reduces variability in grading:

eval_text = chat(messages, temperature=0.0, stop_sequences=["```"])

3. Provide Examples of Good and Bad Outputs

Few-shot examples calibrate the grader:

eval_prompt = """
Example of a 10/10 response:
```python
import boto3
s3 = boto3.client('s3')
buckets = s3.list_buckets()

Example of a 5/10 response: Here's how to list S3 buckets. First, you need to import boto3... [includes unnecessary explanation]

Now evaluate this response: {output} """


### 4. **Validate Grader Reliability**
Run the same output through the grader multiple times to check consistency:

```python
scores = [grade_by_model(test_case, output)["score"] for _ in range(5)]
print(f"Score variance: {max(scores) - min(scores)}")

If variance is high (>2 points), refine your evaluation prompt.


Full Code Example

import json
from statistics import mean
from anthropic import Anthropic

client = Anthropic(api_key="your-api-key")
model = "claude-3-haiku-20240307"

def add_user_message(messages, text):
    """Appends a user turn to the conversation"""
    messages.append({"role": "user", "content": text})

def add_assistant_message(messages, text):
    """Appends an assistant turn (used below to prefill the reply)"""
    messages.append({"role": "assistant", "content": text})

def chat(messages, temperature=1.0, stop_sequences=None):
    """Sends the conversation to Claude and returns the reply text"""
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=messages,
        temperature=temperature,
        stop_sequences=stop_sequences or [],
    )
    return response.content[0].text

def run_prompt(test_case):
    """Runs the prompt under evaluation for one test case. A minimal
    stand-in for the run_prompt built earlier in this series."""
    messages = []
    add_user_message(messages, test_case["task"])
    return chat(messages)

def grade_by_model(test_case, output):
    eval_prompt = f"""
You are an expert code reviewer. Evaluate this AI-generated solution.

**Task**: {test_case["task"]}

**Solution**: 
{output}

Provide your evaluation as a structured JSON object with:
- "strengths": An array of 1-3 key strengths
- "weaknesses": An array of 1-3 key areas for improvement  
- "reasoning": A concise explanation of your assessment
- "score": A number between 1-10

Evaluation criteria:
1. Correctness: Does the code solve the task accurately?
2. Format: Is it clean output without unnecessary explanation?
3. Best practices: Does it follow conventions?
"""
    
    messages = []
    add_user_message(messages, eval_prompt)
    add_assistant_message(messages, "```json")
    
    eval_text = chat(messages, temperature=0.0, stop_sequences=["```"])
    return json.loads(eval_text)

def run_test_case(test_case):
    output = run_prompt(test_case)
    model_grade = grade_by_model(test_case, output)
    
    return {
        "output": output, 
        "test_case": test_case, 
        "score": model_grade["score"],
        "reasoning": model_grade["reasoning"],
        "strengths": model_grade["strengths"],
        "weaknesses": model_grade["weaknesses"]
    }

def run_eval(dataset):
    results = []
    
    for test_case in dataset:
        result = run_test_case(test_case)
        results.append(result)
    
    average_score = mean([result["score"] for result in results])
    print(f"📊 Average score: {average_score:.2f}/10")
    
    return results

# Run evaluation
with open("dataset.json", "r") as f:
    dataset = json.load(f)

results = run_eval(dataset)

Key Takeaways

  1. Choose the right grader type: Code graders for format/syntax, model graders for semantic quality, human graders for high-stakes validation
  2. Define clear criteria: Vague evaluation instructions produce unreliable scores
  3. Request structured output: JSON with score, reasoning, strengths, and weaknesses
  4. Combine approaches: Hybrid grading leverages strengths of multiple methods
  5. Validate consistency: Test grader reliability before trusting scores
  6. Track average scores: Use as your primary metric for prompt iteration

Model graders transform your evaluation pipeline from a data collection tool into a measurement system, enabling objective comparison of prompt versions and data-driven optimization.


Downloads

📓 001_prompt_evals_grader.ipynb — Complete Jupyter notebook with grading implementation


Next up: Learn how to iterate on prompts using grading feedback to systematically improve your results.