
Model-Based Grading for Prompt Evaluation

When building prompt evaluation workflows, grading systems provide objective signals about output quality. A grader takes model output and returns some kind of measurable feedback—typically a number between 1 and 10, where 10 represents high quality and 1 represents poor quality.

Grading is the critical component that transforms your evaluation pipeline from a data collection tool into a measurement system. Without grading, you can see what your prompt produces, but you can't objectively compare different versions or track improvements over time.


Types of Graders

There are three main approaches to grading model outputs:

1. Code Graders

Programmatically evaluate outputs using custom logic

2. Model Graders

Use another AI model to assess the quality

3. Human Graders

Have people manually review and score outputs

Each approach has distinct strengths and use cases. Let's explore them in detail.


Code Graders

Code graders let you implement any programmatic check you can imagine. They're fast, deterministic, and cost-free once implemented.

Common Use Cases

  • Checking output length: Ensure responses aren't too brief or verbose
  • Verifying output does/doesn't have certain words: Flag unwanted phrases or required terminology
  • Syntax validation: Check if JSON, Python, or regex is valid
  • Readability scores: Calculate Flesch-Kincaid or similar metrics
  • Format compliance: Verify outputs match expected structure

Example: Simple Code Grader

def grade_by_code(test_case, output):
    score = 10
    issues = []
    
    # Check 1: Output shouldn't be too short
    if len(output) < 20:
        score -= 3
        issues.append("Output too brief")
    
    # Check 2: Should not include explanatory text
    forbidden_phrases = ["Here's", "Let me explain", "This code"]
    for phrase in forbidden_phrases:
        if phrase in output:
            score -= 2
            issues.append(f"Contains explanatory phrase: '{phrase}'")
    
    # Check 3: Should contain code block markers
    if "```" not in output:
        score -= 1
        issues.append("Missing code block formatting")
    
    return {
        "score": max(1, score),
        "issues": issues
    }
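The other use cases in the list above follow the same pattern. A readability check, for instance, might lean on the third-party textstat package (an assumption here; any readability library would do):

import textstat  # third-party: pip install textstat

def grade_readability(output):
    """Sketch of a readability check: scores the output by its
    Flesch Reading Ease (higher means easier to read)."""
    ease = textstat.flesch_reading_ease(output)
    # Thresholds are illustrative; tune them to your audience
    score = 10 if ease >= 60 else 7 if ease >= 30 else 4
    return {"score": score, "flesch_reading_ease": ease}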

Advantages of Code Graders

Fast: No API calls, instant results
Deterministic: Same input always produces same score
Free: No additional costs
Transparent: Easy to understand why a score was assigned

Limitations

Rigid: Can't assess nuanced quality or semantic correctness
Brittle: May miss edge cases or produce false positives
Limited scope: Best for format/structure, not content quality


Model Graders

Model graders feed your original output into another API call for evaluation. This approach offers tremendous flexibility for assessing subjective or complex criteria.

Common Use Cases

  • Response quality: Overall coherence and usefulness
  • Instruction following: How well the output matches the request
  • Completeness: Whether all aspects of the task are addressed
  • Helpfulness: Practical value to the end user
  • Safety: Absence of harmful, biased, or inappropriate content
  • Accuracy: Factual correctness (when ground truth is available)

Why Model Graders Work

LLMs excel at nuanced evaluation tasks that would be difficult to encode in rules:

  • Understanding semantic similarity
  • Assessing tone and style
  • Evaluating logical coherence
  • Judging task completion
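For instance, deciding whether two answers mean the same thing is nearly impossible with string matching but is a single question for a model grader. A minimal sketch, using the same add_user_message and chat helpers as the grader implementation below:

def same_meaning(answer, reference):
    """Asks the model a yes/no semantic-equivalence question,
    a check that exact string comparison can't express."""
    messages = []
    add_user_message(
        messages,
        f"Do these two answers convey the same meaning? "
        f"Reply with only YES or NO.\n\nA: {answer}\n\nB: {reference}"
    )
    return chat(messages).strip().upper().startswith("YES")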

Advantages of Model Graders

Flexible: Can evaluate any criteria you can describe
Nuanced: Captures subtle quality differences
Scalable: Handles diverse output types
Consistent: More reliable than human graders at scale

Limitations

Costly: Requires additional API calls
Slower: Adds latency to evaluation pipeline
Variable: Can produce inconsistent scores across runs
Opaque: Harder to debug than code graders


Human Graders

Human graders provide the most flexibility but are time-consuming and tedious. They're the gold standard for subjective quality assessment.

Common Use Cases

  • General response quality: Overall impression and satisfaction
  • Comprehensiveness: Depth and breadth of coverage
  • Depth: Level of detail and insight
  • Conciseness: Efficiency of communication
  • Relevance: Alignment with user intent
  • Edge case validation: Unusual scenarios that automated graders miss

When to Use Human Graders

  • High-stakes applications: Medical, legal, or safety-critical domains
  • Calibration: Establishing ground truth for training automated graders
  • Spot-checking: Validating automated grader accuracy
  • Final validation: Before production deployment

Advantages

Most accurate: Humans understand context and nuance best
Catches edge cases: Identifies issues automated systems miss
Builds intuition: Helps you understand failure modes

Limitations

Expensive: Requires paid human time
Slow: Bottleneck for rapid iteration
Inconsistent: Inter-rater reliability varies
Not scalable: Can't evaluate thousands of outputs
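One way to keep a human pass affordable is to review only a random sample alongside the automated scores. A minimal spot-checking loop (a sketch; it assumes the results list produced by run_eval later in this post):

import random

def spot_check(results, sample_size=5):
    """Shows a random sample of graded outputs for human review
    and records an optional human score next to the model's."""
    for result in random.sample(results, min(sample_size, len(results))):
        print("TASK:", result["test_case"]["task"])
        print("OUTPUT:", result["output"])
        print("MODEL SCORE:", result["score"])
        answer = input("Your score (1-10), or Enter to skip: ")
        if answer:
            result["human_score"] = int(answer)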


Defining Evaluation Criteria

Before implementing any grader, you need clear evaluation criteria. Vague criteria lead to unreliable scores.

Example: Code Generation Prompt

For a prompt that generates AWS-specific code, you might define:

1. Format Compliance

  • Requirement: Should return only Python, JSON, or Regex without explanation
  • Grader type: Code grader
  • Scoring: -3 points for each explanatory sentence

2. Valid Syntax

  • Requirement: Produced code should have valid syntax
  • Grader type: Code grader (using ast.parse for Python, json.loads for JSON; see the sketch below)
  • Scoring: invalid syntax drops the score to the minimum; valid syntax costs nothing

3. Task Following

  • Requirement: Response should directly address the user's task with accurate code
  • Grader type: Model grader
  • Scoring: 1-10 scale based on correctness and completeness

The first two criteria work well with code graders because they're objective and rule-based. Task following is better suited for model graders due to the flexibility required to assess semantic correctness.
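As a concrete example, the syntax check in criterion 2 needs only the standard library. A minimal sketch (it assumes the output is bare code with any ``` fences already stripped):

import ast
import json
import re

def has_valid_syntax(output, language):
    """Returns True if the output parses cleanly in the given language.
    Sketch of criterion 2; assumes code fences are already stripped."""
    try:
        if language == "python":
            ast.parse(output)
        elif language == "json":
            json.loads(output)
        elif language == "regex":
            re.compile(output)
        return True
    except (SyntaxError, ValueError, re.error):
        return False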


Implementing a Model Grader

Here's how to build a robust model grader function:

import json

def grade_by_model(test_case, output):
    """
    Uses an LLM to evaluate the quality of generated output.
    Returns a structured evaluation with score, strengths, weaknesses,
    and reasoning. The add_user_message, add_assistant_message, and chat
    helpers are thin wrappers around the Anthropic SDK; their definitions
    appear in the full example at the end of this post.
    """
    
    # Create evaluation prompt
    eval_prompt = f"""
You are an expert code reviewer. Evaluate this AI-generated solution.

**Task**: {test_case["task"]}

**Solution**: 
{output}

Provide your evaluation as a structured JSON object with:
- "strengths": An array of 1-3 key strengths
- "weaknesses": An array of 1-3 key areas for improvement  
- "reasoning": A concise explanation of your assessment
- "score": A number between 1-10 (where 10 is perfect)

Evaluation criteria:
1. Correctness: Does the code solve the task accurately?
2. Format: Is it clean output without unnecessary explanation?
3. Best practices: Does it follow AWS/Python conventions?
"""
    
    messages = []
    add_user_message(messages, eval_prompt)
    add_assistant_message(messages, "```json")
    
    eval_text = chat(messages, stop_sequences=["```"])
    return json.loads(eval_text)

Key Design Decisions

1. Ask for Structured Output

Using JSON ensures consistent parsing and makes it easy to extract the score programmatically.

2. Request Strengths and Weaknesses

Without this context, models tend to default to middling scores around 6. Asking for specific strengths and weaknesses forces the model to justify its score.

3. Include Reasoning

The reasoning field helps you understand why a score was assigned, making it easier to debug unexpected results.

4. Use Prefilling and Stop Sequences

Starting the assistant message with "```json" and using "```" as a stop sequence ensures clean JSON output.
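Concretely, the conversation the helpers assemble looks roughly like this:

messages = [
    {"role": "user", "content": eval_prompt},
    # Prefill: the model continues from here, so its reply begins
    # directly inside the JSON block
    {"role": "assistant", "content": "```json"},
]
# stop_sequences=["```"] halts generation at the closing fence,
# leaving eval_text as a bare JSON object ready for json.loads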


Example Model Grader Output

{
  "strengths": [
    "Code is syntactically correct and uses boto3 properly",
    "Function includes proper error handling",
    "Follows Python naming conventions"
  ],
  "weaknesses": [
    "Includes unnecessary explanatory comments",
    "Example usage section wasn't requested",
    "Could be more concise"
  ],
  "reasoning": "The solution correctly solves the task and demonstrates good coding practices, but includes extra content beyond what was requested. The code itself is solid, but the format doesn't match the requirement for clean output only.",
  "score": 7
}

Integrating Grading into Your Workflow

Update your run_test_case function to call the grader:

def run_test_case(test_case):
    """Runs a single test case and grades the result"""
    output = run_prompt(test_case)
    
    # Grade the output using model grader
    model_grade = grade_by_model(test_case, output)
    score = model_grade["score"]
    reasoning = model_grade["reasoning"]
    strengths = model_grade["strengths"]
    weaknesses = model_grade["weaknesses"]
    
    return {
        "output": output, 
        "test_case": test_case, 
        "score": score,
        "reasoning": reasoning,
        "strengths": strengths,
        "weaknesses": weaknesses
    }

Calculating Average Scores

Finally, calculate an average score across all test cases to get a single metric for your prompt's performance:

from statistics import mean

def run_eval(dataset):
    """Runs evaluation on entire dataset and returns results with average score"""
    results = []
    
    for test_case in dataset:
        result = run_test_case(test_case)
        results.append(result)
    
    # Calculate average score
    average_score = mean([result["score"] for result in results])
    print(f"📊 Average score: {average_score:.2f}/10")
    
    # Show score distribution
    scores = [result["score"] for result in results]
    print(f"   Score range: {min(scores)} - {max(scores)}")
    print(f"   Total test cases: {len(results)}")
    
    return results

Example Output

📊 Average score: 7.33/10
   Score range: 6 - 9
   Total test cases: 3

This gives you an objective metric to track as you iterate on your prompt. While model graders can be somewhat variable, they provide a consistent baseline for measuring improvements.
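When the average dips, it helps to read the lowest-scoring cases first. A short snippet over the results list surfaces them:

# Surface the weakest outputs for inspection, lowest score first
worst = sorted(results, key=lambda r: r["score"])[:3]
for result in worst:
    print(result["score"], "-", result["reasoning"])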


Combining Code and Model Graders

For the most robust evaluation, combine both approaches:

def grade_combined(test_case, output):
    """Combines code-based and model-based grading"""
    
    # Code grader: Check format compliance
    code_grade = grade_by_code(test_case, output)
    
    # Model grader: Check task completion and quality
    model_grade = grade_by_model(test_case, output)
    
    # Weighted average (60% model, 40% code)
    final_score = (model_grade["score"] * 0.6) + (code_grade["score"] * 0.4)
    
    return {
        "score": final_score,
        "code_grade": code_grade,
        "model_grade": model_grade
    }

This hybrid approach leverages the strengths of both methods:

  • Code graders catch format violations quickly and cheaply
  • Model graders assess semantic quality and task completion
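Wiring the hybrid grader into the pipeline is a small change to run_test_case; one way it might look, reusing the functions defined above:

def run_test_case(test_case):
    """Same as before, but scored with the hybrid grader"""
    output = run_prompt(test_case)
    combined = grade_combined(test_case, output)
    
    return {
        "output": output,
        "test_case": test_case,
        "score": combined["score"],
        "code_grade": combined["code_grade"],
        "model_grade": combined["model_grade"]
    }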

Best Practices for Model Graders

1. Be Specific in Evaluation Criteria

Vague instructions like "evaluate quality" produce inconsistent results. Instead:

eval_prompt = """
Evaluate on these specific criteria:
1. Correctness (0-4 points): Does the code work?
2. Format (0-3 points): Is it clean without explanation?
3. Best practices (0-3 points): Does it follow conventions?

Sum the points for a total score out of 10.
"""

2. Use Temperature 0 for Consistency

Lower temperature reduces variability in grading:

eval_text = chat(messages, temperature=0.0, stop_sequences=["```"])

3. Provide Examples of Good and Bad Outputs

Few-shot examples calibrate the grader:

eval_prompt = """
Example of a 10/10 response:
```python
import boto3
s3 = boto3.client('s3')
buckets = s3.list_buckets()

Example of a 5/10 response: Here's how to list S3 buckets. First, you need to import boto3... [includes unnecessary explanation]

Now evaluate this response: {output} """


### 4. **Validate Grader Reliability**
Run the same output through the grader multiple times to check consistency:

```python
scores = [grade_by_model(test_case, output)["score"] for _ in range(5)]
print(f"Score variance: {max(scores) - min(scores)}")

If variance is high (>2 points), refine your evaluation prompt.


Full Code Example

import json
from statistics import mean
from anthropic import Anthropic

client = Anthropic(api_key="your-api-key")
model = "claude-3-haiku-20240307"

def add_user_message(messages, text):
    """Appends a user turn to the conversation"""
    messages.append({"role": "user", "content": text})

def add_assistant_message(messages, text):
    """Appends an assistant turn (used below to prefill the reply)"""
    messages.append({"role": "assistant", "content": text})

def chat(messages, temperature=1.0, stop_sequences=None):
    """Sends the conversation to Claude and returns the reply text"""
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=messages,
        temperature=temperature,
        stop_sequences=stop_sequences or [],
    )
    return response.content[0].text

def run_prompt(test_case):
    """Runs the prompt under evaluation for one test case. A minimal
    stand-in for the run_prompt built earlier in this series."""
    messages = []
    add_user_message(messages, test_case["task"])
    return chat(messages)

def grade_by_model(test_case, output):
    eval_prompt = f"""
You are an expert code reviewer. Evaluate this AI-generated solution.

**Task**: {test_case["task"]}

**Solution**: 
{output}

Provide your evaluation as a structured JSON object with:
- "strengths": An array of 1-3 key strengths
- "weaknesses": An array of 1-3 key areas for improvement  
- "reasoning": A concise explanation of your assessment
- "score": A number between 1-10

Evaluation criteria:
1. Correctness: Does the code solve the task accurately?
2. Format: Is it clean output without unnecessary explanation?
3. Best practices: Does it follow conventions?
"""
    
    messages = []
    add_user_message(messages, eval_prompt)
    add_assistant_message(messages, "```json")
    
    eval_text = chat(messages, temperature=0.0, stop_sequences=["```"])
    return json.loads(eval_text)

def run_test_case(test_case):
    output = run_prompt(test_case)
    model_grade = grade_by_model(test_case, output)
    
    return {
        "output": output, 
        "test_case": test_case, 
        "score": model_grade["score"],
        "reasoning": model_grade["reasoning"],
        "strengths": model_grade["strengths"],
        "weaknesses": model_grade["weaknesses"]
    }

def run_eval(dataset):
    results = []
    
    for test_case in dataset:
        result = run_test_case(test_case)
        results.append(result)
    
    average_score = mean([result["score"] for result in results])
    print(f"📊 Average score: {average_score:.2f}/10")
    
    return results

# Run evaluation
with open("dataset.json", "r") as f:
    dataset = json.load(f)

results = run_eval(dataset)

Key Takeaways

  1. Choose the right grader type: Code graders for format/syntax, model graders for semantic quality, human graders for high-stakes validation
  2. Define clear criteria: Vague evaluation instructions produce unreliable scores
  3. Request structured output: JSON with score, reasoning, strengths, and weaknesses
  4. Combine approaches: Hybrid grading leverages strengths of multiple methods
  5. Validate consistency: Test grader reliability before trusting scores
  6. Track average scores: Use as your primary metric for prompt iteration

Model graders transform your evaluation pipeline from a data collection tool into a measurement system, enabling objective comparison of prompt versions and data-driven optimization.


Downloads

📓 001_prompt_evals_grader.ipynb — Complete Jupyter notebook with grading implementation


Next up: Learn how to iterate on prompts using grading feedback to systematically improve your results.