Code-Based Grading for AI-Generated Code

When evaluating AI models that generate code, checking whether the response makes sense isn't enough. You also need to verify that the generated code has valid syntax and follows the requested format. This is where code-based grading comes in.

Code graders provide fast, deterministic, and cost-free validation of structural and syntactic correctness—essential for any code generation workflow. While model graders assess semantic quality, code graders ensure your outputs are technically sound.


How Code Grading Works

Code grading validates three key aspects of AI-generated responses:

1. Format Compliance

The response should contain only code of the requested type (Python, JSON, or Regex), without explanations

2. Valid Syntax

The generated code should actually parse correctly as the intended language

3. Task Following

The response should directly address what was asked and be accurate

The first two criteria are handled by the code grader, while task following is evaluated by the model grader. Together, they provide a comprehensive evaluation.


Why Code Grading Matters

The Problem

AI models often generate syntactically invalid code, especially when:

  • The task is complex or ambiguous
  • The prompt doesn't specify format requirements
  • The model adds explanatory text around the code
  • Edge cases trigger unusual output patterns
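
For example, a single line of explanatory prose is enough to make an otherwise valid snippet unparseable. A minimal sketch using Python's standard ast module (the sample text is illustrative):

import ast

# A typical "chatty" response: prose wrapped around otherwise valid code
chatty_output = "Here's the function you asked for:\n\ndef add(a, b):\n    return a + b"

try:
    ast.parse(chatty_output)
except SyntaxError as e:
    # The apostrophe in "Here's" opens an unterminated string literal,
    # so the entire response fails to parse
    print(f"invalid Python: {e}")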

The Solution

Code graders catch these issues immediately:

  • Instant feedback: No need to manually test every output
  • Zero cost: No API calls required
  • Deterministic: The same input always produces the same result
  • Objective: Either the code parses or it doesn't


Syntax Validation Functions

To check if generated code has valid syntax, you can create three helper functions that attempt to parse the output:

JSON Validator

import json

def validate_json(text):
    """
    Validates JSON syntax by attempting to parse the text.
    Returns 10 if valid, 0 if invalid.
    """
    try:
        json.loads(text.strip())
        return 10
    except json.JSONDecodeError:
        return 0

Python Validator

import ast

def validate_python(text):
    """
    Validates Python syntax using the AST parser.
    Returns 10 if valid, 0 if invalid.
    """
    try:
        ast.parse(text.strip())
        return 10
    except SyntaxError:
        return 0

Regex Validator

import re

def validate_regex(text):
    """
    Validates regex syntax by attempting to compile the pattern.
    Returns 10 if valid, 0 if invalid.
    """
    try:
        re.compile(text.strip())
        return 10
    except re.error:
        return 0

How These Work

Each function tries to parse the text as its respective format:

  1. JSON: Uses json.loads() to parse the string
  2. Python: Uses ast.parse() to build an abstract syntax tree
  3. Regex: Uses re.compile() to compile the pattern

If parsing succeeds, the function returns a perfect score of 10. If parsing raises an error, the syntax is invalid and the function returns 0.
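
A few hand-written samples (illustrative, not from the dataset) confirm the pass/fail behavior:

print(validate_json('{"region": "us-east-1"}'))   # 10: parses as JSON
print(validate_json('{region: us-east-1}'))       # 0: unquoted keys are invalid
print(validate_python("def f(x): return x * 2"))  # 10: valid Python
print(validate_python("def f(x) return x"))       # 0: missing colon
print(validate_regex(r"arn:aws:[\w-]+"))          # 10: compiles
print(validate_regex(r"arn:aws:[\w-"))            # 0: unterminated character set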


Building the Code Grader

Now let's combine these validators into a unified code grader:

def grade_syntax(output, test_case):
    """
    Grades the syntax validity of generated code based on expected format.
    Returns a score from 0-10.
    """
    # Extract the expected format from test case
    expected_format = test_case.get("format", "").lower()
    
    # Extract code from markdown blocks if present
    code = extract_code_from_markdown(output)
    
    # Validate based on format
    if expected_format == "python":
        return validate_python(code)
    elif expected_format == "json":
        return validate_json(code)
    elif expected_format == "regex":
        return validate_regex(code)
    else:
        # Unknown format, can't validate
        return 5  # Neutral score

Helper: Extract Code from Markdown

AI models often wrap code in markdown blocks. This helper extracts the raw code:

import re

def extract_code_from_markdown(text):
    """
    Extracts code from markdown code blocks.
    If no code block found, returns the original text.
    """
    # Match ```language\ncode\n``` pattern
    pattern = r'```(?:\w+)?\n(.*?)\n```'
    match = re.search(pattern, text, re.DOTALL)
    
    if match:
        return match.group(1)
    
    # No code block found, return original
    return text
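
A quick check of both branches (the sample strings are illustrative):

wrapped = "```python\ndef hello():\n    return 'hi'\n```"
print(extract_code_from_markdown(wrapped))
# def hello():
#     return 'hi'

bare = '{"name": "example"}'
print(extract_code_from_markdown(bare))  # no code block, returned unchanged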

Dataset Format Requirements

For the code grader to know which validator to use, your test cases need to specify the expected output format:

Example Test Case

{
    "task": "Create a Python function to validate an AWS IAM username",
    "format": "python"
}

Updated Dataset Generation

You can update your dataset generation prompt to automatically include this format field:

def generate_dataset():
    prompt = """
Generate an evaluation dataset for a prompt evaluation. The dataset will be used to evaluate prompts that generate Python, JSON, or Regex specifically for AWS-related tasks.

Generate an array of JSON objects with this structure:
```json
[
  {
    "task": "Description of task",
    "format": "python" | "json" | "regex"
  },
  ...additional
]
```

* Focus on tasks that can be solved by writing a single Python function, a single JSON object, or a single regex
* Distribute evenly across all three formats
* Focus on tasks that do not require writing much code

Please generate 6 objects (2 Python, 2 JSON, 2 Regex).
"""

    messages = []
    add_user_message(messages, prompt)
    add_assistant_message(messages, "```json")
    text = chat(messages, stop_sequences=["```"])
    return json.loads(text)

Example Generated Dataset

[
  {
    "task": "Write a Python function that lists all S3 buckets in an AWS account",
    "format": "python"
  },
  {
    "task": "Create a JSON configuration for an AWS Lambda function with 512MB memory and a 30-second timeout",
    "format": "json"
  },
  {
    "task": "Write a regex pattern to validate AWS ARN format for S3 buckets",
    "format": "regex"
  },
  {
    "task": "Write a Python function to parse an AWS CloudWatch log timestamp",
    "format": "python"
  },
  {
    "task": "Create a JSON IAM policy that allows read-only access to a specific S3 bucket",
    "format": "json"
  },
  {
    "task": "Write a regex to extract the region from an AWS ARN",
    "format": "regex"
  }
]

Improving Prompt Clarity

To get better results from your AI model, make your prompt instructions more specific about the expected output format:

Updated Prompt Template

def run_prompt(test_case):
    """Merges the prompt and test case input, then returns the result"""
    prompt = f"""
Please solve the following task:

{test_case["task"]}

Important instructions:
* Respond only with Python, JSON, or a plain Regex
* Do not add any comments or commentary or explanation
* Return clean, executable code only
"""
    
    messages = []
    add_user_message(messages, prompt)
    
    # Prefill with code block to encourage clean output
    add_assistant_message(messages, "```code")
    
    output = chat(messages, stop_sequences=["```"])
    return output

Why This Works

  1. Explicit format requirement: "Respond only with Python, JSON, or a plain Regex"
  2. No explanations: "Do not add any comments or commentary or explanation"
  3. Prefilled code block: Starting with ```code signals the model to generate code immediately
  4. Stop sequence: Ending at ``` ensures we capture only the code content
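
Conceptually, the exchange looks like the sketch below. The exact message shape depends on the helpers from the earlier guides; the role/content structure here is an assumption for illustration:

# Assumed message structure, for illustration only
messages = [
    {"role": "user", "content": "Please solve the following task: ..."},
    {"role": "assistant", "content": "```code"},  # prefill: the model continues from here
]
# Generation picks up right after "```code" and halts at the next "```"
# (the stop sequence), so the returned text is the bare code body with
# no opening fence, no closing fence, and no surrounding commentary.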

Combining Scores

The final step is merging the model grader score with the code grader score. A simple approach is to take the average:

def run_test_case(test_case):
    """Runs a single test case and grades the result with both graders"""
    output = run_prompt(test_case)
    
    # Model grader: Evaluate task completion and quality
    model_grade = grade_by_model(test_case, output)
    model_score = model_grade["score"]
    
    # Code grader: Validate syntax
    syntax_score = grade_syntax(output, test_case)
    
    # Combined score (equal weight)
    score = (model_score + syntax_score) / 2
    
    return {
        "output": output,
        "test_case": test_case,
        "score": score,
        "model_score": model_score,
        "syntax_score": syntax_score,
        "reasoning": model_grade["reasoning"]
    }

Weighted Scoring

You might adjust these weights based on what matters more for your specific use case:

# Prioritize syntax correctness (70% syntax, 30% quality)
score = (syntax_score * 0.7) + (model_score * 0.3)

# Prioritize task completion (30% syntax, 70% quality)
score = (syntax_score * 0.3) + (model_score * 0.7)

# Require both (minimum of the two)
score = min(syntax_score, model_score)
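
To see how the strategies diverge, consider a response that completes the task well but doesn't parse (scores are illustrative):

model_score, syntax_score = 8, 0

print((model_score + syntax_score) / 2)            # 4.0 -> averaging still rewards it
print((syntax_score * 0.7) + (model_score * 0.3))  # 2.4 -> syntax-heavy weighting punishes it harder
print(min(syntax_score, model_score))              # 0   -> min() treats broken code as total failure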

Example Evaluation Results

Before Prompt Improvements

📊 Evaluation Results (Baseline Prompt)

Test Case 1: Python S3 bucket lister
  Model Score: 8/10
  Syntax Score: 0/10 (SyntaxError: includes explanatory text)
  Combined: 4/10

Test Case 2: JSON Lambda config
  Model Score: 9/10
  Syntax Score: 10/10
  Combined: 9.5/10

Test Case 3: Regex ARN validator
  Model Score: 7/10
  Syntax Score: 0/10 (re.error: invalid escape sequence)
  Combined: 3.5/10

Average Score: 5.67/10

After Prompt Improvements

📊 Evaluation Results (Improved Prompt)

Test Case 1: Python S3 bucket lister
  Model Score: 9/10
  Syntax Score: 10/10
  Combined: 9.5/10

Test Case 2: JSON Lambda config
  Model Score: 9/10
  Syntax Score: 10/10
  Combined: 9.5/10

Test Case 3: Regex ARN validator
  Model Score: 8/10
  Syntax Score: 10/10
  Combined: 9/10

Average Score: 9.33/10 (+3.66 improvement)

Testing Your Implementation

Once you've implemented code grading, run your evaluation to get a baseline score:

# Load dataset
with open("dataset.json", "r") as f:
    dataset = json.load(f)

# Run evaluation
results = run_eval(dataset)

# Analyze results
print("\n📊 Detailed Results:\n")
for i, result in enumerate(results, 1):
    print(f"Test Case {i}: {result['test_case']['task'][:50]}...")
    print(f"  Model Score: {result['model_score']}/10")
    print(f"  Syntax Score: {result['syntax_score']}/10")
    print(f"  Combined: {result['score']}/10")
    if result['syntax_score'] == 0:
        print(f"  ⚠️  Syntax validation failed!")
    print()

What to Look For

  • Syntax failures: If syntax_score is 0, the code doesn't parse
  • Score distribution: Are certain formats (Python/JSON/Regex) performing worse?
  • Improvement opportunities: Low model scores suggest prompt clarity issues
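
To answer the second question, a small breakdown helper (hypothetical, but built on the result dicts returned by run_test_case above) can average the combined score per format:

from collections import defaultdict
from statistics import mean

def scores_by_format(results):
    """Average combined score per expected output format."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r["test_case"]["format"]].append(r["score"])
    return {fmt: mean(scores) for fmt, scores in buckets.items()}

print(scores_by_format(results))  # e.g. {'python': 9.5, 'json': 9.5, 'regex': 9.0}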

Full Code Example

Here's the complete code grading implementation:

import json
import ast
import re
from statistics import mean

def validate_json(text):
    try:
        json.loads(text.strip())
        return 10
    except json.JSONDecodeError:
        return 0

def validate_python(text):
    try:
        ast.parse(text.strip())
        return 10
    except SyntaxError:
        return 0

def validate_regex(text):
    try:
        re.compile(text.strip())
        return 10
    except re.error:
        return 0

def extract_code_from_markdown(text):
    pattern = r'```(?:\w+)?\n(.*?)\n```'
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else text

def grade_syntax(output, test_case):
    expected_format = test_case.get("format", "").lower()
    code = extract_code_from_markdown(output)
    
    if expected_format == "python":
        return validate_python(code)
    elif expected_format == "json":
        return validate_json(code)
    elif expected_format == "regex":
        return validate_regex(code)
    else:
        return 5

def run_prompt(test_case):
    prompt = f"""
Please solve the following task:

{test_case["task"]}

Important instructions:
* Respond only with Python, JSON, or a plain Regex
* Do not add any comments or commentary or explanation
* Return clean, executable code only
"""
    
    messages = []
    add_user_message(messages, prompt)
    add_assistant_message(messages, "```code")
    output = chat(messages, stop_sequences=["```"])
    return output

def run_test_case(test_case):
    output = run_prompt(test_case)
    
    model_grade = grade_by_model(test_case, output)
    model_score = model_grade["score"]
    syntax_score = grade_syntax(output, test_case)
    score = (model_score + syntax_score) / 2
    
    return {
        "output": output,
        "test_case": test_case,
        "score": score,
        "model_score": model_score,
        "syntax_score": syntax_score,
        "reasoning": model_grade["reasoning"]
    }

def run_eval(dataset):
    results = []
    
    for test_case in dataset:
        result = run_test_case(test_case)
        results.append(result)
    
    average_score = mean([result["score"] for result in results])
    print(f"📊 Average score: {average_score:.2f}/10")
    
    # Show syntax pass rate
    syntax_passes = sum(1 for r in results if r["syntax_score"] == 10)
    print(f"✅ Syntax validation: {syntax_passes}/{len(results)} passed")
    
    return results

Key Takeaways

  1. Code graders validate structure: Use them for syntax, format, and technical correctness
  2. Model graders validate semantics: Use them for task completion and quality
  3. Combine both for comprehensive evaluation: Syntax + quality = robust assessment
  4. Specify format in test cases: Include "format": "python" so the grader knows what to validate
  5. Improve prompts iteratively: Use scores to measure progress objectively
  6. Prefill for cleaner output: Starting with ```code reduces explanatory text

The score itself isn't inherently good or bad—what matters is whether you can improve it by refining your prompts. This gives you a quantitative way to measure prompt engineering progress rather than relying on subjective assessment.


Next Steps

Now that you have both code and model grading in place, you can:

  1. Run baseline evaluation: Get your starting score
  2. Identify failure patterns: Which formats or tasks perform worst?
  3. Iterate on prompts: Add instructions, examples, or constraints
  4. Re-evaluate: Measure improvement objectively
  5. Deploy with confidence: Know your prompt works across diverse inputs

Continue to the next guide on iterating and optimizing prompts using evaluation feedback to systematically improve your results.