# Code-Based Grading for AI-Generated Code

*Anablock | AI Insights & Innovations*
When evaluating AI models that generate code, you need more than just checking if the response makes sense. You also need to verify that the generated code actually has valid syntax and follows the correct format. This is where code-based grading comes in.
Code graders provide fast, deterministic, and cost-free validation of structural and syntactic correctness—essential for any code generation workflow. While model graders assess semantic quality, code graders ensure your outputs are technically sound.
## How Code Grading Works
Code grading validates three key aspects of AI-generated responses:
1. **Format Compliance** — the response should return only the requested code type (Python, JSON, or Regex) without explanations
2. **Valid Syntax** — the generated code should actually parse correctly as the intended language
3. **Task Following** — the response should directly address what was asked and be accurate
The first two criteria are handled by the code grader, while task following is evaluated by the model grader. Together, they provide a comprehensive evaluation.
## Why Code Grading Matters

### The Problem
AI models often generate syntactically invalid code, especially when:
- The task is complex or ambiguous
- The prompt doesn't specify format requirements
- The model adds explanatory text around the code
- Edge cases trigger unusual output patterns
### The Solution
Code graders catch these issues immediately:
- ✅ **Instant feedback:** No need to manually test every output
- ✅ **Zero cost:** No API calls required
- ✅ **Deterministic:** Same input always produces the same result
- ✅ **Objective:** Either the code parses or it doesn't
## Syntax Validation Functions
To check if generated code has valid syntax, you can create three helper functions that attempt to parse the output:
### JSON Validator

```python
import json

def validate_json(text):
    """
    Validates JSON syntax by attempting to parse the text.
    Returns 10 if valid, 0 if invalid.
    """
    try:
        json.loads(text.strip())
        return 10
    except json.JSONDecodeError:
        return 0
```
### Python Validator

```python
import ast

def validate_python(text):
    """
    Validates Python syntax using the AST parser.
    Returns 10 if valid, 0 if invalid.
    """
    try:
        ast.parse(text.strip())
        return 10
    except SyntaxError:
        return 0
```
### Regex Validator

```python
import re

def validate_regex(text):
    """
    Validates regex syntax by attempting to compile the pattern.
    Returns 10 if valid, 0 if invalid.
    """
    try:
        re.compile(text.strip())
        return 10
    except re.error:
        return 0
```
### How These Work
Each function tries to parse the text as its respective format:
- **JSON:** Uses `json.loads()` to parse the string
- **Python:** Uses `ast.parse()` to build an abstract syntax tree
- **Regex:** Uses `re.compile()` to compile the pattern
If parsing succeeds, it returns a perfect score of 10. If it fails with an error, the syntax is invalid and returns 0.
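A quick sanity check makes the pass/fail behavior concrete. Two of the validators are redefined here so the snippet runs on its own; the example inputs are illustrative:

```python
import json
import re

def validate_json(text):
    """Return 10 if the text parses as JSON, else 0 (copy of the validator above)."""
    try:
        json.loads(text.strip())
        return 10
    except json.JSONDecodeError:
        return 0

def validate_regex(text):
    """Return 10 if the text compiles as a regex, else 0 (copy of the validator above)."""
    try:
        re.compile(text.strip())
        return 10
    except re.error:
        return 0

# Valid inputs score 10, invalid inputs score 0
print(validate_json('{"region": "us-east-1"}'))       # 10
print(validate_json('{region: us-east-1}'))           # 0 (unquoted keys are invalid JSON)
print(validate_regex(r'^arn:aws:s3:::[a-z0-9.-]+$'))  # 10
print(validate_regex('([unclosed'))                   # 0 (unterminated group)
```

Note that these functions only check syntax, not behavior: `validate_python` will happily score code that parses but crashes at runtime.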
## Building the Code Grader
Now let's combine these validators into a unified code grader:
```python
def grade_syntax(output, test_case):
    """
    Grades the syntax validity of generated code based on expected format.
    Returns a score from 0-10.
    """
    # Extract the expected format from the test case
    expected_format = test_case.get("format", "").lower()

    # Extract code from markdown blocks if present
    code = extract_code_from_markdown(output)

    # Validate based on format
    if expected_format == "python":
        return validate_python(code)
    elif expected_format == "json":
        return validate_json(code)
    elif expected_format == "regex":
        return validate_regex(code)
    else:
        # Unknown format, can't validate
        return 5  # Neutral score
```
### Helper: Extract Code from Markdown
AI models often wrap code in markdown blocks. This helper extracts the raw code:
```python
import re

def extract_code_from_markdown(text):
    """
    Extracts code from markdown code blocks.
    If no code block is found, returns the original text.
    """
    # Match the ```language\ncode\n``` pattern
    pattern = r'```(?:\w+)?\n(.*?)\n```'
    match = re.search(pattern, text, re.DOTALL)
    if match:
        return match.group(1)
    # No code block found, return original
    return text
```
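A quick check of this helper, redefined here so the snippet runs on its own:

```python
import re

def extract_code_from_markdown(text):
    """Return the contents of the first fenced code block, or the text unchanged."""
    pattern = r'```(?:\w+)?\n(.*?)\n```'
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else text

wrapped = "Here is the function:\n```python\ndef f():\n    return 1\n```"
code = extract_code_from_markdown(wrapped)
print(code)  # prints the two code lines, fences and preamble stripped

print(extract_code_from_markdown("x = 1"))  # no fence: returned unchanged
```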
## Dataset Format Requirements
For the code grader to know which validator to use, your test cases need to specify the expected output format:
### Example Test Case

```json
{
  "task": "Create a Python function to validate an AWS IAM username",
  "format": "python"
}
```
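Since `grade_syntax` dispatches on this field (and falls back to a neutral 5 for unknown formats), it can pay to validate dataset entries up front. The helper below is a hypothetical addition, not part of the guide's pipeline:

```python
VALID_FORMATS = {"python", "json", "regex"}

def check_test_case(test_case):
    """Raise ValueError if a test case is missing keys or uses an unknown format."""
    if "task" not in test_case or "format" not in test_case:
        raise ValueError(f"Missing required keys: {test_case}")
    if test_case["format"].lower() not in VALID_FORMATS:
        raise ValueError(f"Unknown format: {test_case['format']!r}")

# Passes silently; case-insensitive like grade_syntax
check_test_case({"task": "Validate an IAM username", "format": "Python"})
```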
### Updated Dataset Generation
You can update your dataset generation prompt to automatically include this format field:
````python
def generate_dataset():
    prompt = """
Generate an evaluation dataset for a prompt evaluation. The dataset will be used to evaluate prompts that generate Python, JSON, or Regex specifically for AWS-related tasks.

Generate an array of JSON objects with this structure:
```json
[
    {
        "task": "Description of task",
        "format": "python" | "json" | "regex"
    },
    ...additional
]
```

- Focus on tasks that can be solved by writing a single Python function, a single JSON object, or a single regex
- Distribute evenly across all three formats
- Focus on tasks that do not require writing much code

Please generate 6 objects (2 Python, 2 JSON, 2 Regex).
"""

    messages = []
    add_user_message(messages, prompt)
    add_assistant_message(messages, "```json")
    text = chat(messages, stop_sequences=["```"])
    return json.loads(text)
````
### Example Generated Dataset
```json
[
  {
    "task": "Write a Python function that lists all S3 buckets in an AWS account",
    "format": "python"
  },
  {
    "task": "Create a JSON configuration for an AWS Lambda function with 512MB memory and a 30-second timeout",
    "format": "json"
  },
  {
    "task": "Write a regex pattern to validate AWS ARN format for S3 buckets",
    "format": "regex"
  },
  {
    "task": "Write a Python function to parse an AWS CloudWatch log timestamp",
    "format": "python"
  },
  {
    "task": "Create a JSON IAM policy that allows read-only access to a specific S3 bucket",
    "format": "json"
  },
  {
    "task": "Write a regex to extract the region from an AWS ARN",
    "format": "regex"
  }
]
```
## Improving Prompt Clarity
To get better results from your AI model, make your prompt instructions more specific about the expected output format:
### Updated Prompt Template
```python
def run_prompt(test_case):
    """Merges the prompt and test case input, then returns the result"""
    prompt = f"""
Please solve the following task:

{test_case["task"]}

Important instructions:
* Respond only with Python, JSON, or a plain Regex
* Do not add any comments or commentary or explanation
* Return clean, executable code only
"""

    messages = []
    add_user_message(messages, prompt)
    # Prefill with a code block to encourage clean output
    add_assistant_message(messages, "```code")
    output = chat(messages, stop_sequences=["```"])
    return output
```
### Why This Works

- **Explicit format requirement:** "Respond only with Python, JSON, or a plain Regex"
- **No explanations:** "Do not add any comments or commentary or explanation"
- **Prefilled code block:** Starting with `` ```code `` signals the model to generate code immediately
- **Stop sequence:** Ending at `` ``` `` ensures we capture only the code content
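The `add_user_message`, `add_assistant_message`, and `chat` helpers are assumed from earlier guides in this series. As a rough sketch of what they might look like against the Anthropic Messages API (the model name and `max_tokens` value here are illustrative, not prescribed by this guide):

```python
def add_user_message(messages, text):
    """Append a user turn to the conversation list."""
    messages.append({"role": "user", "content": text})

def add_assistant_message(messages, text):
    """Append an assistant turn, e.g. a prefill like '```json'."""
    messages.append({"role": "assistant", "content": text})

def chat(messages, stop_sequences=None):
    """Send the conversation and return the assistant's text completion."""
    import anthropic  # requires the `anthropic` package and an API key

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model name
        max_tokens=1024,
        messages=messages,
        stop_sequences=stop_sequences or [],
    )
    return response.content[0].text
```

Because the last message is an assistant prefill, the completion continues directly from `` ```code ``, and the stop sequence trims the closing fence.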
## Combining Scores
The final step is merging the model grader score with the code grader score. A simple approach is to take the average:
```python
def run_test_case(test_case):
    """Runs a single test case and grades the result with both graders"""
    output = run_prompt(test_case)

    # Model grader: evaluate task completion and quality
    model_grade = grade_by_model(test_case, output)
    model_score = model_grade["score"]

    # Code grader: validate syntax
    syntax_score = grade_syntax(output, test_case)

    # Combined score (equal weight)
    score = (model_score + syntax_score) / 2

    return {
        "output": output,
        "test_case": test_case,
        "score": score,
        "model_score": model_score,
        "syntax_score": syntax_score,
        "reasoning": model_grade["reasoning"],
    }
```
### Weighted Scoring
You might adjust these weights based on what matters more for your specific use case:
```python
# Prioritize syntax correctness (70% syntax, 30% quality)
score = (syntax_score * 0.7) + (model_score * 0.3)

# Prioritize task completion (30% syntax, 70% quality)
score = (syntax_score * 0.3) + (model_score * 0.7)

# Require both (minimum of the two)
score = min(syntax_score, model_score)
```
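The choice matters most when the two graders disagree. For a response that fails syntax validation (0) but scores well on quality (8), the three strategies diverge sharply:

```python
syntax_score, model_score = 0, 8

average = (syntax_score + model_score) / 2         # partial credit
weighted = syntax_score * 0.7 + model_score * 0.3  # syntax-heavy
strict = min(syntax_score, model_score)            # unparseable code fails outright

print(f"{average:.1f} {weighted:.1f} {strict}")  # 4.0 2.4 0
```

The `min` strategy is the natural fit when broken syntax should be disqualifying regardless of how well the response addresses the task.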
## Example Evaluation Results

### Before Prompt Improvements

```
📊 Evaluation Results (Baseline Prompt)

Test Case 1: Python S3 bucket lister
  Model Score:  8/10
  Syntax Score: 0/10 (SyntaxError: includes explanatory text)
  Combined:     4/10

Test Case 2: JSON Lambda config
  Model Score:  9/10
  Syntax Score: 10/10
  Combined:     9.5/10

Test Case 3: Regex ARN validator
  Model Score:  7/10
  Syntax Score: 0/10 (re.error: invalid escape sequence)
  Combined:     3.5/10

Average Score: 5.67/10
```
### After Prompt Improvements

```
📊 Evaluation Results (Improved Prompt)

Test Case 1: Python S3 bucket lister
  Model Score:  9/10
  Syntax Score: 10/10
  Combined:     9.5/10

Test Case 2: JSON Lambda config
  Model Score:  9/10
  Syntax Score: 10/10
  Combined:     9.5/10

Test Case 3: Regex ARN validator
  Model Score:  8/10
  Syntax Score: 10/10
  Combined:     9/10

Average Score: 9.33/10 (+3.66 improvement)
```
## Testing Your Implementation
Once you've implemented code grading, run your evaluation to get a baseline score:
```python
# Load dataset
with open("dataset.json", "r") as f:
    dataset = json.load(f)

# Run evaluation
results = run_eval(dataset)

# Analyze results
print("\n📊 Detailed Results:\n")
for i, result in enumerate(results, 1):
    print(f"Test Case {i}: {result['test_case']['task'][:50]}...")
    print(f"  Model Score: {result['model_score']}/10")
    print(f"  Syntax Score: {result['syntax_score']}/10")
    print(f"  Combined: {result['score']}/10")
    if result['syntax_score'] == 0:
        print("  ⚠️ Syntax validation failed!")
    print()
```
### What to Look For

- **Syntax failures:** If `syntax_score` is 0, the code doesn't parse
- **Score distribution:** Are certain formats (Python/JSON/Regex) performing worse?
- **Improvement opportunities:** Low model scores suggest prompt clarity issues
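To see whether one format is dragging the average down, you can group scores by format. A sketch over hypothetical results (the dict shape matches what `run_test_case` returns):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical results in the shape returned by run_test_case
results = [
    {"test_case": {"format": "python"}, "score": 9.5},
    {"test_case": {"format": "json"},   "score": 9.5},
    {"test_case": {"format": "regex"},  "score": 3.5},
]

by_format = defaultdict(list)
for result in results:
    by_format[result["test_case"]["format"]].append(result["score"])

for fmt, scores in sorted(by_format.items()):
    print(f"{fmt}: {mean(scores):.1f}/10")
# json: 9.5/10
# python: 9.5/10
# regex: 3.5/10
```

A breakdown like this would point you at regex tasks as the place to focus prompt improvements.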
## Full Code Example
Here's the complete code grading implementation:
```python
import json
import ast
import re
from statistics import mean


def validate_json(text):
    try:
        json.loads(text.strip())
        return 10
    except json.JSONDecodeError:
        return 0


def validate_python(text):
    try:
        ast.parse(text.strip())
        return 10
    except SyntaxError:
        return 0


def validate_regex(text):
    try:
        re.compile(text.strip())
        return 10
    except re.error:
        return 0


def extract_code_from_markdown(text):
    pattern = r'```(?:\w+)?\n(.*?)\n```'
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else text


def grade_syntax(output, test_case):
    expected_format = test_case.get("format", "").lower()
    code = extract_code_from_markdown(output)
    if expected_format == "python":
        return validate_python(code)
    elif expected_format == "json":
        return validate_json(code)
    elif expected_format == "regex":
        return validate_regex(code)
    else:
        return 5


def run_prompt(test_case):
    prompt = f"""
Please solve the following task:

{test_case["task"]}

Important instructions:
* Respond only with Python, JSON, or a plain Regex
* Do not add any comments or commentary or explanation
* Return clean, executable code only
"""
    messages = []
    add_user_message(messages, prompt)
    add_assistant_message(messages, "```code")
    output = chat(messages, stop_sequences=["```"])
    return output


def run_test_case(test_case):
    output = run_prompt(test_case)
    model_grade = grade_by_model(test_case, output)
    model_score = model_grade["score"]
    syntax_score = grade_syntax(output, test_case)
    score = (model_score + syntax_score) / 2
    return {
        "output": output,
        "test_case": test_case,
        "score": score,
        "model_score": model_score,
        "syntax_score": syntax_score,
        "reasoning": model_grade["reasoning"],
    }


def run_eval(dataset):
    results = []
    for test_case in dataset:
        result = run_test_case(test_case)
        results.append(result)

    average_score = mean([result["score"] for result in results])
    print(f"📊 Average score: {average_score:.2f}/10")

    # Show syntax pass rate
    syntax_passes = sum(1 for r in results if r["syntax_score"] == 10)
    print(f"✅ Syntax validation: {syntax_passes}/{len(results)} passed")

    return results
```
## Key Takeaways

- **Code graders validate structure:** Use them for syntax, format, and technical correctness
- **Model graders validate semantics:** Use them for task completion and quality
- **Combine both for comprehensive evaluation:** Syntax + quality = robust assessment
- **Specify format in test cases:** Include `"format": "python"` so the grader knows what to validate
- **Improve prompts iteratively:** Use scores to measure progress objectively
- **Prefill for cleaner output:** Starting with `` ```code `` reduces explanatory text
The score itself isn't inherently good or bad—what matters is whether you can improve it by refining your prompts. This gives you a quantitative way to measure prompt engineering progress rather than relying on subjective assessment.
## Next Steps
Now that you have both code and model grading in place, you can:
- Run baseline evaluation: Get your starting score
- Identify failure patterns: Which formats or tasks perform worst?
- Iterate on prompts: Add instructions, examples, or constraints
- Re-evaluate: Measure improvement objectively
- Deploy with confidence: Know your prompt works across diverse inputs
Continue to the next guide on iterating and optimizing prompts using evaluation feedback to systematically improve your results.