Building a Custom Prompt Evaluation Workflow

Vuk Dukic, Founder, Senior Software Engineer

Building a custom prompt evaluation workflow starts with creating a solid prompt and then generating test data to see how well it performs. Let's walk through setting up an evaluation system for a prompt that helps users write AWS-specific code.


Setting Up the Goal

Our prompt needs to assist users in writing three specific types of output for AWS use cases:

  1. Python code
  2. JSON configuration files
  3. Regular expressions

The key requirement is that when a user requests help with a task, we return clean output in one of these formats without any extra explanations, headers, or footers.

Starting Prompt (Version 1)

Here's our baseline prompt:

prompt = f"""
Please provide a solution to the following task:
{task}
"""

This simple template will serve as our starting point. We'll evaluate its performance and iterate from there.
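To make the template concrete, here is how it renders for a single task. The task string below is just an illustration; in the next section we'll generate tasks like this automatically:

```python
# Illustrative only: render the baseline prompt for one hypothetical task
task = "Write a Python function that lists all S3 buckets in an AWS account"

prompt = f"""
Please provide a solution to the following task:
{task}
"""

print(prompt)
```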


Creating an Evaluation Dataset

An evaluation dataset contains inputs that we'll feed into our prompt. For each combination of prompt and input, we'll run the prompt and analyze the results.

Our dataset will be an array of JSON objects, where each object contains a "task" property describing what we want Claude to accomplish. We can either create this dataset by hand or generate it automatically using Claude.

Since we're only generating test data, this is a good opportunity to use a smaller, faster model like Claude 3 Haiku instead of a larger model such as Sonnet or Opus: it's more cost-effective and still produces good-quality synthetic data.


Generating Test Data with Code

Let's create a function that automatically generates our test dataset. First, we'll need our helper functions for working with Claude:

Helper Functions

def add_user_message(messages, text):
    user_message = {"role": "user", "content": text}
    messages.append(user_message)

def add_assistant_message(messages, text):
    assistant_message = {"role": "assistant", "content": text}
    messages.append(assistant_message)

def chat(messages, system=None, temperature=1.0, stop_sequences=[]):
    params = {
        "model": model,
        "max_tokens": 1000,
        "messages": messages,
        "temperature": temperature
    }
    if system:
        params["system"] = system
    if stop_sequences:
        params["stop_sequences"] = stop_sequences
    
    response = client.messages.create(**params)
    return response.content[0].text

These helper functions make it easy to build conversation histories and send requests to Claude.
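As a quick sanity check, here is how the helpers fit together for a single request. This assumes `client` and `model` are already configured, as in the full example at the end of this post; the task text is just an illustration:

```python
# Minimal usage sketch: build a one-turn conversation and send it to Claude
messages = []
add_user_message(messages, "Write a regex that matches an S3 bucket ARN.")
reply = chat(messages, temperature=0.0)
print(reply)
```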


Dataset Generation Function

Now we'll create our dataset generation function:

def generate_dataset():
    prompt = """
Generate an evaluation dataset for a prompt evaluation. The dataset will be used to evaluate prompts that generate Python, JSON, or Regex specifically for AWS-related tasks. Generate an array of JSON objects, each representing a task that requires Python, JSON, or a Regex to complete.

Example output:
```json
[
  {
    "task": "Description of task",
  },
  ...additional
]
  • Focus on tasks that can be solved by writing a single Python function, a single JSON object, or a single regex
  • Focus on tasks that do not require writing much code

Please generate 3 objects. """


### Using Prefilling and Stop Sequences

To properly parse the JSON response, we'll use **prefilling** and **stop sequences**:

```python
    messages = []
    add_user_message(messages, prompt)
    add_assistant_message(messages, "```json")
    text = chat(messages, stop_sequences=["```"])
    return json.loads(text)
```

What's happening here:

  1. Prefilling: We add an assistant message with "```json" to guide Claude to start its response with a JSON code block
  2. Stop sequences: We tell Claude to stop generating when it reaches the closing ``` marker
  3. Parsing: Because of the prefill and the stop sequence, the returned text contains only the JSON array itself, which we parse directly with json.loads

This technique reliably gives us clean, parseable JSON output.
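If you want to guard against the occasional stray whitespace or malformed response, you can wrap the parsing step in a small defensive helper. This is an optional sketch, not part of the original workflow:

```python
import json

# Optional defensive parsing: strip whitespace and surface a readable error
# if the model returns something that is not valid JSON.
def parse_json_response(text):
    try:
        return json.loads(text.strip())
    except json.JSONDecodeError as e:
        raise ValueError(f"Model did not return valid JSON: {e}\n---\n{text}") from e
```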


Testing the Dataset Generation

Let's run our function and see what kind of test cases we get:

dataset = generate_dataset()
print(dataset)

Example Output

You might see something like this:

[
  {
    "task": "Write a Python function that lists all S3 buckets in an AWS account"
  },
  {
    "task": "Create a JSON configuration for an AWS Lambda function with 512MB memory and a 30-second timeout"
  },
  {
    "task": "Write a regex pattern to validate AWS ARN format for S3 buckets"
  }
]

This returns three different test cases covering our target outputs:

  • Python functions for AWS SDK operations
  • JSON configurations for AWS service settings
  • Regular expressions for AWS-specific validation

Saving the Dataset

Once we have our dataset, we'll save it to a file so we can easily load it later during evaluation:

with open('dataset.json', 'w') as f:
    json.dump(dataset, f, indent=2)

This creates a dataset.json file in the same directory as your notebook, containing your list of tasks ready for prompt evaluation.

Loading the Dataset Later

When you're ready to run evaluations, you can load the dataset like this:

with open('dataset.json', 'r') as f:
    dataset = json.load(f)

Why This Approach Works

1. Automation Saves Time

Manually writing dozens or hundreds of test cases is tedious. Using Claude to generate them is faster and ensures diversity.

2. Consistency

The JSON structure ensures every test case has the same format, making it easy to iterate through during evaluation.

3. Scalability

Want 100 test cases instead of 3? Just change the prompt (see the sketch below). Want to add new task types? Update the generation instructions.

4. Cost-Effective

Using Haiku for dataset generation keeps costs low while maintaining quality.
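Building on the scalability point above, one way to parameterize the generator is to pass the desired number of tasks into the prompt. This `generate_dataset_n` variant is hypothetical, not code from the original workflow:

```python
# Hypothetical variant: let the caller choose how many tasks to generate
def generate_dataset_n(num_tasks=3):
    prompt = f"""
Generate an evaluation dataset for a prompt evaluation. The dataset will be used
to evaluate prompts that generate Python, JSON, or Regex specifically for
AWS-related tasks. Generate an array of JSON objects, each with a "task" property.

* Focus on tasks that can be solved with a single Python function, JSON object, or regex
* Focus on tasks that do not require writing much code

Please generate {num_tasks} objects.
"""
    messages = []
    add_user_message(messages, prompt)
    add_assistant_message(messages, "```json")
    text = chat(messages, stop_sequences=["```"])
    return json.loads(text)
```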


Next Steps: Running the Evaluation

With this foundation in place, you now have a systematic way to generate test data for evaluating how well your prompts perform across different types of AWS-related coding tasks.

The next steps in your evaluation workflow would be:

  1. Feed each task through your prompt to generate responses
  2. Grade the responses using an LLM-as-judge or rule-based criteria (a rough sketch of this loop follows the list)
  3. Calculate average scores to measure prompt performance
  4. Iterate on your prompt and re-run the evaluation
  5. Compare versions to find the best-performing prompt
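To make steps 1 through 3 concrete, here is a rough sketch of what that loop could look like. The `run_prompt` and `grade_response` pieces are hypothetical placeholders; `grade_response` might be an LLM-as-judge call or rule-based checks like the ones described below:

```python
# Rough sketch of an evaluation loop; run_prompt and grade_response are
# hypothetical placeholders, not part of the original post.
def run_prompt(task):
    prompt = f"""
Please provide a solution to the following task:
{task}
"""
    messages = []
    add_user_message(messages, prompt)
    return chat(messages)

def evaluate(dataset, grade_response):
    results = []
    for item in dataset:
        output = run_prompt(item["task"])
        score = grade_response(item["task"], output)  # e.g. a 1-10 rating
        results.append({"task": item["task"], "output": output, "score": score})
    average = sum(r["score"] for r in results) / len(results)
    return average, results
```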

Example Grading Criteria for AWS Code Tasks

When evaluating responses, you might check (a simple format check is sketched after this list):

  • Correctness: Does the code/JSON/regex actually work?
  • Format compliance: Is it clean output without extra explanations?
  • AWS best practices: Does it follow AWS conventions and security guidelines?
  • Completeness: Does it fully solve the task?
  • Efficiency: Is the solution concise and well-structured?
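As a concrete example of rule-based criteria, a simple format-compliance check could verify that a JSON answer parses or that a regex answer compiles. This is an illustrative sketch, not a complete grader:

```python
import json
import re

# Illustrative rule-based format checks (a sketch, not a complete grader)
def looks_like_clean_json(output):
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def looks_like_valid_regex(output):
    try:
        re.compile(output.strip())
        return True
    except re.error:
        return False
```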

Full Code Example

Here's the complete code in one place:

import json
from anthropic import Anthropic

client = Anthropic(api_key="your-api-key")
model = "claude-3-haiku-20240307"

def add_user_message(messages, text):
    user_message = {"role": "user", "content": text}
    messages.append(user_message)

def add_assistant_message(messages, text):
    assistant_message = {"role": "assistant", "content": text}
    messages.append(assistant_message)

def chat(messages, system=None, temperature=1.0, stop_sequences=[]):
    params = {
        "model": model,
        "max_tokens": 1000,
        "messages": messages,
        "temperature": temperature
    }
    if system:
        params["system"] = system
    if stop_sequences:
        params["stop_sequences"] = stop_sequences
    
    response = client.messages.create(**params)
    return response.content[0].text

def generate_dataset():
    prompt = """
Generate an evaluation dataset for a prompt evaluation. The dataset will be used to evaluate prompts that generate Python, JSON, or Regex specifically for AWS-related tasks. Generate an array of JSON objects, each representing a task that requires Python, JSON, or a Regex to complete.

Example output:
```json
[
  {
    "task": "Description of task",
  },
  ...additional
]
  • Focus on tasks that can be solved by writing a single Python function, a single JSON object, or a single regex
  • Focus on tasks that do not require writing much code

Please generate 3 objects. """

    messages = []
    add_user_message(messages, prompt)
    add_assistant_message(messages, "```json")
    text = chat(messages, stop_sequences=["```"])
    return json.loads(text)

# Generate and save the dataset
dataset = generate_dataset()
print(json.dumps(dataset, indent=2))

with open('dataset.json', 'w') as f:
    json.dump(dataset, f, indent=2)

print("\n✅ Dataset saved to dataset.json")


---

## Key Takeaways

1. **Start with a clear goal**: Define what outputs your prompt needs to produce
2. **Automate dataset generation**: Use Claude to create diverse, representative test cases
3. **Use the right model**: Haiku is perfect for generating synthetic eval data
4. **Leverage prefilling and stop sequences**: Get clean, parseable JSON output every time
5. **Save your datasets**: Reuse them across multiple prompt iterations

With this systematic approach, you can build robust evaluation workflows that help you confidently improve your prompts over time.

---

## Downloads

📓 **[001_prompt_evals.ipynb](001_prompt_evals.ipynb)** — Complete Jupyter notebook with all code examples

---

*Ready to take your prompt evaluation further? Check out our guide on [grading responses and iterating on prompts](#) to complete the full evaluation cycle.*