Prompt Engineering: Iterative Improvement Through Evaluation

By Vuk Dukic, Founder, AI/ML Engineer
Prompt engineering is about taking a prompt you've written and improving it to get more reliable, higher-quality outputs. This process involves iterative refinement—starting with a basic prompt, evaluating its performance, then systematically applying engineering techniques to improve it.
Unlike trial-and-error guessing, systematic prompt engineering uses objective measurement to guide improvements. Each change you make is validated through evaluation, ensuring you're actually making progress rather than just making changes.
The Iterative Improvement Process
The approach follows a clear cycle that you can repeat until you achieve your desired results:
1. Set a Goal
Define what you want your prompt to accomplish
2. Write an Initial Prompt
Create a basic first attempt
3. Evaluate the Prompt
Test it against your criteria
4. Apply Prompt Engineering Techniques
Use specific methods to improve performance
5. Re-evaluate
Verify that your changes actually improved the results
You repeat the last two steps until you're satisfied with the performance. Each iteration should show measurable improvement in your evaluation scores.
Why This Approach Works
Traditional Approach (Guesswork)
❌ Make random changes
❌ Subjectively judge if it "feels better"
❌ No way to know if you're improving
❌ Can't compare different versions objectively
Systematic Approach (Evaluation-Driven)
✅ Make targeted changes based on feedback
✅ Measure improvement with objective scores
✅ Track progress over time
✅ Confidently deploy the best-performing version
Setting Up Your Evaluation Pipeline
To demonstrate this process, we'll work with a practical example: creating a prompt that generates one-day meal plans for athletes. The prompt needs to take into account an athlete's height, weight, goals, and dietary restrictions, then produce a comprehensive meal plan.
The PromptEvaluator Class
The evaluation setup uses a PromptEvaluator class that handles dataset generation and model grading. When creating your evaluator instance, you can control concurrency with the max_concurrent_tasks parameter:
from prompt_evaluator import PromptEvaluator
evaluator = PromptEvaluator(max_concurrent_tasks=5)
Performance tip: Start with a low concurrency value (like 3) to avoid rate-limit errors; you can raise it for faster processing if your API quota allows.
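For intuition, here is how a cap like max_concurrent_tasks is typically implemented: a semaphore that lets only N requests run at once. This is an illustrative sketch, not PromptEvaluator's actual code:

import asyncio

# Illustrative only: a semaphore caps how many coroutines run concurrently,
# which is the usual mechanism behind a max_concurrent_tasks setting.
async def run_with_limit(coros, max_concurrent_tasks=3):
    semaphore = asyncio.Semaphore(max_concurrent_tasks)

    async def guarded(coro):
        async with semaphore:  # at most max_concurrent_tasks awaits at once
            return await coro

    return await asyncio.gather(*(guarded(c) for c in coros))

# Usage (grade_case is a hypothetical async function that calls the API):
# results = asyncio.run(run_with_limit([grade_case(c) for c in cases], 3))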
Generating Test Data
The evaluation system can automatically generate test cases based on your prompt requirements. You define what inputs your prompt needs:
dataset = evaluator.generate_dataset(
    task_description="Write a compact, concise 1 day meal plan for a single athlete",
    prompt_inputs_spec={
        "height": "Athlete's height in cm",
        "weight": "Athlete's weight in kg",
        "goal": "Goal of the athlete",
        "restrictions": "Dietary restrictions of the athlete"
    },
    output_file="dataset.json",
    num_cases=3
)
What This Generates
The evaluator will create test cases like:
[
  {
    "height": "175 cm",
    "weight": "70 kg",
    "goal": "Build muscle mass",
    "restrictions": "Lactose intolerant"
  },
  {
    "height": "182 cm",
    "weight": "85 kg",
    "goal": "Lose body fat while maintaining strength",
    "restrictions": "Vegetarian, no soy"
  },
  {
    "height": "168 cm",
    "weight": "62 kg",
    "goal": "Improve endurance for marathon training",
    "restrictions": "Gluten-free"
  }
]
Development tip: Keep the number of test cases low (2-3) during development to speed up your iteration cycle. You can increase this to 10-20 for final validation.
Writing Your Initial Prompt
Start with a simple, naive prompt to establish a baseline. Here's an example of a deliberately basic first attempt:
def run_prompt(prompt_inputs):
    """
    Version 1: Baseline prompt (intentionally simple)
    """
    prompt = f"""
    What should this person eat?
    - Height: {prompt_inputs["height"]}
    - Weight: {prompt_inputs["weight"]}
    - Goal: {prompt_inputs["goal"]}
    - Dietary restrictions: {prompt_inputs["restrictions"]}
    """

    messages = []
    add_user_message(messages, prompt)
    return chat(messages)
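Note that run_prompt relies on two helpers, add_user_message and chat, which aren't defined in this article. A minimal sketch of what they might look like, here assuming the Anthropic Messages API (the client setup and model name are assumptions; any chat-completion API would work the same way):

# Hypothetical implementations of the helpers used above.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def add_user_message(messages, text):
    # Append a user turn to the running conversation.
    messages.append({"role": "user", "content": text})

def chat(messages):
    # Send the conversation and return the model's text reply.
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumption: any capable chat model works
        max_tokens=2048,
        messages=messages,
    )
    return response.content[0].text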
Why Start Simple?
This basic prompt will likely produce poor results, but it gives you:
- A baseline score to measure improvement against
- Clear evidence of what's missing
- A foundation to build upon systematically
Adding Evaluation Criteria
When running your evaluation, you can specify additional criteria that the grading model should consider:
results = evaluator.run_evaluation(
    run_prompt_function=run_prompt,
    dataset_file="dataset.json",
    extra_criteria="""
    The output should include:
    - Daily caloric total
    - Macronutrient breakdown (protein, carbs, fats in grams)
    - Meals with exact foods, portions, and timing
    - Alignment with the athlete's specific goal
    - Respect for all dietary restrictions
    """
)
This helps ensure your prompt is evaluated against the specific requirements that matter for your use case.
Example Grading Criteria
The evaluator will assess each output on:
- Completeness: Does it include all required elements?
- Accuracy: Are calorie and macro calculations correct?
- Specificity: Are portions and timing clearly defined?
- Personalization: Does it address the athlete's goal and restrictions?
- Usability: Is it easy to follow and implement?
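Under the hood, model grading generally works by handing the grader the task, the criteria, and the candidate output, then asking for a structured verdict. The exact prompt PromptEvaluator uses isn't shown here; the sketch below is one plausible shape:

import json

def build_grader_prompt(task_description, criteria, candidate_output):
    # Ask for JSON so the score and reasoning can be parsed reliably.
    return f"""
You are grading an AI-generated response against explicit criteria.

Task: {task_description}

Criteria:
{criteria}

Response to grade:
{candidate_output}

Score the response from 1 to 10 and explain your reasoning.
Reply with JSON only: {{"score": <int>, "reasoning": "<string>"}}
"""

def parse_grade(grader_reply):
    # Assumes the grading model followed the JSON-only instruction.
    verdict = json.loads(grader_reply)
    return verdict["score"], verdict["reasoning"]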
Analyzing Results
After running an evaluation, you'll get both a numerical score and a detailed HTML report. The report shows you exactly how each test case performed, including the model's reasoning for each score.
Example Baseline Results
📊 Evaluation Results - Version 1 (Baseline)
Test Case 1: 175cm, 70kg, Build muscle, Lactose intolerant
Score: 2/10
Reasoning: "Response is too vague. Says 'eat protein and vegetables'
but provides no specific foods, portions, or calorie targets.
Doesn't address lactose intolerance."
Test Case 2: 182cm, 85kg, Lose fat, Vegetarian no soy
Score: 3/10
Reasoning: "Mentions general food categories but lacks structure.
No meal timing, no macros, no calorie target. Suggests
tofu despite 'no soy' restriction."
Test Case 3: 168cm, 62kg, Marathon training, Gluten-free
Score: 2/10
Reasoning: "Generic advice about 'eating healthy carbs.' No specific
meal plan, no consideration of endurance athlete needs,
suggests whole wheat bread despite gluten-free requirement."
Average Score: 2.3/10
Don't Be Discouraged by Low Scores
A score of 2.3 out of 10 is typical for a first attempt. The goal is to see consistent improvement as you apply engineering techniques.
The detailed evaluation report helps you understand exactly where your prompt is failing and what improvements are needed. Use this feedback to guide your next iteration.
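If you'd rather inspect failures programmatically than in the HTML report, you can loop over the per-case details. The article only documents results['average_score'], so the per-case schema below is an assumption:

# Assumed schema: only results["average_score"] is documented above.
for i, case in enumerate(results.get("cases", []), start=1):
    print(f"Test Case {i}: {case['score']}/10")
    print(f"  Reasoning: {case['reasoning']}")

print(f"Average Score: {results['average_score']}/10")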
Applying Prompt Engineering Techniques
Now that you have a baseline, you can systematically apply techniques to improve performance. Here are common techniques and their expected impact:
Technique 1: Add Explicit Instructions
Before:
prompt = f"What should this person eat?"
After:
prompt = f"""
Create a detailed 1-day meal plan for an athlete with the following profile:
- Height: {prompt_inputs["height"]}
- Weight: {prompt_inputs["weight"]}
- Goal: {prompt_inputs["goal"]}
- Dietary restrictions: {prompt_inputs["restrictions"]}
Your meal plan must include:
1. Total daily calories
2. Macronutrient breakdown (protein/carbs/fats in grams)
3. 4-5 meals with specific foods and portion sizes
4. Meal timing throughout the day
"""
Expected improvement: +2-3 points
Technique 2: Provide Examples (Few-Shot Prompting)
prompt = f"""
Create a detailed 1-day meal plan for an athlete.
Example format:
**Daily Totals:** 2,800 calories | 180g protein | 320g carbs | 90g fat
**Meal 1 (7:00 AM):**
- 3 whole eggs, scrambled
- 2 slices whole grain toast
- 1 medium banana
- 1 cup black coffee
[... more meals ...]
Now create a plan for:
- Height: {prompt_inputs["height"]}
- Weight: {prompt_inputs["weight"]}
- Goal: {prompt_inputs["goal"]}
- Dietary restrictions: {prompt_inputs["restrictions"]}
"""
Expected improvement: +1-2 points
Technique 3: Add Role Context
prompt = f"""
You are a certified sports nutritionist with 10 years of experience
creating meal plans for competitive athletes.
Create a detailed 1-day meal plan for an athlete with this profile:
[... rest of prompt ...]
"""
Expected improvement: +0.5-1 point
Technique 4: Use Chain-of-Thought Reasoning
prompt = f"""
Create a 1-day meal plan for this athlete:
[... athlete details ...]
Before creating the meal plan, think through:
1. Calculate their estimated daily caloric needs based on height, weight, and goal
2. Determine optimal macro ratios for their specific goal
3. Identify foods that meet their restrictions
4. Plan meal timing to support their training
Then provide the complete meal plan.
"""
Expected improvement: +1-2 points
Iteration Example: Version 2
Let's apply several techniques at once and re-evaluate. (Combining changes speeds up this walkthrough; in your own iterations, prefer changing one thing at a time, as covered under Best Practices below.)
def run_prompt_v2(prompt_inputs):
    """
    Version 2: Added explicit instructions + structure + role context
    """
    prompt = f"""
    You are a certified sports nutritionist. Create a detailed 1-day meal plan
    for an athlete with the following profile:

    **Athlete Profile:**
    - Height: {prompt_inputs["height"]}
    - Weight: {prompt_inputs["weight"]}
    - Goal: {prompt_inputs["goal"]}
    - Dietary restrictions: {prompt_inputs["restrictions"]}

    **Required Output Format:**

    **Daily Totals:** [calories] | [protein]g protein | [carbs]g carbs | [fat]g fat

    **Meal 1 (Time):**
    - [Food item with portion]
    - [Food item with portion]
    [Calories: X | Protein: Xg | Carbs: Xg | Fat: Xg]

    [... repeat for 4-5 meals ...]

    **Important:**
    - Calculate calorie needs based on athlete's stats and goal
    - Ensure all foods comply with dietary restrictions
    - Provide specific portion sizes (grams, cups, etc.)
    - Include meal timing optimized for training
    """

    messages = []
    add_user_message(messages, prompt)
    return chat(messages)
Version 2 Results
📊 Evaluation Results - Version 2
Test Case 1: 175cm, 70kg, Build muscle, Lactose intolerant
Score: 7/10
Reasoning: "Provides complete meal plan with specific foods and portions.
Correctly avoids dairy. Includes calorie and macro totals.
Minor issue: could be more specific about training timing."
Test Case 2: 182cm, 85kg, Lose fat, Vegetarian no soy
Score: 8/10
Reasoning: "Excellent structure and detail. Respects both restrictions.
Appropriate calorie deficit for fat loss. Good protein sources
without soy. Well-timed meals."
Test Case 3: 168cm, 62kg, Marathon training, Gluten-free
Score: 7/10
Reasoning: "Good carb focus for endurance. All foods are gluten-free.
Includes timing around training. Could improve by adding
more electrolyte-rich foods for marathon prep."
Average Score: 7.3/10 (+5.0 improvement from baseline)
Tracking Progress Over Time
Keep a log of your iterations:
| Version | Techniques Applied | Avg Score | Change |
|---------|-------------------|-----------|--------|
| v1 (Baseline) | None | 2.3/10 | - |
| v2 | Explicit instructions + role + structure | 7.3/10 | +5.0 |
| v3 | Added few-shot examples | 8.1/10 | +0.8 |
| v4 | Added chain-of-thought | 8.7/10 | +0.6 |
| v5 | Refined output format | 9.2/10 | +0.5 |
This data-driven approach shows you exactly which techniques provide the most value.
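A lightweight way to keep this log is in code next to your prompt versions. The structure below is illustrative, not part of the evaluator:

# Illustrative iteration log; append one entry after each evaluation run.
iteration_log = []

def log_version(version, techniques, avg_score):
    previous = iteration_log[-1]["avg_score"] if iteration_log else None
    change = None if previous is None else round(avg_score - previous, 1)
    iteration_log.append({"version": version, "techniques": techniques,
                          "avg_score": avg_score, "change": change})

log_version("v1 (Baseline)", "None", 2.3)
log_version("v2", "Explicit instructions + role + structure", 7.3)

for entry in iteration_log:
    delta = "-" if entry["change"] is None else f"{entry['change']:+.1f}"
    print(f"{entry['version']}: {entry['avg_score']}/10 ({delta})")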
When to Stop Iterating
You can stop when any of the following holds (a loop encoding these rules is sketched after the list):
- You hit your target score (e.g., 9/10 or higher)
- Improvements plateau (changes yield <0.2 point gains)
- The prompt meets production requirements (tested on larger dataset)
- Cost/benefit analysis (further improvements aren't worth the effort)
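These stopping rules are easy to encode directly in your evaluation loop. A minimal sketch, assuming the evaluator interface from earlier and a hypothetical prompt_versions list of run-prompt functions:

# Sketch: iterate until the target score is hit or gains plateau.
TARGET_SCORE = 9.0
MIN_GAIN = 0.2

best_score = 0.0
for version, run_fn in enumerate(prompt_versions, start=1):
    results = evaluator.run_evaluation(
        run_prompt_function=run_fn,
        dataset_file="dataset.json",
    )
    score = results["average_score"]
    gain = score - best_score
    print(f"v{version}: {score}/10 ({gain:+.1f} vs best)")

    if score >= TARGET_SCORE:
        print("Target score reached; stop iterating.")
        break
    if 0 <= gain < MIN_GAIN:
        print("Gains have plateaued; consider stopping or trying a new technique.")
        break
    best_score = max(best_score, score)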
Best Practices
1. Change One Thing at a Time
Don't apply multiple techniques simultaneously—you won't know which one helped.
2. Use Version Control
Save each prompt version with clear labels (v1, v2, v3) so you can roll back if needed.
3. Test on Fresh Data
Periodically generate new test cases to ensure you're not overfitting to your initial dataset.
4. Document Your Reasoning
Keep notes on why you made each change and what you expected to improve.
5. Validate with Humans
Spot-check high-scoring outputs manually to ensure the grader isn't being too generous.
Common Pitfalls
❌ Making too many changes at once → Can't identify what worked
❌ Ignoring the evaluation feedback → Guessing instead of using data
❌ Overfitting to test cases → Prompt works on eval set but fails in production
❌ Stopping too early → Settling for "good enough" when excellence is achievable
❌ Not testing edge cases → Prompt fails on unusual inputs
Next Steps
With your baseline established and your first iteration complete, you're ready to:
- Learn specific prompt engineering techniques in depth
- Expand your test dataset to cover more edge cases
- Optimize for cost and latency while maintaining quality
- Deploy with confidence knowing your prompt has been rigorously tested
Remember that prompt engineering is an iterative process. The key is to make one change at a time, evaluate the impact, and build on what works. This systematic approach ensures you understand which techniques provide the most value for your specific use case.
Full Code Example
from prompt_evaluator import PromptEvaluator

# add_user_message and chat are the helper functions sketched earlier.

# Initialize evaluator
evaluator = PromptEvaluator(max_concurrent_tasks=3)

# Generate test dataset
dataset = evaluator.generate_dataset(
    task_description="Write a compact, concise 1 day meal plan for a single athlete",
    prompt_inputs_spec={
        "height": "Athlete's height in cm",
        "weight": "Athlete's weight in kg",
        "goal": "Goal of the athlete",
        "restrictions": "Dietary restrictions of the athlete"
    },
    output_file="dataset.json",
    num_cases=3
)

# Version 1: Baseline
def run_prompt_v1(prompt_inputs):
    prompt = f"""
    What should this person eat?
    - Height: {prompt_inputs["height"]}
    - Weight: {prompt_inputs["weight"]}
    - Goal: {prompt_inputs["goal"]}
    - Dietary restrictions: {prompt_inputs["restrictions"]}
    """
    messages = []
    add_user_message(messages, prompt)
    return chat(messages)

# Run evaluation
results_v1 = evaluator.run_evaluation(
    run_prompt_function=run_prompt_v1,
    dataset_file="dataset.json",
    extra_criteria="""
    The output should include:
    - Daily caloric total
    - Macronutrient breakdown
    - Meals with exact foods, portions, and timing
    """
)

print(f"Version 1 Average Score: {results_v1['average_score']}/10")

# Version 2: Improved
def run_prompt_v2(prompt_inputs):
    prompt = f"""
    You are a certified sports nutritionist. Create a detailed 1-day meal plan.

    **Athlete Profile:**
    - Height: {prompt_inputs["height"]}
    - Weight: {prompt_inputs["weight"]}
    - Goal: {prompt_inputs["goal"]}
    - Dietary restrictions: {prompt_inputs["restrictions"]}

    Include daily totals, 4-5 meals with specific foods and portions, and meal timing.
    """
    messages = []
    add_user_message(messages, prompt)
    return chat(messages)

results_v2 = evaluator.run_evaluation(
    run_prompt_function=run_prompt_v2,
    dataset_file="dataset.json",
    extra_criteria="""
    The output should include:
    - Daily caloric total
    - Macronutrient breakdown
    - Meals with exact foods, portions, and timing
    """
)

print(f"Version 2 Average Score: {results_v2['average_score']}/10")
print(f"Improvement: {results_v2['average_score'] - results_v1['average_score']:+.1f}")
Key Takeaways
- Start simple: Establish a baseline before optimizing
- Measure everything: Use objective scores to track progress
- Iterate systematically: One change at a time, evaluate, repeat
- Use the feedback: Let evaluation results guide your improvements
- Document your journey: Track what works and what doesn't
- Know when to stop: Hit your target score or reach diminishing returns
Systematic prompt engineering transforms guesswork into science, ensuring every change you make is backed by data and moves you closer to production-ready prompts.