Prompt Engineering: Iterative Improvement Through Evaluation

By Vuk Dukic, Founder, AI/ML Engineer
Prompt engineering is about taking a prompt you've written and improving it to get more reliable, higher-quality outputs. This process involves iterative refinement—starting with a basic prompt, evaluating its performance, then systematically applying engineering techniques to improve it.
Unlike trial-and-error guessing, systematic prompt engineering uses objective measurement to guide improvements. Each change you make is validated through evaluation, ensuring you're actually making progress rather than just making changes.
The Iterative Improvement Process
The approach follows a clear cycle that you can repeat until you achieve your desired results:
1. Set a Goal
Define what you want your prompt to accomplish
2. Write an Initial Prompt
Create a basic first attempt
3. Evaluate the Prompt
Test it against your criteria
4. Apply Prompt Engineering Techniques
Use specific methods to improve performance
5. Re-evaluate
Verify that your changes actually improved the results
You repeat the last two steps until you're satisfied with the performance. Each iteration should show measurable improvement in your evaluation scores.
Why This Approach Works
Traditional Approach (Guesswork)
❌ Make random changes
❌ Subjectively judge if it "feels better"
❌ No way to know if you're improving
❌ Can't compare different versions objectively
Systematic Approach (Evaluation-Driven)
✅ Make targeted changes based on feedback
✅ Measure improvement with objective scores
✅ Track progress over time
✅ Confidently deploy the best-performing version
Setting Up Your Evaluation Pipeline
To demonstrate this process, we'll work with a practical example: creating a prompt that generates one-day meal plans for athletes. The prompt needs to take into account an athlete's height, weight, goals, and dietary restrictions, then produce a comprehensive meal plan.
The PromptEvaluator Class
The evaluation setup uses a PromptEvaluator class that handles dataset generation and model grading. When creating your evaluator instance, you can control concurrency with the max_concurrent_tasks parameter:
from prompt_evaluator import PromptEvaluator
evaluator = PromptEvaluator(max_concurrent_tasks=5)
Performance tip: Start with a low concurrency value (like 3) to avoid rate-limit errors; you can raise it for faster processing if your API quota allows.
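For intuition, here is how a cap like max_concurrent_tasks is typically implemented: a semaphore that lets only N requests run at once. This is an illustrative sketch, not PromptEvaluator's actual code:

import asyncio

# Illustrative only: a semaphore caps how many coroutines run concurrently,
# which is the usual mechanism behind a max_concurrent_tasks setting.
async def run_with_limit(coros, max_concurrent_tasks=3):
    semaphore = asyncio.Semaphore(max_concurrent_tasks)

    async def guarded(coro):
        async with semaphore:  # at most max_concurrent_tasks awaits at once
            return await coro

    return await asyncio.gather(*(guarded(c) for c in coros))

# Usage (grade_case is a hypothetical async function that calls the API):
# results = asyncio.run(run_with_limit([grade_case(c) for c in cases], 3))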
Generating Test Data
The evaluation system can automatically generate test cases based on your prompt requirements. You define what inputs your prompt needs:
dataset = evaluator.generate_dataset(
    task_description="Write a compact, concise 1 day meal plan for a single athlete",
    prompt_inputs_spec={
        "height": "Athlete's height in cm",
        "weight": "Athlete's weight in kg",
        "goal": "Goal of the athlete",
        "restrictions": "Dietary restrictions of the athlete"
    },
    output_file="dataset.json",
    num_cases=3
)
What This Generates
The evaluator will create test cases like:
[
  {
    "height": "175 cm",
    "weight": "70 kg",
    "goal": "Build muscle mass",
    "restrictions": "Lactose intolerant"
  },
  {
    "height": "182 cm",
    "weight": "85 kg",
    "goal": "Lose body fat while maintaining strength",
    "restrictions": "Vegetarian, no soy"
  },
  {
    "height": "168 cm",
    "weight": "62 kg",
    "goal": "Improve endurance for marathon training",
    "restrictions": "Gluten-free"
  }
]
Development tip: Keep the number of test cases low (2-3) during development to speed up your iteration cycle. You can increase this to 10-20 for final validation.
Writing Your Initial Prompt
Start with a simple, naive prompt to establish a baseline. Here's an example of a deliberately basic first attempt:
def run_prompt(prompt_inputs):
    """
    Version 1: Baseline prompt (intentionally simple)
    """
    prompt = f"""
    What should this person eat?
    - Height: {prompt_inputs["height"]}
    - Weight: {prompt_inputs["weight"]}
    - Goal: {prompt_inputs["goal"]}
    - Dietary restrictions: {prompt_inputs["restrictions"]}
    """

    messages = []
    add_user_message(messages, prompt)
    return chat(messages)
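Note that run_prompt relies on two helpers, add_user_message and chat, which aren't defined in this article. A minimal sketch of what they might look like, here assuming the Anthropic Messages API (the client setup and model name are assumptions; any chat-completion API would work the same way):

# Hypothetical implementations of the helpers used above.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def add_user_message(messages, text):
    # Append a user turn to the running conversation.
    messages.append({"role": "user", "content": text})

def chat(messages):
    # Send the conversation and return the model's text reply.
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumption: any capable chat model works
        max_tokens=2048,
        messages=messages,
    )
    return response.content[0].text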
Why Start Simple?
This basic prompt will likely produce poor results, but it gives you:
- A baseline score to measure improvement against
- Clear evidence of what's missing
- A foundation to build upon systematically
Adding Evaluation Criteria
When running your evaluation, you can specify additional criteria that the grading model should consider:
results = evaluator.run_evaluation(
    run_prompt_function=run_prompt,
    dataset_file="dataset.json",
    extra_criteria="""
    The output should include:
    - Daily caloric total
    - Macronutrient breakdown (protein, carbs, fats in grams)
    - Meals with exact foods, portions, and timing
    - Alignment with the athlete's specific goal
    - Respect for all dietary restrictions
    """
)
This helps ensure your prompt is evaluated against the specific requirements that matter for your use case.
Example Grading Criteria
The evaluator will assess each output on:
- Completeness: Does it include all required elements?
- Accuracy: Are calorie and macro calculations correct?
- Specificity: Are portions and timing clearly defined?
- Personalization: Does it address the athlete's goal and restrictions?
- Usability: Is it easy to follow and implement?
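Under the hood, model grading generally works by handing the grader the task, the criteria, and the candidate output, then asking for a structured verdict. The exact prompt PromptEvaluator uses isn't shown here; the sketch below is one plausible shape:

import json

def build_grader_prompt(task_description, criteria, candidate_output):
    # Ask for JSON so the score and reasoning can be parsed reliably.
    return f"""
You are grading an AI-generated response against explicit criteria.

Task: {task_description}

Criteria:
{criteria}

Response to grade:
{candidate_output}

Score the response from 1 to 10 and explain your reasoning.
Reply with JSON only: {{"score": <int>, "reasoning": "<string>"}}
"""

def parse_grade(grader_reply):
    # Assumes the grading model followed the JSON-only instruction.
    verdict = json.loads(grader_reply)
    return verdict["score"], verdict["reasoning"]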
Analyzing Results
After running an evaluation, you'll get both a numerical score and a detailed HTML report. The report shows you exactly how each test case performed, including the model's reasoning for each score.
Example Baseline Results
📊 Evaluation Results - Version 1 (Baseline)
Test Case 1: 175cm, 70kg, Build muscle, Lactose intolerant
Score: 2/10
Reasoning: "Response is too vague. Says 'eat protein and vegetables'
but provides no specific foods, portions, or calorie targets.
Doesn't address lactose intolerance."
Test Case 2: 182cm, 85kg, Lose fat, Vegetarian no soy
Score: 3/10
Reasoning: "Mentions general food categories but lacks structure.
No meal timing, no macros, no calorie target. Suggests
tofu despite 'no soy' restriction."
Test Case 3: 168cm, 62kg, Marathon training, Gluten-free
Score: 2/10
Reasoning: "Generic advice about 'eating healthy carbs.' No specific
meal plan, no consideration of endurance athlete needs,
suggests whole wheat bread despite gluten-free requirement."
Average Score: 2.3/10
Don't Be Discouraged by Low Scores
A score of 2.3 out of 10 is typical for a first attempt. The goal is to see consistent improvement as you apply engineering techniques.
The detailed evaluation report helps you understand exactly where your prompt is failing and what improvements are needed. Use this feedback to guide your next iteration.
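If you'd rather inspect failures programmatically than in the HTML report, you can loop over the per-case details. The article only documents results['average_score'], so the per-case schema below is an assumption:

# Assumed schema: only results["average_score"] is documented above.
for i, case in enumerate(results.get("cases", []), start=1):
    print(f"Test Case {i}: {case['score']}/10")
    print(f"  Reasoning: {case['reasoning']}")

print(f"Average Score: {results['average_score']}/10")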
Applying Prompt Engineering Techniques
Now that you have a baseline, you can systematically apply techniques to improve performance. Here are common techniques and their expected impact:
Technique 1: Add Explicit Instructions
Before:
prompt = f"What should this person eat?"
After:
prompt = f"""
Create a detailed 1-day meal plan for an athlete with the following profile:
- Height: {prompt_inputs["height"]}
- Weight: {prompt_inputs["weight"]}
- Goal: {prompt_inputs["goal"]}
- Dietary restrictions: {prompt_inputs["restrictions"]}
Your meal plan must include:
1. Total daily calories
2. Macronutrient breakdown (protein/carbs/fats in grams)
3. 4-5 meals with specific foods and portion sizes
4. Meal timing throughout the day
"""
Expected improvement: +2-3 points
Technique 2: Provide Examples (Few-Shot Prompting)
prompt = f"""
Create a detailed 1-day meal plan for an athlete.
Example format:
**Daily Totals:** 2,800 calories | 180g protein | 320g carbs | 90g fat
**Meal 1 (7:00 AM):**
- 3 whole eggs, scrambled
- 2 slices whole grain toast
- 1 medium banana
- 1 cup black coffee
[... more meals ...]
Now create a plan for:
- Height: {prompt_inputs["height"]}
- Weight: {prompt_inputs["weight"]}
- Goal: {prompt_inputs["goal"]}
- Dietary restrictions: {prompt_inputs["restrictions"]}
"""
Expected improvement: +1-2 points
Technique 3: Add Role Context
prompt = f"""
You are a certified sports nutritionist with 10 years of experience
creating meal plans for competitive athletes.
Create a detailed 1-day meal plan for an athlete with this profile:
[... rest of prompt ...]
"""
Expected improvement: +0.5-1 point
Technique 4: Use Chain-of-Thought Reasoning
prompt = f"""
Create a 1-day meal plan for this athlete:
[... athlete details ...]
Before creating the meal plan, think through:
1. Calculate their estimated daily caloric needs based on height, weight, and goal
2. Determine optimal macro ratios for their specific goal
3. Identify foods that meet their restrictions
4. Plan meal timing to support their training
Then provide the complete meal plan.
"""
Expected improvement: +1-2 points
Iteration Example: Version 2
Let's apply several techniques at once and re-evaluate. (Combining changes speeds up this walkthrough; in your own iterations, prefer changing one thing at a time, as covered under Best Practices below.)
def run_prompt_v2(prompt_inputs):
    """
    Version 2: Added explicit instructions + structure + role context
    """
    prompt = f"""
    You are a certified sports nutritionist. Create a detailed 1-day meal plan
    for an athlete with the following profile:

    **Athlete Profile:**
    - Height: {prompt_inputs["height"]}
    - Weight: {prompt_inputs["weight"]}
    - Goal: {prompt_inputs["goal"]}
    - Dietary restrictions: {prompt_inputs["restrictions"]}

    **Required Output Format:**

    **Daily Totals:** [calories] | [protein]g protein | [carbs]g carbs | [fat]g fat

    **Meal 1 (Time):**
    - [Food item with portion]
    - [Food item with portion]
    [Calories: X | Protein: Xg | Carbs: Xg | Fat: Xg]

    [... repeat for 4-5 meals ...]

    **Important:**
    - Calculate calorie needs based on athlete's stats and goal
    - Ensure all foods comply with dietary restrictions
    - Provide specific portion sizes (grams, cups, etc.)
    - Include meal timing optimized for training
    """

    messages = []
    add_user_message(messages, prompt)
    return chat(messages)
Version 2 Results
📊 Evaluation Results - Version 2
Test Case 1: 175cm, 70kg, Build muscle, Lactose intolerant
Score: 7/10
Reasoning: "Provides complete meal plan with specific foods and portions.
Correctly avoids dairy. Includes calorie and macro totals.
Minor issue: could be more specific about training timing."
Test Case 2: 182cm, 85kg, Lose fat, Vegetarian no soy
Score: 8/10
Reasoning: "Excellent structure and detail. Respects both restrictions.
Appropriate calorie deficit for fat loss. Good protein sources
without soy. Well-timed meals."
Test Case 3: 168cm, 62kg, Marathon training, Gluten-free
Score: 7/10
Reasoning: "Good carb focus for endurance. All foods are gluten-free.
Includes timing around training. Could improve by adding
more electrolyte-rich foods for marathon prep."
Average Score: 7.3/10 (+5.0 improvement from baseline)
Tracking Progress Over Time
Keep a log of your iterations:
| Version | Techniques Applied | Avg Score | Change |
|---------|-------------------|-----------|--------|
| v1 (Baseline) | None | 2.3/10 | - |
| v2 | Explicit instructions + role + structure | 7.3/10 | +5.0 |
| v3 | Added few-shot examples | 8.1/10 | +0.8 |
| v4 | Added chain-of-thought | 8.7/10 | +0.6 |
| v5 | Refined output format | 9.2/10 | +0.5 |
This data-driven approach shows you exactly which techniques provide the most value.
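A lightweight way to keep this log is in code next to your prompt versions. The structure below is illustrative, not part of the evaluator:

# Illustrative iteration log; append one entry after each evaluation run.
iteration_log = []

def log_version(version, techniques, avg_score):
    previous = iteration_log[-1]["avg_score"] if iteration_log else None
    change = None if previous is None else round(avg_score - previous, 1)
    iteration_log.append({"version": version, "techniques": techniques,
                          "avg_score": avg_score, "change": change})

log_version("v1 (Baseline)", "None", 2.3)
log_version("v2", "Explicit instructions + role + structure", 7.3)

for entry in iteration_log:
    delta = "-" if entry["change"] is None else f"{entry['change']:+.1f}"
    print(f"{entry['version']}: {entry['avg_score']}/10 ({delta})")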
When to Stop Iterating
You can stop when any of the following holds (a loop encoding these rules is sketched after the list):
- You hit your target score (e.g., 9/10 or higher)
- Improvements plateau (changes yield <0.2 point gains)
- The prompt meets production requirements (tested on larger dataset)
- Cost/benefit analysis (further improvements aren't worth the effort)
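These stopping rules are easy to encode directly in your evaluation loop. A minimal sketch, assuming the evaluator interface from earlier and a hypothetical prompt_versions list of run-prompt functions:

# Sketch: iterate until the target score is hit or gains plateau.
TARGET_SCORE = 9.0
MIN_GAIN = 0.2

best_score = 0.0
for version, run_fn in enumerate(prompt_versions, start=1):
    results = evaluator.run_evaluation(
        run_prompt_function=run_fn,
        dataset_file="dataset.json",
    )
    score = results["average_score"]
    gain = score - best_score
    print(f"v{version}: {score}/10 ({gain:+.1f} vs best)")

    if score >= TARGET_SCORE:
        print("Target score reached; stop iterating.")
        break
    if 0 <= gain < MIN_GAIN:
        print("Gains have plateaued; consider stopping or trying a new technique.")
        break
    best_score = max(best_score, score)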
Best Practices
1. Change One Thing at a Time
Don't apply multiple techniques simultaneously—you won't know which one helped.
2. Use Version Control
Save each prompt version with clear labels (v1, v2, v3) so you can roll back if needed.
3. Test on Fresh Data
Periodically generate new test cases to ensure you're not overfitting to your initial dataset.
4. Document Your Reasoning
Keep notes on why you made each change and what you expected to improve.
5. Validate with Humans
Spot-check high-scoring outputs manually to ensure the grader isn't being too generous.
Common Pitfalls
❌ Making too many changes at once → Can't identify what worked
❌ Ignoring the evaluation feedback → Guessing instead of using data
❌ Overfitting to test cases → Prompt works on eval set but fails in production
❌ Stopping too early → Settling for "good enough" when excellence is achievable
❌ Not testing edge cases → Prompt fails on unusual inputs
Next Steps
With your baseline established and your first iteration complete, you're ready to:
- Learn specific prompt engineering techniques in depth
- Expand your test dataset to cover more edge cases
- Optimize for cost and latency while maintaining quality
- Deploy with confidence knowing your prompt has been rigorously tested
Remember that prompt engineering is an iterative process. The key is to make one change at a time, evaluate the impact, and build on what works. This systematic approach ensures you understand which techniques provide the most value for your specific use case.
Full Code Example
from prompt_evaluator import PromptEvaluator

# add_user_message and chat are the helper functions sketched earlier.

# Initialize evaluator
evaluator = PromptEvaluator(max_concurrent_tasks=3)

# Generate test dataset
dataset = evaluator.generate_dataset(
    task_description="Write a compact, concise 1 day meal plan for a single athlete",
    prompt_inputs_spec={
        "height": "Athlete's height in cm",
        "weight": "Athlete's weight in kg",
        "goal": "Goal of the athlete",
        "restrictions": "Dietary restrictions of the athlete"
    },
    output_file="dataset.json",
    num_cases=3
)

# Version 1: Baseline
def run_prompt_v1(prompt_inputs):
    prompt = f"""
    What should this person eat?
    - Height: {prompt_inputs["height"]}
    - Weight: {prompt_inputs["weight"]}
    - Goal: {prompt_inputs["goal"]}
    - Dietary restrictions: {prompt_inputs["restrictions"]}
    """
    messages = []
    add_user_message(messages, prompt)
    return chat(messages)

# Run evaluation
results_v1 = evaluator.run_evaluation(
    run_prompt_function=run_prompt_v1,
    dataset_file="dataset.json",
    extra_criteria="""
    The output should include:
    - Daily caloric total
    - Macronutrient breakdown
    - Meals with exact foods, portions, and timing
    """
)

print(f"Version 1 Average Score: {results_v1['average_score']}/10")

# Version 2: Improved
def run_prompt_v2(prompt_inputs):
    prompt = f"""
    You are a certified sports nutritionist. Create a detailed 1-day meal plan.

    **Athlete Profile:**
    - Height: {prompt_inputs["height"]}
    - Weight: {prompt_inputs["weight"]}
    - Goal: {prompt_inputs["goal"]}
    - Dietary restrictions: {prompt_inputs["restrictions"]}

    Include daily totals, 4-5 meals with specific foods and portions, and meal timing.
    """
    messages = []
    add_user_message(messages, prompt)
    return chat(messages)

results_v2 = evaluator.run_evaluation(
    run_prompt_function=run_prompt_v2,
    dataset_file="dataset.json",
    extra_criteria="""
    The output should include:
    - Daily caloric total
    - Macronutrient breakdown
    - Meals with exact foods, portions, and timing
    """
)

print(f"Version 2 Average Score: {results_v2['average_score']}/10")
print(f"Improvement: {results_v2['average_score'] - results_v1['average_score']:+.1f}")
Key Takeaways
- Start simple: Establish a baseline before optimizing
- Measure everything: Use objective scores to track progress
- Iterate systematically: One change at a time, evaluate, repeat
- Use the feedback: Let evaluation results guide your improvements
- Document your journey: Track what works and what doesn't
- Know when to stop: Hit your target score or reach diminishing returns
Systematic prompt engineering transforms guesswork into science, ensuring every change you make is backed by data and moves you closer to production-ready prompts.