Creating diverse, targeted test data for evaluation.
Define axes of variation, then generate combinations:

```python
dimensions = {
    "issue_type": ["billing", "technical", "shipping"],
    "customer_mood": ["frustrated", "neutral", "happy"],
    "complexity": ["simple", "moderate", "complex"],
}
```

- Generate tuples (combinations of dimension values)
- Convert to natural queries (separate LLM call per tuple)
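The tuple-generation step is just a Cartesian product over the dimension values. A minimal sketch using the `dimensions` dict above:

```python
from itertools import product

dimensions = {
    "issue_type": ["billing", "technical", "shipping"],
    "customer_mood": ["frustrated", "neutral", "happy"],
    "complexity": ["simple", "moderate", "complex"],
}

# Every combination of dimension values: 3 * 3 * 3 = 27 tuples
tuples = list(product(*dimensions.values()))
print(len(tuples))  # 27
```

In practice you may sample a subset rather than generate every combination, since each tuple costs one LLM call in the next step.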
```python
# Step 1: Create tuples
tuples = [
    ("billing", "frustrated", "complex"),
    ("shipping", "neutral", "simple"),
]

# Step 2: Convert each tuple to a natural-language query
def tuple_to_query(t):
    prompt = f"""Generate a realistic customer message:
Issue: {t[0]}, Mood: {t[1]}, Complexity: {t[2]}
Write naturally, include typos if appropriate. Don't be formulaic."""
    return llm(prompt)  # llm() is your LLM completion call
```

Dimensions should target known failures from error analysis:
```python
# From error analysis findings
dimensions = {
    "timezone": ["EST", "PST", "UTC", "ambiguous"],    # Known failure
    "date_format": ["ISO", "US", "EU", "relative"],    # Known failure
}
```

Post-process the generated queries:

- Validate: Check for placeholder text, minimum length
- Deduplicate: Remove near-duplicate queries using embeddings
- Balance: Ensure coverage across dimension values
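The validate and deduplicate steps can be sketched as follows. `SequenceMatcher` is a stdlib stand-in for embedding cosine similarity, and the placeholder markers are illustrative assumptions:

```python
from difflib import SequenceMatcher

PLACEHOLDERS = ("[insert", "{issue}", "lorem ipsum", "todo")

def is_valid(query, min_length=20):
    # Reject placeholder residue and trivially short generations
    q = query.lower()
    return len(query) >= min_length and not any(p in q for p in PLACEHOLDERS)

def dedupe(queries, threshold=0.9):
    # Keep a query only if it isn't too similar to any already kept;
    # swap SequenceMatcher for embedding similarity to catch paraphrases
    kept = []
    for q in queries:
        if all(SequenceMatcher(None, q.lower(), k.lower()).ratio() < threshold
               for k in kept):
            kept.append(q)
    return kept

queries = [
    "My bill is wrong, please fix it as soon as possible",
    "My bill is wrong, please fix it as soon as possible!",  # near-duplicate
    "[insert customer complaint here]",                      # placeholder residue
    "Where is my package? It was due Tuesday.",
]
cleaned = dedupe([q for q in queries if is_valid(q)])
print(len(cleaned))  # 2
```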
When to use synthetic vs. real data:

| Use Synthetic | Use Real Data |
|---|---|
| Limited production data | Sufficient traces |
| Testing edge cases | Validating actual behavior |
| Pre-launch evals | Post-launch monitoring |

Dataset size guidelines:

| Purpose | Size |
|---|---|
| Initial exploration | 50-100 |
| Comprehensive eval | 100-500 |
| Per-dimension | 10-20 per combination |
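The per-combination targets above can be checked with a simple coverage count. This sketch assumes each generated example is tagged with the tuple it was produced from (a hypothetical `dataset` structure):

```python
from collections import Counter

# Hypothetical dataset: each example keeps the tuple it was generated from
dataset = [
    {"tuple": ("billing", "frustrated", "complex"), "query": "..."},
    {"tuple": ("billing", "frustrated", "complex"), "query": "..."},
    {"tuple": ("shipping", "neutral", "simple"), "query": "..."},
]

counts = Counter(ex["tuple"] for ex in dataset)
target = 10  # lower bound of the 10-20 per-combination guideline
under_covered = [t for t, n in counts.items() if n < target]
print(under_covered)  # tuples that need more generated examples
```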