Prompt Testing Best Practices
7 min read
Most prompts are tested once with a clean input and shipped. Here is how to test prompts systematically so they do not break in production.
1. Test with representative inputs, not just ideal ones
Most people test prompts with clean, well-formatted input. Production input is messy, incomplete, and sometimes hostile. If your prompt only works on ideal input, it is not production-ready.
Steps
- 1. Collect 10-20 real inputs from the environment where the prompt will run.
- 2. Classify them: typical (60%), edge-case (20%), hostile (20%).
- 3. Run the prompt against all three categories and record failures separately.
- 4. Fix failures in the edge-case and hostile categories first, because those are the ones that break automation.
Anti-pattern
Testing with a single clean example and assuming the prompt works for all inputs.
2. Regression-test after every change
A prompt change that fixes one failure can break three others. Regression testing catches this before deployment.
Steps
- 1. Maintain a test suite of 5-10 inputs with expected output structures (not exact text, which varies).
- 2. After any prompt edit, run the full suite and compare output structure against the expected structure.
- 3. Flag any output that changes structure (missing sections, new sections, wrong format).
- 4. If a structure change is intentional, update the test suite. If not, revert the prompt change.
Anti-pattern
Changing a prompt and only testing the one case you were trying to fix.
3. Test boundary conditions explicitly
Boundary conditions are where prompts fail most often: empty input, very short input, very long input, input in the wrong language, input with special characters.
Steps
- 1. Define the minimum and maximum input length your prompt should handle.
- 2. Test with exactly the minimum, exactly the maximum, one below minimum, and one above maximum.
- 3. Test with empty input, non-text input (if applicable), and mixed-language input.
- 4. For each boundary case, verify that the prompt degrades gracefully rather than crashing or fabricating.
Anti-pattern
Assuming the prompt will handle edge cases because it handles normal cases well.
4. Version-control your prompts
Without version control, you cannot tell which change caused a regression, and you cannot roll back to a known-good state.
Steps
- 1. Store prompts in version-controlled files (not just in code comments or prompt playgrounds).
- 2. Tag each version with the test results (pass/fail on each test case).
- 3. When deploying a new version, tag the old version as a rollback target.
- 4. Include the prompt version in output metadata so you can trace issues to specific versions.
Anti-pattern
Editing prompts in a playground and copy-pasting into production without tracking changes.
5. Measure cost per run, not just quality
A prompt that produces perfect output but costs $0.50 per run is not viable if you are running it 1,000 times a day. Cost is a first-class quality metric.
Steps
- 1. Record input tokens, output tokens, and latency for each test run.
- 2. Calculate cost per run using your provider's pricing.
- 3. Set a cost ceiling per run and flag any run that exceeds it.
- 4. If cost is too high, look for opportunities to narrow the output scope, reduce output length, or split the task into smaller prompts.
Anti-pattern
Optimizing for output quality without tracking token cost, then discovering the prompt is too expensive to run at scale.
6. Use an evaluation loop, not a one-shot check
Prompt quality is not binary. An evaluation loop scores the prompt, identifies issues, applies fixes, and re-scores until the score meets a threshold.
Steps
- 1. Run the prompt through an evaluator (like the Prompt Evaluator) to get a baseline score.
- 2. Review the critical issues and fix the highest-severity one first.
- 3. Re-run the evaluator. If the score improved by more than 5 points, commit the change.
- 4. Repeat until the score exceeds your threshold (70+ for automation, 80+ for critical workflows).
Anti-pattern
Writing a prompt, eyeballing the output once, and shipping it without a systematic quality check.
Automate your evaluation loop
The Prompt Evaluator scores prompts across five dimensions, flags critical issues, and recommends fixes. Use it as the evaluation step in your iteration loop instead of eyeballing output.
The free evaluator is enough for one-off tests. The paid path makes more sense when prompt regression checks become part of normal operations.