Prompt Evaluation Checklist
Run these 25 checks against any prompt before shipping it to production. If the prompt fails more than three of them, it is not ready for automation.
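The readiness rule above can be expressed as a small gate. This is a minimal sketch, not the checklist's official scoring: the `CheckResult` type and the `is_ready_for_automation` function are hypothetical names introduced for illustration, and every check is weighted equally.

```python
from dataclasses import dataclass

# From the rule above: more than three failed checks means "not ready".
MAX_FAILURES = 3


@dataclass
class CheckResult:
    """One checklist question and whether the prompt passed it (hypothetical type)."""
    question: str
    passed: bool


def is_ready_for_automation(results: list[CheckResult]) -> bool:
    """Return True when the prompt fails at most MAX_FAILURES checks."""
    failures = sum(1 for r in results if not r.passed)
    return failures <= MAX_FAILURES
```

A prompt with exactly three failures still passes this gate; a fourth failure blocks it.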
Clarity & Ambiguity
Does the prompt specify the output format (JSON, markdown, table, bullets)?
Pass
Output shape is explicit.
Fail
Model will guess the format, producing inconsistent results.
Does the prompt define the output length or section count?
Pass
Length constraints prevent runaway output.
Fail
No length ceiling means variable token cost and verbosity.
Are instructions specific enough that two people would produce the same output structure?
Pass
Low interpretation variance.
Fail
Ambiguous instructions mean every run is a surprise.
Does the prompt avoid vague or hedging words like 'something', 'maybe', 'kind of'?
Pass
Instruction language is precise.
Fail
Vague language invites inconsistent interpretation.
Is there a step-by-step instruction sequence rather than a single vague request?
Pass
Decomposed task is easier to execute reliably.
Fail
Monolithic instruction leaves execution order to the model.
Does the prompt include an example of the expected output?
Pass
Example anchors the model's output shape.
Fail
No example means the model infers structure from instructions alone.
Are there explicit ordering instructions (e.g., 'in this order')?
Pass
Output sections appear in a predictable sequence.
Fail
Section order may vary between runs.
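As an illustration of the clarity checks above, here is a sketch of a prompt builder that states the output format, length limits, step sequence, key order, and an example output. The ticket-summary task, the field names, and the word limits are all assumptions made up for this example.

```python
import json

# Hypothetical example output; the field names are illustrative only.
EXAMPLE_OUTPUT = {
    "summary": "Customer cannot log in after password reset.",
    "category": "account",
    "next_steps": ["Verify email address", "Resend reset link"],
}


def build_prompt(ticket_text: str) -> str:
    """Assemble a prompt that states format, length, order, and an example."""
    return "\n".join([
        "Follow these steps in order:",
        "1. Read the support ticket below.",
        "2. Write a summary of at most 25 words.",
        "3. Choose exactly one category: account, billing, or technical.",
        "4. List at most 3 next steps.",
        "",
        "Return a single JSON object with the keys summary, category, "
        "next_steps, in this order.",
        "Example output:",
        json.dumps(EXAMPLE_OUTPUT),
        "",
        "Ticket:",
        ticket_text,
    ])
```

Keeping the example output as a real data structure and serializing it means the example in the prompt is always valid JSON.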
Context Completeness
Does the prompt set a role or persona (e.g., 'You are a...')?
Pass
Role frames the model's perspective and vocabulary.
Fail
Without a role, the model defaults to generic assistant behavior.
Does the prompt name the intended audience for the output?
Pass
Audience calibrates complexity and jargon level.
Fail
Unknown audience means unknown calibration.
Does the prompt state the goal or success criteria?
Pass
Model knows what 'done right' looks like.
Fail
Model optimizes for plausible output, not correct output.
Does the prompt include constraints (what to avoid, what not to do)?
Pass
Negative constraints reduce failure modes.
Fail
No constraints means the model has no boundaries.
Is the business context or use case stated?
Pass
Model can prioritize relevance over completeness.
Fail
Output may be comprehensive but not useful.
Automation Robustness
Does the prompt handle missing or ambiguous input (e.g., 'return null if unknown')?
Pass
Graceful degradation on edge cases.
Fail
Model may fabricate data to fill gaps.
Are there conditional instructions ('if X, do Y; otherwise, do Z')?
Pass
Branching logic handles variation.
Fail
Single-path instruction breaks on unexpected input.
Does the prompt define what to do when confidence is low?
Pass
Uncertainty is surfaced, not hidden.
Fail
Low-confidence output looks the same as high-confidence output.
Does the prompt include validation or verification instructions?
Pass
Model self-checks before outputting.
Fail
No self-check means errors propagate silently.
Can the prompt handle empty, very short, or very long input?
Pass
Boundary conditions are addressed.
Fail
Extreme inputs will produce unpredictable results.
Does the prompt specify a fallback action for unclassifiable input?
Pass
Safety valve catches edge cases.
Fail
Model is forced to classify even when it should not.
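The robustness checks above also have a counterpart on the consuming side: the code that reads the model's output should accept the null and fallback cases the prompt asks for. The sketch below assumes a hypothetical classification prompt that returns JSON with a `label` key; the label set and the `unclassifiable` fallback value are invented for illustration.

```python
import json
from typing import Optional

ALLOWED_LABELS = {"account", "billing", "technical"}  # hypothetical label set
FALLBACK_LABEL = "unclassifiable"  # hypothetical fallback named in the prompt


def parse_label(raw_output: str) -> Optional[str]:
    """Return an allowed label, the fallback label, or None.

    None means the model returned null or the output was unusable.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return None  # not valid JSON: treat as unusable
    if not isinstance(data, dict):
        return None  # valid JSON but not the object the prompt asked for
    label = data.get("label")
    if isinstance(label, str) and (label == FALLBACK_LABEL or label in ALLOWED_LABELS):
        return label
    # Covers null ("return null if unknown"), non-strings, and unexpected labels.
    return None
```

Returning None instead of guessing keeps a bad or missing label from flowing into downstream systems as if it were a real classification.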
Cost Efficiency
Is the prompt asking for the minimum necessary output (not 'give me everything')?
Pass
Targeted extraction minimizes token cost.
Fail
Open-ended request burns tokens on irrelevant output.
Does the prompt constrain output length per section?
Pass
Length ceilings prevent cost drift.
Fail
No length constraint means variable cost per run.
Does the prompt avoid redundant or overlapping instructions?
Pass
Each instruction adds unique value.
Fail
Redundant instructions waste tokens and confuse execution.
Failure Mode Safety
Does the prompt include an anti-hallucination instruction (e.g., 'do not invent facts')?
Pass
Model is explicitly told not to fabricate.
Fail
Model may fill gaps with plausible but fabricated content.
Does the prompt instruct the model to flag uncertainty?
Pass
Uncertainty is marked, not hidden.
Fail
Uncertain output appears confident, misleading downstream systems.
Does the prompt guard against harmful, biased, or off-topic output?
Pass
Safety boundaries reduce risk.
Fail
No guardrails means the model can produce harmful content.
Is there a human-review trigger for low-confidence or edge-case outputs?
Pass
Human-in-the-loop catches what the prompt cannot.
Fail
All output is treated as equally reliable.
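A human-review trigger can be as simple as a routing function on the parsed output. This sketch assumes the prompt asks the model to return a numeric `confidence` and an optional `uncertain` flag; both field names and the 0.7 threshold are hypothetical, and self-reported confidence is only a rough signal, so the cutoff should be tuned against reviewed samples.

```python
REVIEW_THRESHOLD = 0.7  # hypothetical cutoff; tune for your own pipeline


def needs_human_review(output: dict) -> bool:
    """Return True when the output should be routed to a person."""
    confidence = output.get("confidence")
    # A missing or non-numeric confidence counts as an edge case.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return True
    # The model explicitly flagged uncertainty.
    if output.get("uncertain") is True:
        return True
    return confidence < REVIEW_THRESHOLD
```

Defaulting to review when the confidence field is absent means a malformed response is never treated as a reliable one.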
Automate the checklist
The Prompt Evaluator runs these checks (and more) automatically. Paste any prompt and get a weighted score, critical issue flags, and concrete fix recommendations in seconds.
If your team is running this checklist repeatedly before shipping prompts, that is the point where paid QA becomes easier to justify.