
Prompt Evaluation Checklist


Run these 25 checks against any prompt before shipping it to production. If the prompt fails more than three of them, it is not ready for automation.
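The three-failure threshold can be applied mechanically once each check has been marked pass or fail. The sketch below is one minimal way to do that; the function name and the check labels are hypothetical, and only the threshold of three comes from this guide.

```python
from typing import Dict

# From the guide: more than three failed checks means "not ready".
MAX_FAILURES = 3


def is_ready_for_automation(results: Dict[str, bool]) -> bool:
    """Return True when at most MAX_FAILURES checks failed.

    `results` maps a check label (any naming you like) to True for a pass
    and False for a fail.
    """
    failures = sum(1 for passed in results.values() if not passed)
    return failures <= MAX_FAILURES
```

A prompt with exactly three failed checks still counts as ready under this rule; a fourth failure blocks it.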

Clarity & Ambiguity

Does the prompt specify the output format (JSON, markdown, table, bullets)?

Pass

Output shape is explicit.

Fail

Model will guess the format, producing inconsistent results.
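An explicit format also makes the output checkable in code. As a rough sketch, assuming the prompt asks for a JSON object with a known set of keys (the key names in the usage note are hypothetical), a downstream step could reject anything that does not match:

```python
import json
from typing import Optional, Set


def parse_json_output(raw: str, required_keys: Set[str]) -> Optional[dict]:
    """Return the parsed object, or None if the output is not the expected shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # The model answered in prose or produced malformed JSON.
        return None
    # The top level must be an object that contains every required key.
    if not isinstance(data, dict) or not required_keys.issubset(data):
        return None
    return data
```

For example, `parse_json_output(reply, {"title", "summary"})` returns `None` when the model replies with free text instead of the requested object.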

Does the prompt define the output length or section count?

Pass

Length constraints prevent runaway output.

Fail

No length ceiling means variable token cost and verbosity.

Are instructions specific enough that two people would produce the same output structure?

Pass

Low interpretation variance.

Fail

Ambiguous instructions mean every run is a surprise.

Does the prompt avoid vague or hedging words like 'something', 'maybe', 'kind of'?

Pass

Instruction language is precise.

Fail

Vague language invites inconsistent interpretation.
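This check is easy to approximate with a word list. The sketch below scans a prompt for the three terms named in the check; the list is only a starting point and the function name is hypothetical.

```python
import re
from typing import List

# The three terms named in the check; extend the tuple for your own prompts.
VAGUE_TERMS = ("something", "maybe", "kind of")


def find_vague_terms(prompt: str) -> List[str]:
    """Return the vague terms that appear in the prompt as whole words."""
    found = []
    for term in VAGUE_TERMS:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, prompt, flags=re.IGNORECASE):
            found.append(term)
    return found
```

An empty result does not prove the prompt is precise; it only rules out the listed terms.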

Is there a step-by-step instruction sequence rather than a single vague request?

Pass

Decomposed task is easier to execute reliably.

Fail

Monolithic instruction leaves execution order to the model.

Does the prompt include an example of the expected output?

Pass

Example anchors the model's output shape.

Fail

No example means the model infers structure from instructions alone.

Are there explicit ordering instructions (e.g., 'in this order')?

Pass

Output sections appear in a predictable sequence.

Fail

Section order may vary between runs.
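When a prompt fixes the section order, the order can be verified after each run. A minimal sketch, assuming the section headings appear verbatim in the output (the heading names in the usage note are hypothetical):

```python
from typing import List


def sections_in_order(output: str, expected: List[str]) -> bool:
    """Return True if every expected heading appears, in the given order."""
    position = 0
    for heading in expected:
        index = output.find(heading, position)
        if index == -1:
            # Heading is missing, or appears only before an earlier heading.
            return False
        position = index + len(heading)
    return True
```

For instance, `sections_in_order(reply, ["Summary", "Risks", "Next steps"])` fails a reply that puts "Risks" before "Summary".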

Context Completeness

Does the prompt set a role or persona (e.g., 'You are a...')?

Pass

Role frames the model's perspective and vocabulary.

Fail

Without a role, the model defaults to generic assistant behavior.

Does the prompt name the intended audience for the output?

Pass

Audience calibrates complexity and jargon level.

Fail

Unknown audience means unknown calibration.

Does the prompt state the goal or success criteria?

Pass

Model knows what 'done right' looks like.

Fail

Model optimizes for plausible output, not correct output.

Does the prompt include constraints (what to avoid, what not to do)?

Pass

Negative constraints reduce failure modes.

Fail

No constraints means the model has no boundaries.

Is the business context or use case stated?

Pass

Model can prioritize relevance over completeness.

Fail

Output may be comprehensive but not useful.

Automation Robustness

Does the prompt handle missing or ambiguous input (e.g., 'return null if unknown')?

Pass

Graceful degradation on edge cases.

Fail

Model may fabricate data to fill gaps.
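A 'return null if unknown' instruction only helps if the consuming code treats null as a valid answer. The sketch below, with hypothetical field names in the usage note, lists the fields the model left unknown so they can be handled explicitly instead of being silently filled in. It treats an absent key the same as an explicit null, which is an assumption you may want to tighten.

```python
import json
from typing import List


def missing_fields(raw: str, fields: List[str]) -> List[str]:
    """Return the requested fields that are null or absent in the JSON output."""
    data = json.loads(raw)
    return [name for name in fields if data.get(name) is None]
```

With the output `{"name": "Ada", "email": null}` and the fields `["name", "email"]`, the function reports `["email"]`.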

Are there conditional instructions ('if X, do Y; otherwise, do Z')?

Pass

Branching logic handles variation.

Fail

Single-path instruction breaks on unexpected input.

Does the prompt define what to do when confidence is low?

Pass

Uncertainty is surfaced, not hidden.

Fail

Low-confidence output looks the same as high-confidence output.

Does the prompt include validation or verification instructions?

Pass

Model self-checks before outputting.

Fail

No self-check means errors propagate silently.

Can the prompt handle empty, very short, or very long input?

Pass

Boundary conditions are addressed.

Fail

Extreme inputs will produce unpredictable results.
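Boundary cases can also be caught before the prompt is ever sent. The following is a small sketch of such a guard; the thresholds are hypothetical and should be tuned to the task and the model's context window.

```python
MIN_CHARS = 20     # hypothetical lower bound for a usable input
MAX_CHARS = 8000   # hypothetical upper bound before truncating or chunking


def classify_input_size(text: str) -> str:
    """Label an input as 'empty', 'too_short', 'too_long', or 'ok'."""
    stripped = text.strip()
    if not stripped:
        return "empty"
    if len(stripped) < MIN_CHARS:
        return "too_short"
    if len(stripped) > MAX_CHARS:
        return "too_long"
    return "ok"
```

Each label other than 'ok' can then map to the behavior the prompt defines for that case, such as skipping the call or chunking the input.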

Does the prompt specify a fallback action for unclassifiable input?

Pass

Safety valve catches edge cases.

Fail

Model is forced to classify even when it should not.
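The same fallback is worth enforcing on the consuming side. In this sketch the label set and the fallback name are hypothetical; anything the model returns outside the allowed set is mapped to the fallback rather than passed through.

```python
# Hypothetical label set for a support-ticket classifier.
ALLOWED_LABELS = {"billing", "technical", "account"}
FALLBACK_LABEL = "unclassified"


def normalize_label(raw: str) -> str:
    """Return a known label, or the fallback for anything unexpected."""
    label = raw.strip().lower()
    return label if label in ALLOWED_LABELS else FALLBACK_LABEL
```

This pairs with a prompt line such as "If none of the categories fit, answer 'unclassified'", so the model has a legitimate way out.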

Cost Efficiency

Is the prompt asking for the minimum necessary output (not 'give me everything')?

Pass

Targeted extraction minimizes token cost.

Fail

Open-ended request burns tokens on irrelevant output.

Does the prompt constrain output length per section?

Pass

Length ceilings prevent cost drift.

Fail

No length constraint means variable cost per run.
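Per-section ceilings can be monitored after the fact as well. A minimal sketch, assuming the output has already been split into named sections and that limits are expressed in words (both assumptions, as are the section names in the usage note):

```python
from typing import Dict, List


def sections_over_limit(sections: Dict[str, str], max_words: Dict[str, int]) -> List[str]:
    """Return the names of sections whose word count exceeds their limit."""
    return [
        name
        for name, text in sections.items()
        if name in max_words and len(text.split()) > max_words[name]
    ]
```

For example, with limits of 5 words for "summary" and 4 for "risks", a six-word "risks" section is reported while a three-word "summary" is not.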

Does the prompt avoid redundant or overlapping instructions?

Pass

Each instruction adds unique value.

Fail

Redundant instructions waste tokens and confuse execution.

Failure Mode Safety

Does the prompt include an anti-hallucination instruction (e.g., 'do not invent facts')?

Pass

Model is explicitly told not to fabricate.

Fail

Model may fill gaps with plausible but fabricated content.

Does the prompt instruct the model to flag uncertainty?

Pass

Uncertainty is marked, not hidden.

Fail

Uncertain output appears confident, misleading downstream systems.

Does the prompt guard against harmful, biased, or off-topic output?

Pass

Safety boundaries reduce risk.

Fail

No guardrails means the model can produce harmful content.

Is there a human-review trigger for low-confidence or edge-case outputs?

Pass

Human-in-the-loop catches what the prompt cannot.

Fail

All output is treated as equally reliable.
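A review trigger can be as simple as a threshold on a confidence value. The sketch below assumes the prompt asks the model to report a confidence between 0.0 and 1.0 alongside its answer; the threshold, class, and route names are all hypothetical, and self-reported confidence is only a rough signal.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.7  # hypothetical cut-off; tune against reviewed samples


@dataclass
class ModelOutput:
    text: str
    confidence: float  # self-reported by the model, 0.0 to 1.0


def route(output: ModelOutput) -> str:
    """Send low-confidence outputs to a person, accept the rest automatically."""
    if output.confidence < REVIEW_THRESHOLD:
        return "human_review"
    return "auto_accept"
```

Outputs routed to "human_review" can be queued for a person, while the rest continue through the automated pipeline.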

Automate the checklist

The Prompt Evaluator runs these checks (and more) automatically. Paste any prompt and get a weighted score, critical issue flags, and concrete fix recommendations in seconds.

If your team is running this checklist repeatedly before shipping prompts, that is the point where paid QA becomes easier to justify.