Prompt Evaluation Benchmarks

Real teardowns showing how prompts and workflows score across five dimensions. Each benchmark includes the original input, dimension scores, critical issues, and concrete fixes.

3 weak · 4 usable · 1 strong · 4 workflow benchmarks

Start with the live comparison

The fastest proof on this page sits above the teardown archive: run the weak research prompt here, compare it against the strongest workflow demo on the same page, and choose the paid path only after you have seen the score gap.

Choose the next step after the score

Keep the proof in view while you decide what to do next

Once the weak-vs-strong difference is clear above, use the same-page chooser below to rerun the strongest demo, replace the canned text with your own prompt, take the lower-friction paid step, or jump straight to recurring QA.

Vague research summarizer

Verdict: weak

weak · 28/100

Original input

Summarize this article in 3 bullet points.

Dimension breakdown

Clarity & Ambiguity

15/100 · 25%

Single vague instruction with no output shape or role definition.

Context Completeness

5/100 · 20%

No role, no audience, no examples, no success criteria.

Automation Robustness

10/100 · 25%

No conditionals, no validation, no boundaries for edge cases.

Cost Efficiency

70/100 · 15%

Short prompt and constrained output length.

Failure Mode Safety

35/100 · 15%

No anti-hallucination guidance or uncertainty handling.
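
The percentages beside each dimension score read like weights. A minimal sketch of that reading follows, assuming the overall score is a plain weighted average; the function and variable names are illustrative and not taken from the evaluator. Under that assumption this breakdown works out to 23.0 rather than the displayed 28/100, which suggests the evaluator applies adjustments this page does not describe.

```python
# Dimension scores and weights copied from the breakdown above.
# Treating the percentages as weights is an assumption, not documented behavior.
SCORES = {
    "clarity_ambiguity": (15, 25),
    "context_completeness": (5, 20),
    "automation_robustness": (10, 25),
    "cost_efficiency": (70, 15),
    "failure_mode_safety": (35, 15),
}


def weighted_score(scores: dict) -> float:
    """Return the weighted average of (score, weight_percent) pairs."""
    total_weight = sum(weight for _, weight in scores.values())
    if total_weight != 100:
        raise ValueError(f"weights sum to {total_weight}, expected 100")
    return sum(score * weight for score, weight in scores.values()) / total_weight


print(weighted_score(SCORES))  # 23.0
```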

Critical issues

No explicit output format specified

high

Define the bullet structure: each bullet should start with a category label, contain one key finding, and stay under 30 words.

No role or operating context

medium

Set the model's frame: 'You are a research analyst producing executive summaries for a non-technical audience.'

No safety or boundary instructions

medium

Add: 'Do not add information not present in the article. Flag if the article is too short to summarize meaningfully.'

Prompt is probably too short for dependable automation

high

Expand to include context, expected output shape, edge-case handling, and quality constraints.
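
One way to make the bullet-structure fix enforceable is to check the model's output after the fact. The sketch below assumes each bullet looks like "- Label: finding" and treats "under 30 words" as fewer than 30 words in the finding; both the shape and the helper name `check_summary` are illustrative choices, not part of the evaluator.

```python
import re

BULLET_MAX_WORDS = 30  # per-bullet limit taken from the fix above
# Assumed bullet shape: a marker, a category label, a colon, then the finding.
BULLET = re.compile(r"^[-•*]\s*(?P<label>[A-Za-z][\w ]*):\s*(?P<body>.+)$")


def check_summary(text: str, expected_bullets: int = 3) -> list:
    """Return a list of problems; an empty list means the output passed."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    problems = []
    if len(lines) != expected_bullets:
        problems.append(f"expected {expected_bullets} bullets, got {len(lines)}")
    for i, line in enumerate(lines, start=1):
        match = BULLET.match(line)
        if match is None:
            problems.append(f"bullet {i}: missing 'Label: finding' shape")
            continue
        words = len(match["body"].split())
        if words >= BULLET_MAX_WORDS:
            problems.append(f"bullet {i}: {words} words, must stay under {BULLET_MAX_WORDS}")
    return problems
```

A non-empty result can trigger a retry or a human review instead of passing malformed output downstream.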

Strengths

  • The prompt has enough material to analyze and iterate on.

Recommendations

  • Improve clarity & ambiguity by defining the bullet structure and a per-bullet word limit.
  • Improve context completeness by setting a role and naming the audience for the summary.
  • Improve automation robustness by stating what to return when the article is too short to summarize.

Key takeaway

A 7-word prompt looks efficient but guarantees inconsistent output at scale. Adding role, format, and guardrails takes it from 'hope for the best' to a repeatable process.

Structured meeting notes extractor

Verdict: strong

strong · 82/100

Original input

You are an operations assistant for a small agency. Review the following client meeting notes and return: (1) a concise summary, (2) decisions made, (3) action items with owner, deadline, and priority, and (4) follow-up questions. Use markdown sections exactly in that order. If an owner or deadline is missing, return null instead of guessing. Do not invent facts that are not in the notes. If the notes are ambiguous, flag the uncertainty clearly before the relevant item.

Dimension breakdown

Clarity & Ambiguity

85/100 · 25%

Prompt includes explicit output structure and markdown section ordering.

Context Completeness

75/100 · 20%

Sets a role and operating frame, defines success criteria.

Automation Robustness

80/100 · 25%

Handles missing data (null returns) and ambiguous input (uncertainty flagging).

Cost Efficiency

65/100 · 15%

Reasonably compact for the task depth required.

Failure Mode Safety

90/100 · 15%

Explicit anti-hallucination instruction and uncertainty escalation.

Strengths

  • Automation Robustness is reasonably strong.
  • Failure Mode Safety is reasonably strong.
  • Clarity & Ambiguity is reasonably strong.

Recommendations

  • This prompt is in good shape. Next step: test it on 3-5 representative real inputs.

Key takeaway

This prompt scores 'strong' because it defines the output shape, handles missing data explicitly, and guards against hallucination. The one gap: no explicit length constraint on each section, which can cause cost drift on long meetings.

Unbounded customer support triage

Verdict: usable

usable · 58/100

Original input

You are a customer support agent. Read the customer message and classify it as one of: billing_issue, technical_bug, feature_request, or account_question. Then write a reply that acknowledges the problem, gives the most likely fix, and tells the customer what happens next. Be friendly but concise.

Dimension breakdown

Clarity & Ambiguity

70/100 · 25%

Classification categories are explicit, but reply format is unspecified.

Context Completeness

55/100 · 20%

Role is set, but no examples, no audience context, and limited success criteria.

Automation Robustness

35/100 · 25%

No handling for messages that don't fit any category, or for messages with multiple issues.

Cost Efficiency

60/100 · 15%

'Concise' is subjective — no word limit or section structure for the reply.

Failure Mode Safety

50/100 · 15%

No instruction for when the model is unsure about classification, or when 'most likely fix' could be harmful.

Critical issues

No fallback for unclassifiable messages

high

Add a fifth category like 'other' or 'needs_human' and instruct the model to use it when confidence is low.

No safety or boundary instructions

medium

Add: 'Never suggest actions that could damage the customer's account. If unsure about a technical fix, recommend contacting support instead.'

No explicit output format specified

medium

Define the reply template: greeting → classification → fix → next steps, with max lengths per section.
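
The fallback-category fix can also be enforced in code around the model call. This is a minimal sketch, assuming the model returns a label plus a confidence value; the 0.7 threshold and the name `resolve_category` are hypothetical and would need tuning on real tickets.

```python
# The four categories come from the original prompt; the fallback is the
# 'needs_human' option suggested in the fix above.
ALLOWED = {"billing_issue", "technical_bug", "feature_request", "account_question"}
FALLBACK = "needs_human"
MIN_CONFIDENCE = 0.7  # hypothetical threshold, not from the source


def resolve_category(label: str, confidence: float) -> str:
    """Map a raw model label to an allowed category or the fallback."""
    cleaned = label.strip().lower()
    if cleaned not in ALLOWED or confidence < MIN_CONFIDENCE:
        return FALLBACK
    return cleaned
```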

Strengths

  • Clarity & Ambiguity is reasonably strong.

Recommendations

  • Improve automation robustness by adding conditionals for edge cases and multi-issue messages.
  • Improve failure mode safety by adding constraints on what the model should not do or say.

Key takeaway

Classification + freeform reply is a common pattern, but leaving the reply unbounded means every output will be shaped differently. Adding a reply template and an 'uncertain → escalate' rule takes this from 'usable' to 'strong'.

Multi-step agent workflow

Verdict: usable

usable · 55/100

Original input

You are a research agent. Find information about the company I specify. First, search the web for the company name and collect the top 5 results. Then, for each result, extract key facts about the company: founding year, industry, revenue if available, and number of employees. Finally, compile all the facts into a summary report.

Dimension breakdown

Clarity & Ambiguity

60/100 · 25%

Multi-step structure is defined, but 'key facts' and 'summary report' are vague — no output format, no section ordering, no length constraints.

Context Completeness

50/100 · 20%

Role is set, but no audience, no use case, no success criteria for what makes a 'good' summary.

Automation Robustness

30/100 · 25%

No handling for: company not found, conflicting facts across sources, missing data fields, or search returning no results. Each step can fail silently.

Cost Efficiency

40/100 · 15%

'Collect the top 5 results' and extract facts from each is open-ended — no token ceiling per step or per result.

Failure Mode Safety

45/100 · 15%

No instruction to flag uncertain data, verify facts across sources, or handle contradictory information. Agent could present unverified data as fact.

Critical issues

No error handling for failed search steps

high

Add: 'If the search returns no results, return { "status": "no_results", "company": "[name]" } and stop. Do not fabricate company information.'

No cross-source verification instruction

high

Add: 'If two sources contradict on a fact, flag the conflict and include both values with source attribution instead of picking one.'

No output format for the summary report

medium

Define the report template: Company Overview → Key Facts Table → Source Attribution → Confidence Notes. Use markdown with specific section headers.

'Key facts' is undefined — which facts matter?

medium

Replace 'key facts' with an explicit list: 'founding_year, industry, headquarters_city, latest_revenue, employee_count, ceo_name'. Mark any unavailable field as 'data_unavailable' instead of skipping it.

No token or length budget per step

low

Add per-step limits: 'Extract no more than 50 words per result. The final summary must not exceed 500 words.'
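
Taken together, the no-results, conflict, and missing-field fixes describe a small merge routine. The sketch below assumes each search result has already been reduced to a dict of the form {"source": ..., "facts": {...}}; that input shape and the name `compile_facts` are illustrative, not something the original prompt defines.

```python
# Field list taken from the 'key facts' fix above.
FIELDS = ["founding_year", "industry", "headquarters_city",
          "latest_revenue", "employee_count", "ceo_name"]


def compile_facts(company: str, results: list) -> dict:
    """Merge per-source facts, keeping conflicts visible instead of picking one."""
    if not results:
        return {"status": "no_results", "company": company}
    report = {"status": "ok", "company": company, "facts": {}}
    for field in FIELDS:
        # Collect every distinct value together with the sources that reported it.
        seen = {}
        for result in results:
            value = result["facts"].get(field)
            if value is not None:
                seen.setdefault(str(value), []).append(result["source"])
        if not seen:
            report["facts"][field] = "data_unavailable"
        elif len(seen) == 1:
            report["facts"][field] = next(iter(seen))
        else:
            report["facts"][field] = {"conflict": seen}
    return report
```

Every field appears in the report, so a downstream reader can tell "not found" apart from "sources disagree".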

Strengths

  • The multi-step structure (search → extract → compile) is explicitly defined.
  • A role is set, which frames the agent's operating mode.
  • The prompt decomposes a complex task into sequential steps, which is better than a monolithic request.

Recommendations

  • Add error handling and stop conditions for each step (search fails, no results, conflicting data).
  • Replace vague terms ('key facts', 'summary report') with explicit field lists and output templates.
  • Add cross-source verification instructions to prevent presenting unverified data as fact.
  • Add per-step token budgets to control cost on large result sets.

Key takeaway

Multi-step agent prompts look powerful because they decompose a task. But without error handling at each step, cross-source verification, and defined output formats, they fail silently in production. The most dangerous prompt is one that looks like it should work but has hidden gaps that only appear under messy real-world conditions.

Automated invoice processing pipeline

Verdict: usable

usable · 61/100

Original input

INVOICE PROCESSING WORKFLOW

Step 1 — Extract: Read the uploaded invoice document and extract: vendor_name, invoice_number, invoice_date, line_items (array of {description, quantity, unit_price, total}), and total_amount.

Step 2 — Validate: Check that extracted totals match. If line_item totals don't sum to the stated total_amount, flag the discrepancy as {status: "mismatch", delta: NUMBER}. If any required field is missing, return {status: "missing_field", field: "NAME"} and stop.

Step 3 — Categorize: Based on the vendor name and line item descriptions, assign a spending category from: software_subscription, professional_services, infrastructure, marketing, office_supplies. Output as {category: "CATEGORY", confidence: 0.0-1.0}.

Step 4 — Route: If total_amount > $10,000, output {needs_approval: true, approver: "CFO"}. If total_amount < $500, output {needs_approval: false, auto_approve: true}. Otherwise output {needs_approval: true, approver: "Manager"}.

Step 5 — Output: Produce a structured JSON object with all extracted and computed fields combined. Use null for any field that could not be determined.
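
Steps 2 and 4 are deterministic rules, so they can be checked outside the model. The sketch below is one possible reading: the {"status": "ok"} result and the sign of `delta` (stated total minus summed line items) are assumptions, since the prompt specifies neither. Note that totals of exactly $500 and exactly $10,000 fall through to the Manager branch.

```python
from decimal import Decimal


def validate_totals(line_items: list, total_amount: Decimal) -> dict:
    """Step 2: compare the sum of line-item totals with the stated total."""
    computed = sum((item["total"] for item in line_items), Decimal("0"))
    delta = total_amount - computed  # sign convention is an assumption
    if delta != 0:
        return {"status": "mismatch", "delta": delta}
    return {"status": "ok"}  # 'ok' is illustrative; the prompt defines no success shape


def route(total_amount: Decimal) -> dict:
    """Step 4: map the invoice total to an approval path."""
    if total_amount > Decimal("10000"):
        return {"needs_approval": True, "approver": "CFO"}
    if total_amount < Decimal("500"):
        return {"needs_approval": False, "auto_approve": True}
    return {"needs_approval": True, "approver": "Manager"}
```

Using `Decimal` rather than floats avoids spurious mismatches from binary rounding on currency values.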

Dimension breakdown

Clarity & Ambiguity

75/100 · 25%

Five explicit steps with named outputs and clear transitions. Steps 1 through 4 name their output fields; Step 5 leaves the final schema open.

Context Completeness

55/100 · 20%

Step purposes are clear, but no mention of what to do with non-PDF formats, scanned invoices, or non-English documents.

Automation Robustness

55/100 · 25%

Some error handling exists (missing fields, totals mismatch), but no handling for corrupted files, password-protected PDFs, or duplicate invoices.

Cost Efficiency

60/100 · 15%

Pipeline is reasonably scoped per step. No token budgets defined, but task complexity keeps context usage bounded.

Failure Mode Safety

50/100 · 15%

No instruction for what happens if the model can't extract anything useful — it may still produce partial output with fabricated fields.

Critical issues

No handling for unreadable or non-PDF invoices

high

Add: 'If the document cannot be parsed (corrupted, scanned image with no text layer, or password-protected), return {status: "unreadable", reason: "TYPE"} and stop without fabricating data.'

No duplicate detection across invoice numbers

high

Add: 'Before processing, check if this invoice_number has been seen before. If yes, return {status: "duplicate", existing_record: {...}} and halt processing.'

No output format defined for the final Step 5

medium

Define the exact JSON schema for the final output, including all field names, types, and null handling for missing data.

Confidence scoring is subjective without calibration

medium

Define what confidence means: 'Report high confidence only when at least two line items clearly match a known category pattern. Set confidence < 0.5 if any line item description is ambiguous.'
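
The unreadable-document and duplicate fixes both act as gates that return a status instead of continuing. A minimal sketch follows, assuming an upstream parser supplies the extracted text (or `None` on failure) and a password flag, and that previously seen invoices are kept in a dict keyed by invoice_number. The reason strings and function names are hypothetical.

```python
from typing import Optional


def readability_gate(extracted_text: Optional[str], password_protected: bool) -> Optional[dict]:
    """Return an 'unreadable' status, or None when extraction may proceed."""
    if password_protected:
        return {"status": "unreadable", "reason": "password_protected"}
    if extracted_text is None:
        return {"status": "unreadable", "reason": "corrupted"}
    if not extracted_text.strip():
        # A scanned image with no text layer typically yields empty text.
        return {"status": "unreadable", "reason": "no_text_layer"}
    return None


def duplicate_gate(invoice_number: str, seen: dict) -> Optional[dict]:
    """Return a 'duplicate' status if this invoice number is already recorded."""
    existing = seen.get(invoice_number)
    if existing is not None:
        return {"status": "duplicate", "existing_record": existing}
    return None
```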

Strengths

  • The pipeline breaks a complex task into five discrete, sequential steps with explicit stop conditions.
  • Validation step catches a specific data integrity failure (totals mismatch) before it propagates.
  • Routing step adds a business-rule gate that maps dollar amounts to approval workflows.

Recommendations

  • Add format detection and rejection for non-processable document types.
  • Add duplicate detection by invoice_number before processing starts.
  • Define an explicit JSON schema for the Step 5 final output.
  • Add a confidence calibration guide so category confidence scores are consistent.

Key takeaway

This pipeline is well-structured for its happy path, but production invoice processing encounters scanned documents, duplicates, and OCR failures constantly. Adding a 'can I even read this?' gate before Step 1 is the single highest-ROI improvement.

Slack alert escalation agent

Verdict: weak

weak · 31/100

Original input

Monitor our monitoring system for alerts. When an alert fires, figure out how serious it is and notify the right person on Slack.

Dimension breakdown

Clarity & Ambiguity

15/100 · 25%

No definition of what 'alerts' look like, what format they arrive in, or what 'serious' means. No Slack message template specified.

Context Completeness

20/100 · 20%

No on-call schedule, no severity tiers, no escalation path, no SLA definitions, no list of stakeholders.

Automation Robustness

10/100 · 25%

No polling interval, no handling for Slack API failures, no handling for duplicate alerts, no handling for monitoring system downtime.

Cost Efficiency

30/100 · 15%

Open-ended monitoring loop with no token budget and no deduplication could rack up massive API costs.

Failure Mode Safety

20/100 · 15%

Automated Slack messages that 'notify the right person' without validation could spam the wrong team, miss a P0 incident, or expose sensitive alert data to the wrong channel.

Critical issues

No alert format or severity definition

high

Define: 'Alerts arrive as JSON with fields: service_name, severity (p0/p1/p2/p3), message, timestamp. Map severity to response SLA: P0 → 5 min, P1 → 30 min, P2 → 4 hours, P3 → next business day.'

No on-call or escalation path defined

high

Add: 'Use the on-call rotation at oncall.example.com. If primary is unreachable after one Slack DM and one Slack ping, escalate to secondary. After 2 failed escalations, page the engineering manager directly.'

No Slack message template or safety constraints

high

Define the Slack message format, maximum alert age before suppression, and rate-limit rules. Add: 'Never send more than 1 Slack message per alert per 5 minutes, even if the alert re-fires.'

No handling for monitoring system failures

medium

Add: 'If the monitoring system returns an error or times out after 10 seconds, send an alert to the on-call engineer with subject: MONITORING_SYSTEM_DOWN.'
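
The severity-to-SLA mapping and the five-minute rate limit are both simple enough to live in code rather than in the prompt. The sketch below is an in-memory illustration only: the class name `AlertThrottle` and the idea of keying by an alert identifier are assumptions, and it makes no calls to Slack.

```python
from typing import Optional

# Response SLA in minutes per severity, from the fix above.
# None stands for "next business day", which needs a calendar to resolve.
SLA_MINUTES = {"p0": 5, "p1": 30, "p2": 240, "p3": None}
MIN_INTERVAL_SECONDS = 5 * 60


class AlertThrottle:
    """Allow at most one Slack message per alert key per five minutes."""

    def __init__(self) -> None:
        self._last_sent = {}

    def should_send(self, alert_key: str, now: float) -> bool:
        last: Optional[float] = self._last_sent.get(alert_key)
        if last is not None and now - last < MIN_INTERVAL_SECONDS:
            return False  # re-fire inside the window: suppress
        self._last_sent[alert_key] = now
        return True
```

Passing `now` in explicitly (for example from `time.time()`) keeps the throttle easy to test without waiting on a real clock.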

Strengths

  • The intent is clear: something fires, something gets notified.

Recommendations

  • Define alert format, severity tiers, and SLA mapping before anything else.
  • Build the on-call and escalation matrix first — it's the load-bearing component.
  • Add a Slack message template with required fields, character limits, and retry/backoff logic.
  • Add a monitoring-system health check so the agent detects and reports when its alert source is down.

Key takeaway

This is the most common automation failure mode: a vague intent wrapped in confident language. 'Figure out how serious it is' sounds like AI but provides zero actual logic. Every term needs a concrete definition before this can run unattended.

Email outreach personalization pipeline

Verdict: usable

usable · 52/100

Original input

You are a sales assistant. I will give you a LinkedIn profile URL and the person's name. Generate a personalized cold email that: (1) mentions something specific from their recent activity, (2) connects it to our product, and (3) ends with a soft ask. Keep it under 150 words.

Dimension breakdown

Clarity & Ambiguity

55/100 · 25%

Structure is defined (3 parts), but 'something specific from recent activity' and 'soft ask' are undefined. No email template or tone guidance.

Context Completeness

40/100 · 20%

Role is set, but no product description, no target audience definition, no example emails, and no definition of what makes a 'good' personalization.

Automation Robustness

25/100 · 25%

No handling for: LinkedIn profile not found, no recent activity, private profile, or when the person has no clear connection to the product. Pipeline will produce output regardless of input quality.

Cost Efficiency

70/100 · 15%

Word limit of 150 helps constrain output length. Pipeline is reasonably scoped.

Failure Mode Safety

45/100 · 15%

No instruction for when to NOT send an email. Model could generate creepy or irrelevant personalizations that damage brand reputation. No quality gate before the email is 'ready to send'.

Critical issues

No handling for missing or private LinkedIn profiles

high

Add: 'If the LinkedIn profile is private, not found, or has no public recent activity, return {status: "cannot_personalize", reason: "TYPE"} instead of fabricating personalization.'

No product context provided to the model

high

Add product context block: 'Our product is [X] which helps [Y] achieve [Z]. Only connect their activity to relevant use cases.'

No definition of 'soft ask' or email structure template

medium

Define: 'A soft ask is a low-friction CTA like "mind if I share a 2-min overview?" or "worth a quick chat?". Structure: hook (their activity) → bridge (to product) → soft ask → sign-off.'

No quality gate before marking email 'ready'

medium

Add: 'Before finalizing, check: (1) Is the personalization factually grounded in their profile? (2) Is the connection to our product plausible? If either is NO, return {status: "low_confidence", suggestions: [...]}. Only return a complete email when both pass.'

No tone or brand voice guidelines

low

Add: 'Write in a conversational, non-salesy tone. Avoid buzzwords, superlatives, or generic compliments like "impressive background".'
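
The quality gate described above can wrap the model output before anything is marked ready. This sketch assumes the two grounding checks are answered upstream (by a reviewer or a second model pass) and passed in as booleans; the "ready" status and the function name are illustrative additions, not part of the original prompt.

```python
EMAIL_MAX_WORDS = 150  # limit from the original prompt ("under 150 words")


def quality_gate(email: str, grounded: bool, plausible: bool) -> dict:
    """Return the email only when both checks pass and it fits the word limit."""
    suggestions = []
    if not grounded:
        suggestions.append("personalization is not grounded in the profile")
    if not plausible:
        suggestions.append("connection to the product is not plausible")
    word_count = len(email.split())
    if word_count >= EMAIL_MAX_WORDS:
        suggestions.append(f"email has {word_count} words, must stay under {EMAIL_MAX_WORDS}")
    if suggestions:
        return {"status": "low_confidence", "suggestions": suggestions}
    return {"status": "ready", "email": email}  # 'ready' is a hypothetical status
```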

Strengths

  • The three-part structure (mention → connect → ask) provides a clear email skeleton.
  • Word limit of 150 prevents overly long emails that reduce response rates.
  • Role definition frames the assistant's operating mode.

Recommendations

  • Add explicit handling for edge cases (private profiles, no activity, irrelevant targets).
  • Provide product context so the model can make legitimate connections.
  • Define 'soft ask' and add a template structure for the email body.
  • Add a quality gate that validates personalization before returning output.

Key takeaway

Cold email personalization at scale fails when the model fabricates connections or produces generic fluff. The highest-risk failure mode is an email that looks personalized but actually feels creepy or irrelevant. Adding a 'should I even send this?' gate and grounding checks would move this from 'usable' to 'strong'.

SEO content generation pipeline

Verdict: weak

weak · 24/100

Original input

Write a 1500-word blog post about [TOPIC]. Include an introduction, 3 main sections with subheadings, and a conclusion. Make it SEO-friendly and engaging for readers.

Dimension breakdown

Clarity & Ambiguity

20/100 · 25%

'SEO-friendly' and 'engaging' are undefined. No keyword targets, no audience definition, no tone guidance, no example structure.

Context Completeness

10/100 · 20%

No role, no target audience, no competitor context, no brand voice guidelines, no success criteria for what makes the post 'good'.

Automation Robustness

15/100 · 25%

No handling for: topics that are too narrow/broad, topics requiring current data, topics with legal/medical sensitivity, or word count constraints that conflict with topic depth.

Cost Efficiency

50/100 · 15%

1500 words is expensive but explicit. No token budget for research or iteration phases.

Failure Mode Safety

25/100 · 15%

No instruction to verify facts, cite sources, avoid plagiarism, or handle sensitive topics. Model could generate harmful, outdated, or hallucinated content.

Critical issues

'SEO-friendly' is undefined — what does success look like?

high

Add: 'Target keyword: [KEYWORD] with 1-2% density. Include the keyword in: title, H1, first paragraph, at least one H2, and conclusion. Use related terms: [TERM1, TERM2, TERM3].'

No target audience or reader persona defined

high

Add: 'Target audience: [PERSONA] who are trying to [GOAL]. Tone: [FORMAL/CONVERSATIONAL/TECHNICAL]. Reading level: [GRADE/EXPERT].'

No fact-checking or source requirements

high

Add: 'Do not invent statistics or quotes. If citing data, use only information provided in the source material. Flag any claims that need verification.'

No plagiarism or originality guidance

medium

Add: 'Write original content. Do not copy phrases verbatim from source material. Paraphrase and synthesize information.'

No handling for sensitive or regulated topics

medium

Add: 'If the topic involves medical, legal, or financial advice, include a disclaimer and recommend consulting a professional.'
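
The 1-2% density target in the first fix is checkable after generation. The sketch below assumes density means keyword occurrences divided by total words, expressed in percent; SEO tools define this differently, so treat the formula and the helper names as illustrative.

```python
import re

WORD = re.compile(r"[a-z0-9']+")


def keyword_density(text: str, keyword: str) -> float:
    """Occurrences of `keyword` per 100 words of `text` (assumed definition)."""
    words = WORD.findall(text.lower())
    key = WORD.findall(keyword.lower())
    if not words or not key:
        return 0.0
    # Slide a window the length of the keyword phrase across the word list.
    hits = sum(
        words[i:i + len(key)] == key
        for i in range(len(words) - len(key) + 1)
    )
    return 100.0 * hits / len(words)


def density_in_range(text: str, keyword: str, low: float = 1.0, high: float = 2.0) -> bool:
    """Check the draft against the 1-2% target from the fix above."""
    return low <= keyword_density(text, keyword) <= high
```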

Strengths

  • Word count is explicit, which helps with cost estimation.
  • Basic structure (intro, 3 sections, conclusion) provides a minimal skeleton.

Recommendations

  • Define SEO success criteria: target keyword, related terms, meta description length, internal linking strategy.
  • Add target audience persona, brand voice guidelines, and reading level.
  • Add fact-checking, plagiarism avoidance, and source citation rules.
  • Add handling for sensitive topics and current events that may require human review.

Key takeaway

SEO content generation is one of the most common AI use cases, but prompts like this produce generic, potentially plagiarized, and SEO-ineffective output. The gap between 'write a blog post' and 'produce rankable content' is massive. Adding keyword targets, audience definition, and quality gates turns this from a content mill into a legitimate content production tool.

Evaluate your own prompts and workflows

Run the same five-dimension analysis on any prompt, pipeline, or automation. Get a weighted score, critical issues, and concrete fixes in seconds.

Best flow: run one weak prompt and one strong prompt through the live evaluator, compare what changes in the scorecard, then upgrade only if prompt QA is becoming a recurring workflow.