Prompt Evaluation Benchmarks

Real teardowns showing how prompts and workflows score across five dimensions. Each benchmark includes the original input, dimension scores, critical issues, and concrete fixes.

3 weak · 4 usable · 1 strong · 4 workflow benchmarks

Start with the live comparison

The fastest proof on this page sits above the teardown archive: run the weak research prompt here, compare it against the strongest workflow demo on the same page, and choose a paid path only after you see the score gap.

Choose the next step after the score

Keep the proof visible while you decide what to do next

Once the weak-vs-strong difference is clear above, use the same-page chooser below to rerun the strongest demo, replace the canned text with your own prompt, take the lower-friction paid step, or jump straight to recurring QA.

Best first proof

Run the weakest prompt live

Watch a low-context research prompt score badly so the evaluator’s scoring model becomes obvious fast.

Run weak demo →

Best comparison path

Run the strongest workflow demo

Compare a structured meeting-notes workflow against the weak examples and see what changes in the scorecard.

Run strong demo →

Best for real self-test

Replace the demo with your own prompt

Use the weak result as a starting point, then rerun the evaluator on the same page with your own prompt instead of only watching canned examples.

Replace the demo with my own prompt →

Cheapest paid start

Start with the $19 prompt pack

If the weak benchmark already proves the problem, take the lower-friction one-time paid step instead of forcing a subscription decision first.

Start with the Founder AI Prompt Pack →

Best for recurring QA

Start with Pro

If prompt reviews are already part of an active workflow, skip the long read and go directly to the recurring paid path.

Compare free vs Pro →

Vague research summarizer

Verdict: weak

weak · 28/100

Original input

Summarize this article in 3 bullet points.

Dimension breakdown

Clarity & Ambiguity

15/100 · 25%

Single vague instruction with no output shape or role definition.

Context Completeness

5/100 · 20%

No role, no audience, no examples, no success criteria.

Automation Robustness

10/100 · 25%

No conditionals, no validation, no boundaries for edge cases.

Cost Efficiency

70/100 · 15%

Short prompt and constrained output length.

Failure Mode Safety

35/100 · 15%

No anti-hallucination guidance or uncertainty handling.
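
The five weights sum to 100%. As a rough sketch, a plain weighted mean of the dimension scores could be computed as below. The function name and dictionary keys are illustrative, and the evaluator's actual formula is not documented on this page: this calculation gives 23.0 for the prompt above rather than the displayed 28/100, so the live score presumably includes adjustments that the breakdown does not show.

```python
# Dimension scores and weights (in percent) copied from the breakdown above.
# The key names are illustrative, not the evaluator's own identifiers.
DIMENSIONS = {
    "clarity_ambiguity": (15, 25),
    "context_completeness": (5, 20),
    "automation_robustness": (10, 25),
    "cost_efficiency": (70, 15),
    "failure_mode_safety": (35, 15),
}


def weighted_mean(dimensions: dict[str, tuple[int, int]]) -> float:
    """Plain weighted mean; assumes the weights are percentages summing to 100."""
    total_weight = sum(weight for _, weight in dimensions.values())
    if total_weight != 100:
        raise ValueError(f"weights sum to {total_weight}, expected 100")
    return sum(score * weight for score, weight in dimensions.values()) / 100


print(weighted_mean(DIMENSIONS))  # 23.0
```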

Critical issues

No explicit output format specified

high

Define the bullet structure: each bullet should start with a category label, contain one key finding, and stay under 30 words.

No role or operating context

medium

Set the model's frame: 'You are a research analyst producing executive summaries for a non-technical audience.'

No safety or boundary instructions

medium

Add: 'Do not add information not present in the article. Flag if the article is too short to summarize meaningfully.'

Prompt is probably too short for dependable automation

high

Expand to include context, expected output shape, edge-case handling, and quality constraints.
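
Once the bullet structure is defined, the output can be checked mechanically before it enters a pipeline. The sketch below assumes a "- Label: finding" syntax for the category label; the fix above only asks for a label at the start of each bullet, so the exact pattern and the function name are hypothetical.

```python
import re

# Hypothetical convention: "- Label: finding text". The exact label syntax
# is an assumption; the fix only requires that a category label comes first.
BULLET_PATTERN = re.compile(r"^[-•*]\s+(?P<label>[A-Za-z][\w &/-]*):\s+(?P<body>.+)$")
MAX_WORDS = 30  # "stay under 30 words"
EXPECTED_BULLETS = 3  # "in 3 bullet points"


def check_summary(text: str) -> list[str]:
    """Return a list of format problems; an empty list means the output passes."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    problems = []
    if len(lines) != EXPECTED_BULLETS:
        problems.append(f"expected {EXPECTED_BULLETS} bullets, got {len(lines)}")
    for number, line in enumerate(lines, start=1):
        match = BULLET_PATTERN.match(line)
        if match is None:
            problems.append(f"bullet {number}: missing category label")
        elif len(match.group("body").split()) >= MAX_WORDS:
            problems.append(f"bullet {number}: {MAX_WORDS} words or more")
    return problems
```

A non-empty result is a reason to retry or reject the response rather than pass it downstream.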

Strengths

• The prompt has enough material to analyze and iterate on.

Recommendations

• Improve clarity & ambiguity by defining the output shape and the model's role.
• Improve context completeness by stating the audience, the success criteria, and at least one example.
• Improve automation robustness by adding validation rules and edge-case handling.

Key takeaway

A 7-word prompt looks efficient but guarantees inconsistent output at scale. Adding role, format, and guardrails takes it from 'hope for the best' to a repeatable process.

Run the weak research demo → · Start with the $19 prompt pack → · Compare free vs Pro →

Structured meeting notes extractor

Verdict: strong

strong · 82/100

Original input

You are an operations assistant for a small agency. Review the following client meeting notes and return: (1) a concise summary, (2) decisions made, (3) action items with owner, deadline, and priority, and (4) follow-up questions. Use markdown sections exactly in that order. If an owner or deadline is missing, return null instead of guessing. Do not invent facts that are not in the notes. If the notes are ambiguous, flag the uncertainty clearly before the relevant item.

Dimension breakdown

Clarity & Ambiguity

85/100 · 25%

Prompt includes explicit output structure and markdown section ordering.

Context Completeness

75/100 · 20%

Sets a role and operating frame, defines success criteria.

Automation Robustness

80/100 · 25%

Handles missing data (null returns) and ambiguous input (uncertainty flagging).

Cost Efficiency

65/100 · 15%

Reasonably compact for the task depth required.

Failure Mode Safety

90/100 · 15%

Explicit anti-hallucination instruction and uncertainty escalation.

Strengths

• Automation Robustness is reasonably strong.
• Failure Mode Safety is reasonably strong.
• Clarity & Ambiguity is reasonably strong.

Recommendations

• This prompt is in good shape. Next step: test it on 3-5 representative real inputs.

Key takeaway

This prompt scores 'strong' because it defines the output shape, handles missing data explicitly, and guards against hallucination. The one gap: no explicit length constraint on each section, which can cause cost drift on long meetings.

Run the strong meeting-notes demo → · Compare free vs Pro →

Unbound customer support triage

Verdict: usable

usable · 58/100

Original input

You are a customer support agent. Read the customer message and classify it as one of: billing_issue, technical_bug, feature_request, or account_question. Then write a reply that acknowledges the problem, gives the most likely fix, and tells the customer what happens next. Be friendly but concise.

Dimension breakdown

Clarity & Ambiguity

70/100 · 25%

Classification categories are explicit, but reply format is unspecified.

Context Completeness

55/100 · 20%

Role is set, but no examples, no audience context, and limited success criteria.

Automation Robustness

35/100 · 25%

No handling for messages that don't fit any category, or for messages with multiple issues.

Cost Efficiency

60/100 · 15%

'Concise' is subjective — no word limit or section structure for the reply.

Failure Mode Safety

50/100 · 15%

No instruction for when the model is unsure about classification, or when 'most likely fix' could be harmful.

Critical issues

No fallback for unclassifiable messages

high

Add a fifth category like 'other' or 'needs_human' and instruct the model to use it when confidence is low.

No safety or boundary instructions

medium

Add: 'Never suggest actions that could damage the customer's account. If unsure about a technical fix, recommend contacting support instead.'

No explicit output format specified

medium

Define the reply template: greeting → classification → fix → next steps, with max lengths per section.
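
The fallback category can also be enforced outside the prompt. A minimal sketch, assuming the model returns a label plus a confidence value (the original prompt asks for neither a confidence nor a fallback, so the threshold and the names below are hypothetical):

```python
from dataclasses import dataclass

CATEGORIES = {"billing_issue", "technical_bug", "feature_request", "account_question"}
FALLBACK = "needs_human"  # fifth category suggested in the fix above
MIN_CONFIDENCE = 0.7  # hypothetical threshold; tune it against real tickets


@dataclass
class Triage:
    category: str
    escalate: bool


def resolve_category(label: str, confidence: float) -> Triage:
    """Map a raw model label to a known category, or escalate to a human."""
    normalized = label.strip().lower()
    if normalized not in CATEGORIES or confidence < MIN_CONFIDENCE:
        # Unknown label or low confidence: never guess, route to a person.
        return Triage(category=FALLBACK, escalate=True)
    return Triage(category=normalized, escalate=False)
```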

Strengths

• Clarity & Ambiguity is reasonably strong.

Recommendations

• Improve automation robustness by adding conditionals for edge cases and multi-issue messages.
• Improve failure mode safety by adding constraints on what the model should not do or say.

Key takeaway

Classification + freeform reply is a common pattern, but leaving the reply unbounded means every output will be shaped differently. Adding a reply template and an 'uncertain → escalate' rule takes this from 'usable' to 'strong'.

Run the support-triage demo → · Compare free vs Pro →

Multi-step agent workflow

Verdict: usable

usable · 55/100

Original input

You are a research agent. Find information about the company I specify. First, search the web for the company name and collect the top 5 results. Then, for each result, extract key facts about the company: founding year, industry, revenue if available, and number of employees. Finally, compile all the facts into a summary report.

Dimension breakdown

Clarity & Ambiguity

60/100 · 25%

Multi-step structure is defined, but 'key facts' and 'summary report' are vague — no output format, no section ordering, no length constraints.

Context Completeness

50/100 · 20%

Role is set, but no audience, no use case, no success criteria for what makes a 'good' summary.

Automation Robustness

30/100 · 25%

No handling for: company not found, conflicting facts across sources, missing data fields, or search returning no results. Each step can fail silently.

Cost Efficiency

40/100 · 15%

'Collect the top 5 results' and extract facts from each is open-ended — no token ceiling per step or per result.

Failure Mode Safety

45/100 · 15%

No instruction to flag uncertain data, verify facts across sources, or handle contradictory information. Agent could present unverified data as fact.

Critical issues

No error handling for failed search steps

high

Add: 'If the search returns no results, return { "status": "no_results", "company": "[name]" } and stop. Do not fabricate company information.'

No cross-source verification instruction

high

Add: 'If two sources contradict on a fact, flag the conflict and include both values with source attribution instead of picking one.'

No output format for the summary report

medium

Define the report template: Company Overview → Key Facts Table → Source Attribution → Confidence Notes. Use markdown with specific section headers.

'Key facts' is undefined — which facts matter?

medium

Replace 'key facts' with an explicit list: 'founding_year, industry, headquarters_city, latest_revenue, employee_count, ceo_name'. Mark any unavailable field as 'data_unavailable' instead of skipping it.

No token or length budget per step

low

Add per-step limits: 'Extract no more than 50 words per result. The final summary must not exceed 500 words.'
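
The no-results, conflict, and missing-field rules above can be sketched as a single compile step. The input shape (a list of dicts with "source" and "facts" keys), the "ok" status, and the "conflict" marker are assumptions for illustration; only the field list, "no_results", and "data_unavailable" come from the fixes above.

```python
FIELDS = ["founding_year", "industry", "headquarters_city",
          "latest_revenue", "employee_count", "ceo_name"]


def compile_report(company: str, results: list[dict]) -> dict:
    """Merge per-source facts; each result looks like {"source": ..., "facts": {...}}."""
    if not results:
        return {"status": "no_results", "company": company}
    report = {"status": "ok", "company": company, "facts": {}, "conflicts": {}}
    for field in FIELDS:
        # Collect every (value, source) pair in which this field was found.
        found = [(r["facts"][field], r["source"]) for r in results
                 if r.get("facts", {}).get(field) is not None]
        values = {value for value, _ in found}  # values must be hashable
        if not found:
            report["facts"][field] = "data_unavailable"
        elif len(values) == 1:
            report["facts"][field] = found[0][0]
        else:
            # Keep every value with its source instead of picking one.
            report["facts"][field] = "conflict"
            report["conflicts"][field] = [{"value": v, "source": s} for v, s in found]
    return report
```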

Strengths

• The multi-step structure (search → extract → compile) is explicitly defined.
• A role is set, which frames the agent's operating mode.
• The prompt decomposes a complex task into sequential steps, which is better than a monolithic request.

Recommendations

• Add error handling and stop conditions for each step (search fails, no results, conflicting data).
• Replace vague terms ('key facts', 'summary report') with explicit field lists and output templates.
• Add cross-source verification instructions to prevent presenting unverified data as fact.
• Add per-step token budgets to control cost on large result sets.

Key takeaway

Multi-step agent prompts look powerful because they decompose a task. But without error handling at each step, cross-source verification, and defined output formats, they fail silently in production. The most dangerous prompt is one that looks like it should work but has hidden gaps that only appear under messy real-world conditions.

Run the research demo → · Compare free vs Pro →

Automated invoice processing pipeline

Verdict: usable

usable · 61/100

Original input

INVOICE PROCESSING WORKFLOW

Step 1 — Extract: Read the uploaded invoice document and extract: vendor_name, invoice_number, invoice_date, line_items (array of {description, quantity, unit_price, total}), and total_amount.

Step 2 — Validate: Check that extracted totals match. If line_item totals don't sum to the stated total_amount, flag the discrepancy as {status: "mismatch", delta: NUMBER}. If any required field is missing, return {status: "missing_field", field: "NAME"} and stop.

Step 3 — Categorize: Based on the vendor name and line item descriptions, assign a spending category from: software_subscription, professional_services, infrastructure, marketing, office_supplies. Output as {category: "CATEGORY", confidence: 0.0-1.0}.

Step 4 — Route: If total_amount > $10,000, output {needs_approval: true, approver: "CFO"}. If total_amount < $500, output {needs_approval: false, auto_approve: true}. Otherwise output {needs_approval: true, approver: "Manager"}.

Step 5 — Output: Produce a structured JSON object with all extracted and computed fields combined. Use null for any field that could not be determined.

Dimension breakdown

Clarity & Ambiguity

75/100 · 25%

Five explicit steps with named outputs and clear transitions. Steps 1 through 4 specify their expected output; the final Step 5 object has no defined schema.

Context Completeness

55/100 · 20%

Step purposes are clear, but no mention of what to do with non-PDF formats, scanned invoices, or non-English documents.

Automation Robustness

55/100 · 25%

Some error handling exists (missing fields, totals mismatch), but no handling for corrupted files, password-protected PDFs, or duplicate invoices.

Cost Efficiency

60/100 · 15%

Pipeline is reasonably scoped per step. No token budgets defined, but task complexity keeps context usage bounded.

Failure Mode Safety

50/100 · 15%

No instruction for what happens if the model can't extract anything useful — it may still produce partial output with fabricated fields.

Critical issues

No handling for unreadable or non-PDF invoices

high

Add: 'If the document cannot be parsed (corrupted, scanned image with no text layer, or password-protected), return {status: "unreadable", reason: "TYPE"} and stop without fabricating data.'

No duplicate detection across invoice numbers

high

Add: 'Before processing, check if this invoice_number has been seen before. If yes, return {status: "duplicate", existing_record: {...}} and halt processing.'

No output format defined for the final Step 5

medium

Define the exact JSON schema for the final output, including all field names, types, and null handling for missing data.

Confidence scoring is subjective without calibration

medium

Define what confidence means: 'Report high confidence only when at least two line items clearly match a known category pattern. Set confidence < 0.5 if any line item description is ambiguous.'
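
The unreadable-document and duplicate checks can run as a gate before Step 1. This is a sketch under stated assumptions: an upstream extraction step supplies the text (or None) and a failure reason, the invoice number is available from a cheap first pass, and "no_text_layer" is a hypothetical default reason code.

```python
from typing import Optional


def preflight(extracted_text: Optional[str], failure_reason: Optional[str],
              invoice_number: Optional[str],
              seen_invoices: dict[str, dict]) -> Optional[dict]:
    """Return a stop-status dict, or None when the invoice may proceed to Step 1."""
    if not extracted_text or not extracted_text.strip():
        # No usable text: stop instead of letting the model guess fields.
        return {"status": "unreadable", "reason": failure_reason or "no_text_layer"}
    if invoice_number is not None and invoice_number in seen_invoices:
        return {"status": "duplicate", "existing_record": seen_invoices[invoice_number]}
    return None
```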

Strengths

• The pipeline breaks a complex task into five discrete, sequential steps with explicit stop conditions.
• Validation step catches a specific data integrity failure (totals mismatch) before it propagates.
• Routing step adds a business-rule gate that maps dollar amounts to approval workflows.
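
Steps 2 and 4 are deterministic, so they can also be checked in code rather than left to the model. A minimal sketch: the "ok" status and the sign of delta (stated total minus summed line items) are assumptions, since the workflow defines neither.

```python
from decimal import Decimal


def validate_totals(line_items: list[dict], total_amount: Decimal) -> dict:
    """Step 2: compare the sum of line-item totals with the stated total."""
    summed = sum((item["total"] for item in line_items), Decimal("0"))
    delta = total_amount - summed  # sign convention is an assumption
    if delta != 0:
        return {"status": "mismatch", "delta": delta}
    return {"status": "ok"}


def route(total_amount: Decimal) -> dict:
    """Step 4: both thresholds are strict, so exactly $500 or $10,000 goes to the Manager."""
    if total_amount > Decimal("10000"):
        return {"needs_approval": True, "approver": "CFO"}
    if total_amount < Decimal("500"):
        return {"needs_approval": False, "auto_approve": True}
    return {"needs_approval": True, "approver": "Manager"}
```

Decimal avoids the rounding errors that binary floats would introduce when summing currency amounts.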

Recommendations

• Add format detection and rejection for non-processable document types.
• Add duplicate detection by invoice_number before processing starts.
• Define an explicit JSON schema for the Step 5 final output.
• Add a confidence calibration guide so category confidence scores are consistent.

Key takeaway

This pipeline is well-structured for its happy path, but production invoice processing encounters scanned documents, duplicates, and OCR failures constantly. Adding a 'can I even read this?' gate before Step 1 is the single highest-ROI improvement.

Run the structured workflow demo → · Compare free vs Pro →

Slack alert escalation agent

Verdict: weak

weak · 31/100

Original input

Monitor our monitoring system for alerts. When an alert fires, figure out how serious it is and notify the right person on Slack.

Dimension breakdown

Clarity & Ambiguity

15/100 · 25%

No definition of what 'alerts' look like, what format they arrive in, or what 'serious' means. No Slack message template specified.

Context Completeness

20/100 · 20%

No on-call schedule, no severity tiers, no escalation path, no SLA definitions, no list of stakeholders.

Automation Robustness

10/100 · 25%

No polling interval, no handling for Slack API failures, no handling for duplicate alerts, no handling for monitoring system downtime.

Cost Efficiency

30/100 · 15%

Open-ended monitoring loop with no token budget and no deduplication could rack up massive API costs.

Failure Mode Safety

20/100 · 15%

Automated Slack messages that 'notify the right person' without validation could spam the wrong team, miss a P0 incident, or expose sensitive alert data to the wrong channel.

Critical issues

No alert format or severity definition

high

Define: 'Alerts arrive as JSON with fields: service_name, severity (P0/P1/P2/P3), message, timestamp. Map severity to response SLA: P0 → 5 min, P1 → 30 min, P2 → 4 hours, P3 → next business day.'

No on-call or escalation path defined

high

Add: 'Use the on-call rotation at oncall.example.com. If primary is unreachable after one Slack DM and one Slack ping, escalate to secondary. After 2 failed escalations, page the engineering manager directly.'

No Slack message template or safety constraints

high

Define the Slack message format, maximum alert age before suppression, and rate-limit rules. Add: 'Never send more than 1 Slack message per alert per 5 minutes, even if the alert re-fires.'

No handling for monitoring system failures

medium

Add: 'If the monitoring system returns an error or times out after 10 seconds, send an alert to the on-call engineer with subject: MONITORING_SYSTEM_DOWN.'
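
The severity mapping and the rate-limit rule are simple enough to enforce in code around the agent. A sketch, assuming an alert key such as "service_name:message" identifies a re-firing alert (the key format, class name, and function name are hypothetical):

```python
from datetime import datetime, timedelta
from typing import Optional

# Severity tiers and response SLAs from the fix above.
SLA = {
    "P0": timedelta(minutes=5),
    "P1": timedelta(minutes=30),
    "P2": timedelta(hours=4),
    "P3": None,  # next business day; calendar logic is out of scope here
}
MIN_INTERVAL = timedelta(minutes=5)  # at most 1 message per alert per 5 minutes


def sla_for(severity: str) -> Optional[timedelta]:
    """Look up the response SLA; accepts either 'p0' or 'P0'."""
    return SLA[severity.upper()]


class AlertThrottle:
    """Remember when each alert was last announced and suppress rapid re-fires."""

    def __init__(self) -> None:
        self._last_sent: dict[str, datetime] = {}

    def should_send(self, alert_key: str, now: datetime) -> bool:
        last = self._last_sent.get(alert_key)
        if last is not None and now - last < MIN_INTERVAL:
            return False
        self._last_sent[alert_key] = now
        return True
```

Passing the current time in as an argument keeps the throttle testable without waiting five minutes.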

Strengths

• The intent is clear: something fires, something gets notified.

Recommendations

• Define alert format, severity tiers, and SLA mapping before anything else.
• Build the on-call and escalation matrix first — it's the load-bearing component.
• Add a Slack message template with required fields, character limits, and retry/backoff logic.
• Add a monitoring-system health check so that an outage in monitoring itself raises an alert instead of going unnoticed.

Key takeaway

This is the most common automation failure mode: a vague intent wrapped in confident language. 'Figure out how serious it is' sounds like AI but provides zero actual logic. Every term needs a concrete definition before this can run unattended.

Run the weak-agent demo → · Start with the $19 prompt pack → · Compare free vs Pro →

Email outreach personalization pipeline

Verdict: usable

usable · 52/100

Original input

You are a sales assistant. I will give you a LinkedIn profile URL and the person's name. Generate a personalized cold email that: (1) mentions something specific from their recent activity, (2) connects it to our product, and (3) ends with a soft ask. Keep it under 150 words.

Dimension breakdown

Clarity & Ambiguity

55/100 · 25%

Structure is defined (3 parts), but 'something specific from recent activity' and 'soft ask' are undefined. No email template or tone guidance.

Context Completeness

40/100 · 20%

Role is set, but no product description, no target audience definition, no example emails, and no definition of what makes a 'good' personalization.

Automation Robustness

25/100 · 25%

No handling for: LinkedIn profile not found, no recent activity, private profile, or when the person has no clear connection to the product. Pipeline will produce output regardless of input quality.

Cost Efficiency

70/100 · 15%

Word limit of 150 helps constrain output length. Pipeline is reasonably scoped.

Failure Mode Safety

45/100 · 15%

No instruction for when to NOT send an email. Model could generate creepy or irrelevant personalizations that damage brand reputation. No quality gate before the email is 'ready to send'.

Critical issues

No handling for missing or private LinkedIn profiles

high

Add: 'If the LinkedIn profile is private, not found, or has no public recent activity, return {status: "cannot_personalize", reason: "TYPE"} instead of fabricating personalization.'

No product context provided to the model

high

Add product context block: 'Our product is [X] which helps [Y] achieve [Z]. Only connect their activity to relevant use cases.'

No definition of 'soft ask' or email structure template

medium

Define: 'A soft ask is a low-friction CTA like "mind if I share a 2-min overview?" or "worth a quick chat?". Structure: hook (their activity) → bridge (to product) → soft ask → sign-off.'

No quality gate before marking email 'ready'

medium

Add: 'Before finalizing, check: (1) Is the personalization factually grounded in their profile? (2) Is the connection to our product plausible? If either is NO, return {status: "low_confidence", suggestions: [...]}. Only return a complete email when both pass.'

No tone or brand voice guidelines

low

Add: 'Write in a conversational, non-salesy tone. Avoid buzzwords, superlatives, or generic compliments like "impressive background".'
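
The profile check and the quality gate can be combined into one decision function that runs before any email is marked ready. A sketch under assumptions: the grounded and plausible flags come from a separate review step, and the "ready" status, the reason codes, and the suggestion wording are hypothetical.

```python
MAX_WORDS = 150  # "Keep it under 150 words" from the original prompt


def gate_email(draft: str, profile_status: str, grounded: bool, plausible: bool) -> dict:
    """Decide whether a draft may be returned or must be held back."""
    if profile_status != "ok":
        # e.g. "private", "not_found", "no_recent_activity" (hypothetical codes)
        return {"status": "cannot_personalize", "reason": profile_status}
    suggestions = []
    if not grounded:
        suggestions.append("cite an activity that appears in the profile")
    if not plausible:
        suggestions.append("drop or rework the product connection")
    if len(draft.split()) >= MAX_WORDS:
        suggestions.append(f"shorten to under {MAX_WORDS} words")
    if suggestions:
        return {"status": "low_confidence", "suggestions": suggestions}
    return {"status": "ready", "email": draft}
```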

Strengths

• The three-part structure (mention → connect → ask) provides a clear email skeleton.
• Word limit of 150 prevents overly long emails that reduce response rates.
• Role definition frames the assistant's operating mode.

Recommendations

• Add explicit handling for edge cases (private profiles, no activity, irrelevant targets).
• Provide product context so the model can make legitimate connections.
• Define 'soft ask' and add a template structure for the email body.
• Add a quality gate that validates personalization before returning output.

Key takeaway

Cold email personalization at scale fails when the model fabricates connections or produces generic fluff. The highest-risk failure mode is an email that looks personalized but actually feels creepy or irrelevant. Adding a 'should I even send this?' gate and grounding checks would move this from 'usable' to 'strong'.

Run the weak outreach demo → · Compare free vs Pro →

Evaluate your own prompts and workflows

Run the same five-dimension analysis on any prompt, pipeline, or automation. Get a weighted score, critical issues, and concrete fixes in seconds.

Run the weak-prompt demo → · Run the strong-prompt demo → · Get Pro after testing →

Best flow: run one weak prompt and one strong prompt through the live evaluator, compare what changes in the scorecard, then upgrade only if prompt QA is becoming a recurring workflow.
