FIELD REPORT · AI

Agentic Workflow Design Patterns: When Agents Beat Simple Prompts

A practical decision framework for the five canonical agent patterns (prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer) with cost, latency, and use-case tradeoffs.

PUBLISHED
April 24, 2026
READ TIME
11 MIN
AUTHOR
ONE FREQUENCY

The most common architecture mistake in AI engineering today is reaching for an agent when a single LLM call would do. Agents are not free. They add latency, multiply token spend, complicate debugging, and introduce failure modes (loops, tool errors, runaway costs) that one-shot prompts simply do not have.

Anthropic's "Building Effective Agents" piece (published late 2024 and still the cleanest taxonomy out there) makes the point bluntly: most production AI features should be workflows of LLM calls, not agents. This post walks through the five canonical patterns from that taxonomy with concrete worked examples, cost and latency tradeoffs, and a decision rule for each.

The agent overhead tax

Before we get to the patterns, an honest accounting of what agentic systems cost you over a single-prompt baseline:

| Dimension | Single LLM call | Agentic workflow |
|-----------|-----------------|------------------|
| Tokens | 1x | 3x to 20x |
| Latency p50 | 1 to 3 seconds | 8 to 60 seconds |
| Failure modes | model error, content filter | tool error, loop, budget exhaustion, mid-trajectory hallucination, state corruption |
| Debuggability | single trace | multi-span trace with state diffs |
| Eval surface | prompt + output | prompt + every step + final output + tool calls |

Pay this tax only when the task genuinely needs it. The five patterns below are the legitimate uses.

Pattern 1: Prompt chaining

Description: A fixed sequence of LLM calls where each call's output feeds the next. No dynamic branching, no tool use, just deterministic stages.

When to use: The task decomposes cleanly into sub-steps and the intermediate outputs are short enough that running them in one prompt would hit quality issues but separating them keeps each prompt focused.

When not to use: The task is small enough that a single well-structured prompt with chain-of-thought instructions and few-shot examples performs equivalently. Test the one-prompt version first.

Worked example: Marketing brief to LinkedIn post.

  1. Call 1: extract 3 to 5 talking points from the brief
  2. Call 2: rewrite each talking point in a punchy first-person voice
  3. Call 3: assemble into a 200-word post with hook, body, and CTA
  4. Optional gate between Call 2 and Call 3: a programmatic check that each rewritten point is under 30 words. If not, retry Call 2 once.
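
A minimal sketch of this chain, assuming the same Anthropic Python SDK used in the parallelization example later in this post; the helper names (`run_step`, `brief_to_post`) and the exact stage prompts are illustrative, not prescribed:

```python
from anthropic import Anthropic

client = Anthropic()

def run_step(system_prompt, user_text):
    # One focused call per stage; each stage gets its own narrow system prompt.
    msg = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1000,
        system=system_prompt,
        messages=[{"role": "user", "content": user_text}],
    )
    return msg.content[0].text

def brief_to_post(brief):
    points = run_step("Extract 3 to 5 talking points from this brief, one per line.", brief)
    punchy = run_step("Rewrite each talking point in a punchy first-person voice, one per line.", points)

    # Programmatic gate between Call 2 and Call 3: every point under 30 words, retry once if not.
    if any(len(line.split()) >= 30 for line in punchy.splitlines() if line.strip()):
        punchy = run_step(
            "Rewrite each talking point in a punchy first-person voice, one per line, under 30 words each.",
            points,
        )

    return run_step("Assemble these points into a ~200-word LinkedIn post with a hook, body, and CTA.", punchy)
```

The gate is deliberately programmatic: a word count needs no LLM call and fails deterministically.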

Cost implication: roughly 3x the tokens of the single-prompt baseline, and roughly 3x the latency since the calls run sequentially. Quality is usually meaningfully better because each stage stays focused.

Pattern 2: Routing

Description: A classifier LLM (or cheap deterministic classifier) routes input to one of N specialized downstream prompts or models.

When to use: Input categories have meaningfully different prompts, models, or tool sets. Common in customer support (technical vs billing vs sales), code review (security vs style vs correctness), and multi-tenant SaaS where each tenant has a custom system prompt.

When not to use: All paths share 80 percent of the same prompt. Routing adds latency without quality gain. Use a single prompt with conditional sections instead.

Worked example: Customer support triage.

  1. Cheap classifier (Claude Haiku 4 or GPT-4o-mini, temperature 0) labels input as billing, technical, account, or other
  2. Each label maps to a specialized agent with its own system prompt, tools, and escalation policy
  3. Billing routes to an agent with read access to Stripe; technical routes to one with read access to the support KB and product telemetry; account routes to one with Entra ID lookup tools
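
A sketch of the routing layer, again using the Anthropic SDK; the classifier model id is a placeholder, and the specialist system prompts stand in for the per-category agents (with their own tools and escalation policies) described above:

```python
from anthropic import Anthropic

client = Anthropic()
CLASSIFIER_MODEL = "claude-haiku-4"   # placeholder id; substitute your cheap classifier model
SPECIALIST_MODEL = "claude-opus-4-5"

LABELS = ("billing", "technical", "account", "other")

# Each category gets its own system prompt; in a real system, also its own tools and escalation policy.
SPECIALIST_PROMPTS = {
    "billing": "You are a billing support specialist. ...",
    "technical": "You are a technical support specialist. ...",
    "account": "You are an account support specialist. ...",
    "other": "You are a general support agent. ...",
}

def classify(ticket_text):
    # Temperature 0, one-word answer: the route is as deterministic as an LLM call gets.
    msg = client.messages.create(
        model=CLASSIFIER_MODEL,
        max_tokens=5,
        temperature=0,
        system=f"Label this support ticket as exactly one of: {', '.join(LABELS)}. Reply with the label only.",
        messages=[{"role": "user", "content": ticket_text}],
    )
    label = msg.content[0].text.strip().lower()
    return label if label in LABELS else "other"

def triage(ticket_text):
    label = classify(ticket_text)
    msg = client.messages.create(
        model=SPECIALIST_MODEL,
        max_tokens=1000,
        system=SPECIALIST_PROMPTS[label],
        messages=[{"role": "user", "content": ticket_text}],
    )
    return label, msg.content[0].text
```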

Cost implication: One extra classifier call (~$0.0001 with Haiku). Latency increase of 200 to 500ms. Quality lift is substantial when downstream prompts are genuinely different.

Pattern 3: Parallelization

Description: Multiple LLM calls run concurrently on the same input (sectioning) or the same call is run multiple times and results are voted (voting). Results are then aggregated.

When to use:

  • Sectioning: distinct sub-questions can be answered independently. Example: a legal review where one call checks IP clauses, another checks liability, another checks termination, all on the same contract.
  • Voting: high-stakes classification where you want diversity to surface false negatives. Example: content moderation, where five parallel calls vote on whether a post violates policy.

When not to use: The sub-tasks have sequential dependencies (each needs the previous one's output). Voting is overkill for low-stakes classification.

Worked example: Contract risk extraction.

```python
import asyncio

from anthropic import AsyncAnthropic

client = AsyncAnthropic()

async def check_section(contract, section_focus):
    msg = await client.messages.create(
        model="claude-opus-4-5",
        max_tokens=2000,
        system=f"You are reviewing a contract for {section_focus} risks only. Output JSON: [{{clause, risk, severity}}].",
        messages=[{"role": "user", "content": contract}],
    )
    return section_focus, msg.content[0].text

async def parallel_review(contract):
    focuses = [
        "IP and licensing",
        "Liability and indemnification",
        "Termination and renewal",
        "Data and privacy",
        "Payment terms",
    ]
    results = await asyncio.gather(*[check_section(contract, f) for f in focuses])
    return dict(results)
```

Cost implication: 5x tokens of a single review pass. Latency stays close to a single call (parallel execution). Quality is usually higher because each call has narrower focus.

Pattern 4: Orchestrator-workers

Description: A central LLM (the orchestrator) dynamically plans sub-tasks and delegates each to a worker LLM, then synthesizes results. Unlike parallelization, the sub-tasks are determined at runtime, not pre-defined.

When to use: Task complexity is unknown until you see the input. Code generation across multiple files is the canonical example: you do not know up front which files need changes until the orchestrator reads the codebase.

When not to use: You can pre-define the sub-tasks. Then parallelization is cheaper and more predictable. Also avoid it if you cannot bound the orchestrator's loop, because an unbounded loop will burn through your budget.

Worked example: Multi-file code refactor.

  1. Orchestrator receives "Rename function `getUser` to `fetchUser` across the repo"
  2. Orchestrator calls a search tool, identifies 14 files referencing `getUser`
  3. Orchestrator spawns 14 worker calls, each handling one file's edits
  4. Orchestrator runs a final synthesis pass: read the diff summaries, identify cross-file inconsistencies, decide if any worker needs a re-run
  5. Hard stop: max 3 worker re-runs total, max 60 seconds of wall-clock budget
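
A sketch of the budget enforcement around the fan-out, with the worker stubbed out (a real worker would be an LLM call with file-read and file-edit tools) and the synthesis pass simplified to retry-on-failure; the constants mirror the hard stop in step 5:

```python
import asyncio
import time

MAX_WORKER_RERUNS = 3
WALL_CLOCK_BUDGET_S = 60

async def run_worker(file_path, instruction):
    # Stub: a real worker is an LLM call with file tools that applies `instruction`
    # to one file and reports a status plus diff summary.
    await asyncio.sleep(0.1)
    return {"file": file_path, "status": "ok", "summary": f"applied change in {file_path}"}

async def orchestrate(files, instruction):
    deadline = time.monotonic() + WALL_CLOCK_BUDGET_S
    reruns_left = MAX_WORKER_RERUNS

    # Fan out: one worker per file identified by the orchestrator's planning step.
    results = {r["file"]: r for r in await asyncio.gather(
        *[run_worker(f, instruction) for f in files])}

    # Synthesis pass (simplified): re-run only the workers that failed, inside hard caps.
    pending = [f for f, r in results.items() if r["status"] != "ok"]
    while pending and reruns_left > 0 and time.monotonic() < deadline:
        batch, pending = pending[:reruns_left], pending[reruns_left:]
        reruns_left -= len(batch)
        for r in await asyncio.gather(*[run_worker(f, instruction) for f in batch]):
            results[r["file"]] = r
            if r["status"] != "ok":
                pending.append(r["file"])

    return list(results.values())
```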

Cost implication: 5x to 20x baseline tokens depending on fan-out. Latency 10 to 30 seconds even with parallel workers. Adds a hard requirement for budget enforcement.

Pattern 5: Evaluator-optimizer

Description: One LLM generates a candidate output, a second LLM (the evaluator) critiques it against criteria, the first LLM revises. Loop until the evaluator approves or a max-iteration cap is hit.

When to use: Output quality is hard for the generator to self-assess but a separate evaluator with different framing can catch issues. Translation, technical writing, and legal drafting fit this well. The pattern works especially well when the evaluation criteria can be written as a checklist.

When not to use: The generator and evaluator share the same blind spots. Two GPT-4 instances often agree on bad output that a human would catch. Mitigate by using a different model family as the evaluator, or by including deterministic checks (linter, schema validator, fact-check tool) alongside the LLM evaluator.

Worked example: Technical documentation generation.

  1. Generator (Claude Opus 4.5) drafts a how-to from a spec
  2. Evaluator (GPT-4o, deliberately different family) scores the draft on: technical accuracy, code-block runnability, completeness, voice. Output: pass or revise + specific fixes
  3. If revise: generator gets the critique and produces v2
  4. Max 3 iterations. If still failing, route to a human.
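
A minimal sketch of the loop, assuming the Anthropic and OpenAI Python SDKs; the model ids, prompts, and JSON contract are illustrative:

```python
import json

from anthropic import Anthropic
from openai import OpenAI

generator = Anthropic()
evaluator = OpenAI()

MAX_ITERATIONS = 3
CRITERIA = "technical accuracy, code-block runnability, completeness, voice"

def generate_draft(spec, previous_draft=None, fixes=None):
    if previous_draft is None:
        prompt = spec
    else:
        prompt = (f"Spec:\n{spec}\n\nPrevious draft:\n{previous_draft}\n\n"
                  "Revise the draft to address these fixes:\n" + "\n".join(fixes))
    msg = generator.messages.create(
        model="claude-opus-4-5",
        max_tokens=2000,
        system="Draft a how-to document from the spec provided.",
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

def evaluate_draft(draft):
    # Deliberately a different model family from the generator, to reduce shared blind spots.
    resp = evaluator.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": f"Score this draft on: {CRITERIA}. "
             'Reply as JSON: {"verdict": "pass" or "revise", "fixes": ["..."]}.'},
            {"role": "user", "content": draft},
        ],
    )
    return json.loads(resp.choices[0].message.content)

def generate_docs(spec):
    draft = generate_draft(spec)
    for _ in range(MAX_ITERATIONS):
        result = evaluate_draft(draft)
        if result["verdict"] == "pass":
            return draft
        draft = generate_draft(spec, previous_draft=draft, fixes=result["fixes"])
    return None  # still failing after the cap: route to a human
```

A deterministic check (linter, schema validator) can run before `evaluate_draft` and short-circuit the loop for free.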

Cost implication: 2x to 6x tokens. Latency 2x to 4x. Quality lift is meaningful on tasks where the evaluator can find issues the generator cannot.

Picking the right pattern: a decision tree

```
Q1: Can a single prompt with good structure (role, format, examples) hit your quality bar?
    Yes -> Use a single prompt. Stop.
    No  -> Continue.

Q2: Does the task decompose into a fixed, known sequence of steps?
    Yes -> Prompt chaining.
    No  -> Continue.

Q3: Do inputs fall into heterogeneous categories that benefit from category-specific handling?
    Yes -> Routing.
    No  -> Continue.

Q4: Can the work be split into independent parallel sub-tasks known in advance?
    Yes -> Parallelization (sectioning or voting).
    No  -> Continue.

Q5: Do you need dynamic sub-task planning based on input content?
    Yes -> Orchestrator-workers (with strict budget caps).
    No  -> Continue.

Q6: Is the bottleneck quality refinement after generation?
    Yes -> Evaluator-optimizer.
    No  -> You probably need a true autonomous agent with tool use, not a workflow. Reconsider scope.
```

Combining patterns

Real production systems often compose multiple patterns. A customer support pipeline might route by category (Pattern 2), then within the technical category use prompt chaining (Pattern 1) to triage then diagnose then propose a fix, then run the proposed fix through an evaluator-optimizer (Pattern 5) before showing it to the user. This is not "agentic" in any meaningful sense; it is a well-designed workflow with three patterns stacked.

The composition rule: each layer of patterns is an explicit cost. A 4-pattern stack with average 1.5 LLM calls per pattern is 6 calls per request. At 50k input tokens per call (a reasonable size for a customer-context agent), that is 300k tokens per user request. Make sure the quality lift justifies it. Run an ablation where you remove one layer at a time and measure quality on your eval set.

A clean test: if you can remove a pattern and quality drops by less than your tolerance threshold, remove it. Most teams discover one or two layers of waste this way.

Anti-patterns to avoid

A few recurring mistakes worth calling out:

The "let the agent figure it out" trap. Engineers often default to orchestrator-workers because it feels powerful. In practice, when you can pre-define the sub-tasks (which is most of the time), prompt chaining or parallelization is cheaper, more predictable, and easier to evaluate. Reserve orchestrator-workers for genuine open-ended problems.

The evaluator that agrees with the generator. If your evaluator-optimizer loop is using the same model family as the generator, you will get rubber-stamping. Either use a different family (Claude as evaluator of GPT output or vice versa) or pair the LLM evaluator with deterministic checks. A schema validator catches output structure issues that no LLM evaluator will reliably notice.

The router that does not actually route. If 90 percent of your traffic ends up in one branch, you do not need a router; you need a single agent with a fallback for the 10 percent edge cases. Measure routing distribution before you commit to the routing pattern.

The parallel call that is secretly sequential. `asyncio.gather` only parallelizes if the underlying API supports concurrent requests at your rate limit. If you are hitting per-minute caps, the calls serialize and you pay the multi-call cost without the latency win. Confirm with a wall-clock benchmark, not theory.
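
A quick way to check, reusing `check_section` and `parallel_review` from the Pattern 3 example (the contract file path is hypothetical):

```python
import asyncio
import time

async def benchmark(contract):
    start = time.perf_counter()
    await check_section(contract, "Payment terms")   # single call, from the Pattern 3 example
    single_call = time.perf_counter() - start

    start = time.perf_counter()
    await parallel_review(contract)                  # 5-way fan-out, from the Pattern 3 example
    fan_out = time.perf_counter() - start

    # Truly parallel: fan_out stays close to single_call.
    # Serialized behind a rate limit: fan_out creeps toward 5x single_call.
    print(f"single call: {single_call:.1f}s   5-way fan-out: {fan_out:.1f}s")

asyncio.run(benchmark(open("contract.txt").read()))
```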

A note on "true" agents

Beyond these five workflow patterns sits the true autonomous agent: an LLM in a loop with tool access, deciding its own next step until task completion or budget exhaustion. Reserve this category for tasks where you genuinely cannot pre-define the decision graph. Customer-facing agentic search, autonomous code modification, and adversarial security testing fit. Most "agentic" features in product roadmaps do not. The right answer is usually one of the five workflows above with a clear topology.

Checklist before you ship any agentic pattern

  • [ ] Implemented and benchmarked the single-prompt baseline first
  • [ ] Logged token spend per request type in your observability platform
  • [ ] Set a hard budget cap (tokens, wall-clock seconds, or both) on every multi-call workflow
  • [ ] Documented the failure mode for each step and what user-facing behavior triggers on failure
  • [ ] Wrote at least 20 evaluation cases that exercise both happy path and edge cases
  • [ ] Confirmed the workflow beats the baseline on quality metrics that matter to the user

For the SLI/SLO design that backs the budget caps and eval cases above, agent-observability-metrics covers the metrics layer. If you are picking a model family to run these patterns on, claude-ai-vs-chatgpt-enterprise-comparison compares Claude and ChatGPT for agentic workloads specifically.

Next steps

Re-audit one of your current AI features against the decision tree above. Most teams find that at least one feature is over-engineered with an agent when a workflow pattern would deliver better quality at a fraction of the cost. We help engineering teams refactor agentic systems for cost and reliability. Reach out if you want a code review against the five patterns.
