Workload Economics

Chapters 10–14

Part 2 gave you a vocabulary: roofline, batch frontier, KV capacity, cache break-even, goodput, productive capacity. Those concepts describe what the hardware and software can do. Part 3 asks the question that comes next: what does your workload actually need?

The answer determines everything downstream. A chat workload with human-waiting latency SLOs needs different routing, caching, and capacity from a batch extraction pipeline that runs overnight. A coding agent whose cost hides in sub-agent fanout and repair loops needs different measurement from a single-turn embedding job. A voice system with a 300ms inference budget needs different physics from a document summarization pipeline with a 30-second tolerance.

Part 2 built the supply side. Part 3 builds the demand side. Together they produce a route-fit matrix: given this workload identity, which routes are feasible, which are wasteful, and which need measurement before committing.


Chapter 10: Workload Classes That Change Treatment

The field problem

Four inference workloads run through one API provider with one model: support chat, document extraction, nightly eval runs, and an experimental coding agent. They share the same billing surface, the same rate limits, and the same cost model. The monthly bill arrives. Total spend divided by total requests produces an average cost per request. That average describes none of the four workloads accurately.

The mechanism

A workload class is a category that changes at least one of these: routing, fallback, latency SLO, quality gate, billing surface, caching strategy, monitoring threshold, or operational owner. If a distinction does not change any of those, it is not a workload class—it is a label.

The minimum set of workload classes for most production inference operations:

| Class | Defining constraint | What changes |
|---|---|---|
| Human-waiting synchronous | p95 latency SLO under 5s | Route, model size, batch config, fallback path |
| Human-waiting streaming | Time-to-first-token SLO | Prefill priority, chunked transfer, backpressure |
| Machine-waiting synchronous | Latency SLO of 1-30s | Can tolerate larger batch, longer queue, cheaper route |
| Batch/offline | No per-request latency SLO | Eligible for batch API, off-peak scheduling, retry tolerance |
| Agentic/multi-turn | Task-lifecycle SLO | Fanout accounting, compaction, cache strategy, repair cost |
| Real-time/voice | Sub-second hard ceiling | Dedicated capacity, streaming, thermal stability |

The binding constraint differs by who is waiting. When a human waits—a chatbot reply, a voice response, a code completion—token generation speed dominates. When a machine orchestrates a multi-step task on a queue or an overnight schedule, memory capacity and cost per accepted result dominate, and latency is secondary. This distinction shapes routing and hardware decisions in Parts 4 and 5, but it is not binary: an interactive coding agent has a human in the loop but still runs 20-step tool chains where per-step latency compounds. The useful question is not “answer or agent?” but “what fraction of this workload’s cost is latency-sensitive?” (Thompson, “The Inference Shift,” Stratechery, 2026-05-11.)

The naive answer

“We have one workload—inference.”

This collapses the supply-demand fit. A 50ms TTFT SLO and a next-day batch job cannot share a cost model without one of them getting wrong treatment: the batch job subsidizes the latency premium, or the real-time workload gets routed to a batch-optimized endpoint and violates SLOs.

The better model

Each workload class gets its own row in the workload identity schema. The canonical fields are: routing class (which model or model tier), latency SLO (p50/p95/p99 TTFT and end-to-end), quality gate (eval threshold or human-review trigger), billing surface (per-token, batch, embedding, dedicated), caching strategy (prefix share, TTL, eviction policy), token profile (input/output mean and tail), and owner (the team responsible for SLO violations and cost overruns).

The schema forces the team to state what the workload needs. Not what the average workload needs. What this workload needs. The fields come from Part 2: token counts determine KV pressure, cache prefix tokens determine caching economics, latency SLOs determine batch feasibility, quality gates determine the accepted-work denominator.

Decision rule

If two workloads share the same routing, latency SLO, quality gate, billing surface, caching strategy, and owner, merge them into one class. If any of those differ, they are separate classes. The test is not “do they call the same model?” The test is “would changing the route for one break the other?”
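The merge test above can be sketched as a field-by-field comparison. This is a minimal illustration, not the book's schema: the `WorkloadIdentity` class, its free-form string fields, and the helper names are assumptions, and the token profile is left out because the decision rule does not list it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadIdentity:
    """One row of a hypothetical workload identity schema."""
    name: str              # label only; deliberately not part of the class test
    routing_class: str     # model or model tier
    latency_slo: str       # free-form here, e.g. "p95 e2e < 5s"
    quality_gate: str      # eval threshold or human-review trigger
    billing_surface: str   # per-token, batch, embedding, dedicated
    caching_strategy: str  # prefix share, TTL, eviction policy
    owner: str             # team responsible for SLO and cost


# The six fields named in the decision rule.
TREATMENT_FIELDS = (
    "routing_class", "latency_slo", "quality_gate",
    "billing_surface", "caching_strategy", "owner",
)


def differing_fields(a: WorkloadIdentity, b: WorkloadIdentity) -> list[str]:
    """Return the treatment fields on which two workloads disagree."""
    return [f for f in TREATMENT_FIELDS if getattr(a, f) != getattr(b, f)]


def same_class(a: WorkloadIdentity, b: WorkloadIdentity) -> bool:
    """Two workloads merge only if every treatment field matches."""
    return not differing_fields(a, b)


chat = WorkloadIdentity("support-chat", "mid-tier", "p95 e2e < 5s",
                        "policy rubric", "per-token", "prefix share", "support-eng")
evals = WorkloadIdentity("nightly-eval", "mid-tier", "none",
                         "policy rubric", "batch", "prefix share", "support-eng")
print(same_class(chat, evals), differing_fields(chat, evals))
```

Listing the differing fields, rather than returning only a boolean, shows which treatment would break if the two workloads were forced onto one route.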

What to measure

  • Cost per accepted output token by workload class, not blended

  • Latency percentiles by workload class

  • Cache hit rate by workload class

  • Quality gate pass rate by workload class

  • Retry and repair rate by workload class
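As a rough sketch of the first measurement, the snippet below computes cost per accepted output token per class and compares it with the blended figure. The records, class names, and dollar amounts are invented for illustration; only the grouping logic is the point.

```python
from collections import defaultdict

# Hypothetical request records: (workload_class, cost_usd, output_tokens, accepted)
records = [
    ("support-chat", 0.012, 400, True),
    ("support-chat", 0.015, 450, False),
    ("batch-extraction", 0.004, 500, True),
    ("batch-extraction", 0.004, 520, True),
    ("coding-agent", 0.900, 3000, True),
    ("coding-agent", 1.400, 2500, False),
]


def cost_per_accepted_output_token(rows):
    """All spend (including rejected work) divided by accepted output tokens."""
    spend = sum(cost for _, cost, _, _ in rows)
    accepted_tokens = sum(tokens for _, _, tokens, ok in rows if ok)
    return spend / accepted_tokens if accepted_tokens else float("inf")


by_class = defaultdict(list)
for row in records:
    by_class[row[0]].append(row)

per_class = {name: cost_per_accepted_output_token(rows)
             for name, rows in by_class.items()}
blended = cost_per_accepted_output_token(records)

for name, value in per_class.items():
    print(f"{name}: ${value:.6f} per accepted output token")
print(f"blended: ${blended:.6f} per accepted output token")
```

Rejected work stays in the numerator and leaves the denominator, so a class with a low pass rate shows its true cost instead of hiding inside the average.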

Where this breaks

The classification breaks when workload classes drift. A support chat workload gets extended to handle complex multi-step troubleshooting. The token profile changes, the latency profile changes, the cache hit rate drops, and the quality gate needs updating. The workload identity says “support-chat” but the traffic looks like an agent. Review workload identities quarterly or when failure rates change.

Calculator hook

Input template: one row per workload class. The calculator uses workload identity fields to select the LCPR formula variant, route candidates, and sensitivity parameters.


Chapter 11: Conversational Workloads

The field problem

A customer support system answers policy questions, looks up account state, and takes actions like issuing refunds or escalating tickets. The average request is 3,000 input tokens and 400 output tokens. But the p95 request is 9,000 input tokens and 1,200 output tokens—a customer with a long conversation history, multiple tool calls, and a complex resolution path. The average LCPR says the model is affordable. The p95 LCPR says the fallback path is activating on the most expensive customers.

The mechanism

Chat and support workloads have three economic properties that distinguish them from other classes:

1. Context accumulates within a session. Each turn adds the prior conversation to the input. By turn 8, the input can be 4-6x the initial prompt. Without caching, every turn re-pays for the full history. With caching, the economics depend on session duration, cache TTL, and whether the cached prefix is stable.

2. Tool calls add hidden cost. A support agent that looks up an account, checks order status, and issues a refund makes three tool calls. Each tool response enters the context for the next turn. Tool output tokens are free to generate (the tool produces them, not the model) but expensive to consume (they become input tokens on the next turn).

3. Quality failures cost more than inference. A wrong refund, a privacy breach, or a policy violation costs orders of magnitude more than the model call. The quality gate is not optional, and the quality gate has a cost: deterministic state checks, policy rubric evaluation, human review for high-risk actions.

The naive answer

“Output tokens are expensive, so use a smaller model.”

The smaller model may produce more tool calls, longer chains of reasoning, more repairs, and more human escalations. The LCPR per accepted resolution can increase even when the per-token price decreases.

The better model

Model the session, not the request.

Cache economics dominate sessions longer than 3-4 turns. The cache break-even from Part 2 Chapter 7 applies: if the system prompt and tool definitions are stable across turns, the cached prefix saves re-processing on every turn. The savings compound because the cached portion grows as the conversation grows.
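A minimal sketch of session-level input cost, assuming a simplified billing model: every turn re-sends the full history, and with caching enabled the tokens already seen on a prior turn are billed at a discounted cached-read rate. The function name, the parameter values, and the 0.1 multiplier are illustrative assumptions. Cache-write premiums and TTL expiry are ignored.

```python
def session_input_cost(turns, prefix_tokens, tokens_added_per_turn,
                       input_rate_per_mtok, cached_read_multiplier=None):
    """Total input cost (USD) for one session under a simplified model.

    prefix_tokens: stable system prompt plus tool definitions.
    tokens_added_per_turn: user, tool, and assistant tokens appended each turn.
    cached_read_multiplier: price fraction for cached tokens, or None for no caching.
    """
    total = 0.0
    for turn in range(1, turns + 1):
        history = prefix_tokens + (turn - 1) * tokens_added_per_turn
        new = tokens_added_per_turn
        if cached_read_multiplier is None or turn == 1:
            billable = history + new  # nothing cached yet: full price
        else:
            billable = history * cached_read_multiplier + new
        total += billable * input_rate_per_mtok / 1_000_000
    return total


# Hypothetical session: 2,000-token prefix, 600 tokens added per turn, $3 per MTok.
for n_turns in (2, 4, 8):
    uncached = session_input_cost(n_turns, 2_000, 600, 3.0)
    cached = session_input_cost(n_turns, 2_000, 600, 3.0, cached_read_multiplier=0.1)
    print(f"{n_turns} turns: uncached ${uncached:.4f}, cached ${cached:.4f}")
```

Under these assumptions the gap between the two columns widens with turn count, which is the compounding effect described above.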

Worked example

A regulated-industry support deployment ran Sonnet-class on customer-facing tickets and tested Haiku-class on a 12% canary over 5 weeks. The token-price gap was about 8:1 in favor of Haiku-class. The gap in LCPR per accepted resolution landed at 1.18:1 in the other direction: Haiku-class was 18% more expensive per accepted resolution. The components:

| Metric | Model A (Sonnet-class) | Model B (Haiku-class) |
|---|---|---|
| Avg turns/session | 4.3 | 5.1 |
| Cache hit rate | 0.71 | 0.74 |
| Raw resolution rate | 89% | 83% |
| Repair rate | 4.2% | 6.8% |
| Human escalation rate | 2.7% | 4.1% |
| LCPR per accepted resolution | $0.037 | $0.044 |

Two things did not move where the playbook said they would. The smaller model's cache hit rate was higher, not lower. Shorter outputs left more headroom in the KV pool for cross-session prefix reuse, and the team did not predict that. The resolution-rate gap was 6 points, not 13.

What ate the savings was the escalation tail. The 1.4-point escalation-rate gap, multiplied by roughly $2.20 per escalation in handler time and platform cost, accounted for most of the loaded-cost difference. The smaller model was cheaper per token. It was more expensive per accepted resolution by a margin that took a 5-week canary to surface.

The team kept the Sonnet-class model on customer-facing tickets and moved the Haiku-class model to internal agent-handler workflows where the escalation cost was zero.

Decision rule

For support workloads, optimize for cost per accepted resolution, not cost per token. Measure session length, repair rate, and human escalation rate alongside token cost. If a cheaper model increases repairs enough to offset the token savings, the optimization failed.

What to measure

  • Session length distribution (turns and tokens)

  • Cache hit rate by session length

  • Resolution rate (first-pass, after repair, after human review)

  • Human escalation rate and review minutes

  • Tool call count and tool output token volume

  • Policy violation rate

Where this breaks

This model breaks when the support workload mixes simple lookups and complex resolutions. A “what is your return policy?” question should not carry the cost model of a multi-step account recovery. Split the workload class if the bimodal distribution is large enough to distort routing.

Calculator hook

Session accumulation inputs: turn count, context growth curve, cache hit rate by turn, tool calls per turn, quality gate cost, repair rate, escalation rate.


Chapter 12: Agentic Workloads

The field problem

A product team builds two features. Feature A: a search-and-answer system that retrieves documents, generates a response, and returns it. Feature B: a coding assistant that reads files, plans changes, writes code, runs tests, reads errors, revises, and runs tests again. Both features call the same model API. The monthly bill is $40,000. The team blames the model price.

The trace tells a different story. Feature A makes 1.1 model calls per user request. Feature B makes 23 model calls per user task, with a standard deviation of 40. Feature A’s cost is dominated by output tokens. Feature B’s cost is dominated by input token growth across turns, sub-agent calls, tool output ingestion, cache misses after tool mutations, and repair loops.

The mechanism

Answer inference is a function. Input goes in, output comes out, request count is 1 (or 1 plus a retry), and the economic levers are model selection, prompt engineering, caching, and output length.

Agentic inference is a loop. The model observes, plans, acts, observes the result, and decides whether to continue. Request count is variable because it depends on task difficulty, tool availability, model capability, error handling, and termination policy. Sub-agent architecture, tool-output management, and compaction strategy stack on top of the answer-inference levers.

The denominator changes with the loop. For answer inference, the denominator is requests; a task may take 1 call or 2. For agentic inference, the denominator is accepted tasks, and a task may take 5 calls or 500 depending on the trajectory. Cost forecasts that work for the first will silently break on the second.

The naive answer

“Agentic is just more requests. Price per token still applies.”

Price per token applies to each call. But the number of calls, the context growth per call, the cache hit rate across calls, and the acceptance rate of the final output are all variable. Optimizing per-token price without controlling fanout, context growth, and acceptance rate can increase total cost.

The better model

Use the task lifecycle cost from the agentic economics research. The per-task cost is the sum of per-call costs across the trajectory, divided by the acceptance rate: cost_per_accepted_task = (calls_per_task × avg_tokens_per_call × blended_token_price) / accept_rate. Each variable in that numerator is independently observable, and each is a separate lever.

The fanout multiplier exposes how much hidden work a single user request creates. It is the ratio of total model calls per accepted task to the naive expected calls (typically one for chat, one or two for simple tool use). An agent designed around an expected 3 calls per task that actually makes 8.7 has a fanout multiplier of 2.9: the system is doing 2.9x the work the cost model assumed.
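The per-task cost formula and the fanout multiplier can be sketched as two small functions. The function names and the choice of USD per million tokens as the price unit are assumptions. The 8.7 and 3 come from the text, and the remaining inputs are placeholders.

```python
def cost_per_accepted_task(calls_per_task, avg_tokens_per_call,
                           blended_price_per_mtok, accept_rate):
    """(calls x tokens per call x price) / acceptance rate, in USD."""
    if not 0 < accept_rate <= 1:
        raise ValueError("accept_rate must be in (0, 1]")
    cost_per_attempt = (calls_per_task * avg_tokens_per_call
                        * blended_price_per_mtok / 1_000_000)
    return cost_per_attempt / accept_rate


def fanout_multiplier(actual_calls_per_task, expected_calls_per_task):
    """Ratio of observed calls per accepted task to the calls the cost model assumed."""
    return actual_calls_per_task / expected_calls_per_task


print(round(fanout_multiplier(8.7, 3), 1))  # 2.9

# Hypothetical inputs: 10 calls, 1,000 tokens per call, $2 per MTok, 50% acceptance.
print(cost_per_accepted_task(10, 1_000, 2.0, 0.5))
```

Keeping the four inputs as separate arguments mirrors the point that each one is a separate lever: halving the acceptance rate doubles the result just as doubling the call count does.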

Token usage per task on agentic workloads is a multiple of single-turn chat token usage, not an increment on it. Published measurements span a wide range. Single-agent loops typically consume 3-8x the per-turn token volume of the same-prompt completion, because the agent loop re-passes prior turns and tool outputs. Multi-agent systems compound the multiplier through sub-agent fanout; Anthropic reports about 4x for single-agent and about 15x for multi-agent research workloads. Long-horizon code-task benchmarks (SWE-bench-style, 20-80 turns per task) can consume hundreds to thousands of times the token volume of a single-turn code completion, but that ratio is mostly structural, not a sign of agent inefficiency. A 30-turn solve is a different task from a one-shot answer, not an inefficient version of one [MEASURED: arXiv 2604.22750, “How Do AI Agents Spend Your Money?”, 2026].

The actionable variable is not the ratio. It is whether your agent's per-task fanout matches your cost budget. If yes, the architecture is fine. If not, the lever is sub-agent design and context compaction, not token price.

Sub-agent specialization. Not every step in an agent loop needs a frontier model. A pattern that has shown up in published case studies: identify a frequently repeated, objectively scorable subtask in the agent loop, then route it to a smaller specialized model. The subtask must clear three bars. It must be invoked often enough that the routing logic earns its complexity, typically more than 10% of agent calls. It must have a deterministic correctness check so the specialist's output can be graded automatically. And it must have a stable input distribution so the specialist does not need frequent retraining.

Reported case studies of this pattern include spreadsheet-retrieval specialists in fintech agents (sub-10B active-parameter models matching frontier accuracy on the held-out retrieval eval) and similar work at code-search and structured-extraction-heavy products. Specialization works when those three bars are cleared. When they are not, for open-ended planning, ambiguous tool selection, or multi-step reasoning, the specialist is confidently wrong on out-of-distribution inputs and adds cost without quality.
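The three bars can be written down as an explicit checklist. The sketch below is illustrative: the `Subtask` fields and the function name are assumptions, and the 10% threshold is the "typically" figure from the text rather than a hard rule.

```python
from dataclasses import dataclass


@dataclass
class Subtask:
    name: str
    share_of_agent_calls: float      # fraction of all agent calls, 0..1
    has_deterministic_check: bool    # output can be graded automatically
    stable_input_distribution: bool  # specialist will not need frequent retraining


def specialization_blockers(task: Subtask, min_call_share: float = 0.10) -> list[str]:
    """Return the bars the subtask fails; an empty list means it is a candidate."""
    blockers = []
    if task.share_of_agent_calls <= min_call_share:
        blockers.append("invoked too rarely")
    if not task.has_deterministic_check:
        blockers.append("no deterministic correctness check")
    if not task.stable_input_distribution:
        blockers.append("unstable input distribution")
    return blockers


# Hypothetical subtasks.
print(specialization_blockers(Subtask("spreadsheet retrieval", 0.25, True, True)))
print(specialization_blockers(Subtask("open-ended planning", 0.30, False, False)))
```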

Worked example

A mid-cap fintech ran an internal coding agent serving about 180 engineers across the platform org. The agent handled refactors, test-writing, and cross-service migrations. Volume was bimodal: roughly 93% of tasks ran under 8K total tokens; about 7% of tasks ran north of 280K tokens. The same product had been steady-state at $52-58K monthly inference for the prior quarter.

The page came in at 2:14am on a Friday. The internal SLO dashboard fired on p95 fanout: it had been 4.2 calls per task three weeks earlier and was now reading 8.7 calls per task. The drift had grown over the prior 14 hours. Total request volume was flat. Median fanout was flat at 3.1. Cache hit rate was flat at 0.79. Quality pass rate had ticked down half a point but inside the normal weekly band.

The on-call's first move was the obvious one: a tokenizer change. The provider had pushed a minor SDK update earlier that week, and the mailing list was complaining about input-token accounting drift on Unicode-heavy diffs. Per-task token counts against the provider invoice reconciled within 0.3%. That wasn't it.

Next was a silent model swap on the provider side. The on-call paged the account contact, who confirmed no SKU change in the last 30 days. A fixed set of canary tasks reran through the same API key and produced identical traces against a baseline from a month earlier. That wasn't it either.

The third look was internal. Two weeks earlier the team had shipped a "verify-pass" feature: after the primary fix attempt, an additional model call checked whether the edit had broken adjacent files. The flag rolled out to 10% of users in week one, then 100% in week two. Code review had estimated 2-4 calls per task for verify-pass, and for the median task that was what the system was doing.

What disambiguated was the per-task fanout distribution sliced by task size:

| Task size bucket | Volume share | Median fanout (pre / post) | P95 fanout (pre / post) |
|---|---|---|---|
| Under 8K tokens | ~93% | 2.8 / 4.1 | 5 / 7 |
| 8K–280K tokens | ~0.4% | 6.2 / 9.0 | 14 / 22 |
| Over 280K tokens | ~6.8% | 11.4 / 23.7 | 27 / 61 |

The verify-pass had landed cleanly on the short-tail task distribution. On the long tail it had not. Verify-pass on a 280K-token task was re-loading large file sets, re-evaluating the patch against those files, and on a non-trivial subset entering a re-plan loop with the primary fix-attempt. The 7% of tasks that already accounted for roughly a third of total cost were now accounting for closer to half. The mean dashboard was muted because volume share on the long tail was unchanged. The cost share was not.

The fix capped verify-pass token budget at 1.3x the primary-fix budget for that task. On cap-hit, the system surfaced "verify could not complete inside budget" to the user and recorded the partial result. Within 36 hours p95 fanout settled back to 4.5 calls per task and monthly run-rate trajectory dropped from a projected ~$87K back toward $61K, consistent with steady-state volume growth.
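A minimal sketch of that cap, assuming the verify pass can be modeled as a sequence of calls with known token costs. The 1.3x ratio and the cap-hit message come from the fix described above. The function name, the step representation, and the return shape are hypothetical.

```python
VERIFY_BUDGET_RATIO = 1.3  # verify-pass cap relative to the primary-fix budget


def run_verify_pass(primary_fix_budget_tokens, verify_step_tokens):
    """Run verify steps until done or until the next step would exceed the cap.

    verify_step_tokens: iterable of token costs, one per verify call.
    """
    cap = VERIFY_BUDGET_RATIO * primary_fix_budget_tokens
    spent = 0
    completed = 0
    for step_tokens in verify_step_tokens:
        if spent + step_tokens > cap:
            # Cap hit: stop spending and surface the partial result to the user.
            return {"status": "verify could not complete inside budget",
                    "tokens_spent": spent, "steps_completed": completed}
        spent += step_tokens
        completed += 1
    return {"status": "verified", "tokens_spent": spent, "steps_completed": completed}


print(run_verify_pass(10_000, [4_000, 4_000, 4_000]))  # finishes under the 13,000 cap
print(run_verify_pass(10_000, [6_000, 6_000, 6_000]))  # third step would exceed the cap
```

Checking the budget before each step, rather than after, keeps the worst case bounded by the cap instead of by the cap plus one large step.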

One side-finding did not fit the cost-cap narrative. Once the team had per-task fanout instrumented, they could see that the primary fix pass had also been over-budget on a smaller subset, roughly 1.5% of volume, but those tasks were producing the highest user-satisfaction scores. Fanout there was paying for itself. Capping fanout everywhere would have cost the team its best results. The lesson was not "cap fanout." It was "cap fanout where it does not buy quality, and instrument well enough to tell the difference."

The dashboard lesson is the one worth printing. Total requests, p99 latency, pass rate, cache hit rate all looked fine. The metric that moved was a percentile of a distribution the dashboard was not aggregating. The first thing to add to any agentic-workload dashboard is the per-task fanout distribution sliced by task size, not the mean.
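One way to build that view is to bucket tasks by total tokens and report median and p95 fanout per bucket. The sketch below uses a nearest-rank percentile. The bucket boundaries, field layout, and sample data are assumptions for illustration.

```python
import math
from collections import defaultdict
from statistics import median


def size_bucket(total_tokens):
    """Assumed boundaries: under 8K, 8K to 280K, over 280K total tokens."""
    if total_tokens < 8_000:
        return "under_8k"
    if total_tokens <= 280_000:
        return "8k_to_280k"
    return "over_280k"


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def fanout_by_bucket(tasks):
    """tasks: iterable of (total_tokens, model_calls) pairs, one per task."""
    grouped = defaultdict(list)
    for total_tokens, calls in tasks:
        grouped[size_bucket(total_tokens)].append(calls)
    return {bucket: {"n": len(calls),
                     "median": median(calls),
                     "p95": percentile(calls, 95)}
            for bucket, calls in grouped.items()}


# Hypothetical task log.
sample = [(5_000, 3), (6_000, 4), (7_000, 5), (300_000, 20), (400_000, 60)]
for bucket, stats in fanout_by_bucket(sample).items():
    print(bucket, stats)
```

Reporting `n` next to each percentile matters here: a p95 over a handful of long-tail tasks is noisy, and the dashboard should make that visible.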

Code generation and coding agents

Code generation is the most visible agentic workload, and it illustrates all the economic pressures above in concentrated form.

A coding agent processes 200 tasks per day. Each task is a GitHub issue. The agent reads the issue, explores the codebase, plans a fix, implements changes, runs tests, and submits a pull request. The average task takes 23 model calls. But the distribution is bimodal: simple tasks average 8 calls; complex tasks average 65 calls. The team reports a 72% acceptance rate. The cost per accepted task varies by 40x between the simplest and most complex quartile.

Code generation and coding agents combine three economic pressures from Part 2:

1. Large context, growing context. Coding agents need file contents, test outputs, error messages, and prior conversation in context. A session can grow from 15K tokens at the start to 150K tokens before compaction. Compaction resets context but may lose information that causes repair failures downstream.

2. High cache opportunity, fragile cache. System prompts, tool definitions, and project-level context are stable across turns—ideal for prefix caching. But tool outputs (file contents, test results, error traces) mutate the conversation, and any change that breaks the cached prefix forces a full re-prefill of the subsequent content. Cache-safe prompt design means placing stable content before dynamic content and making tool outputs append-only where possible.

3. Variable fanout. Sub-agent calls (for file search, test execution, linting) multiply model calls. A simple rename task might use one sub-agent. A complex refactor might spawn 8 sub-agents, each making 5-15 model calls with their own context windows.
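The cache-safe layout in point 2 can be illustrated by measuring how much of one turn's prompt is an exact prefix of the next. In this sketch each list element stands in for a run of tokens. Real prefix caches match on tokens or blocks, so the segment granularity and all names here are simplifying assumptions.

```python
def shared_prefix_len(a, b):
    """Number of leading segments that are identical in both prompts."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


STABLE = ["<system prompt>", "<tool definitions>", "<project context>"]

# Cache-friendly: stable segments first, conversation history append-only.
turn1 = STABLE + ["user: fix the bug", "tool: test output v1"]
turn2 = turn1 + ["assistant: patch", "tool: test output v2"]

# Cache-hostile: a segment that changes every turn sits ahead of the stable content.
bad1 = ["tool: latest test output v1"] + STABLE + ["user: fix the bug"]
bad2 = ["tool: latest test output v2"] + STABLE + ["user: fix the bug", "assistant: patch"]

print(shared_prefix_len(turn1, turn2))  # 5: all of turn 1 is reusable
print(shared_prefix_len(bad1, bad2))    # 0: the first segment already differs
```

The second layout contains the same stable content as the first, but none of it is reusable because the mutation sits in front of it.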

A turn cap controls worst-case cost but not accepted-task rate. If the cap is too aggressive, complex tasks fail, repair rate increases, and cost per accepted task rises because the successes do not amortize the failures. The better control is a cost budget per task with an escalation path: if the agent exceeds the budget, it surfaces the partial result for human decision rather than continuing to spend.

Track the full agent trace:

| Component | What to track | Why it matters |
|---|---|---|
| Main agent turns | Count and context length per turn | Context growth drives input cost |
| Sub-agent launches | Count, model, token volume | Sub-agents can dominate total cost |
| Tool calls | Count, class, output size | Tool output enters context |
| Compaction events | Before/after tokens, cache impact | Compaction is a cost/quality tradeoff |
| Test execution | Count, pass/fail, token cost of error parsing | Failed tests trigger repair loops |
| Repair loops | Count, additional tokens, success rate | Repair cost often exceeds initial attempt |
| Final outcome | Accepted, repaired-accepted, rejected, abandoned | The denominator |

The cost per accepted task for coding agents can be modeled as (input_tokens × input_rate + output_tokens × output_rate) × (1 + repair_rate) / accept_rate, where token counts are summed across the full trajectory (initial fix, verify pass, any re-plan loop) and the (1 + repair_rate) factor accounts for trajectories that complete but require a follow-up fix.
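That formula translates directly into a function. The sketch below assumes rates are quoted in USD per million tokens. The function name and the sample inputs are hypothetical, except the 72% acceptance rate, which is borrowed from the scenario earlier in this section.

```python
def coding_cost_per_accepted_task(input_tokens, output_tokens,
                                  input_rate_per_mtok, output_rate_per_mtok,
                                  repair_rate, accept_rate):
    """Trajectory cost x (1 + repair_rate) / accept_rate, in USD.

    Token counts are summed across the full trajectory
    (initial fix, verify pass, any re-plan loop).
    """
    if not 0 < accept_rate <= 1:
        raise ValueError("accept_rate must be in (0, 1]")
    trajectory_cost = (input_tokens * input_rate_per_mtok
                       + output_tokens * output_rate_per_mtok) / 1_000_000
    return trajectory_cost * (1 + repair_rate) / accept_rate


# Hypothetical trajectory: 400K input and 20K output tokens at $3 / $15 per MTok,
# 10% repair rate, 72% acceptance.
cost = coding_cost_per_accepted_task(400_000, 20_000, 3.0, 15.0, 0.10, 0.72)
print(f"${cost:.2f} per accepted task")
```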

Decision rule

For agentic workloads, the economic unit is the accepted task, not the request. Measure and report: fanout multiplier, token fanout multiplier, cache hit rate across the trajectory, compaction frequency, repair rate, and cost per accepted task. If the fanout multiplier exceeds your budget model, the problem is architecture (sub-agent design, tool output management, termination conditions), not token price.

For coding agents specifically, measure fanout multiplier, context growth curve, cache hit rate, and accepted-task rate. The cheapest model is the one that minimizes cost per accepted task, not cost per token. If two models have the same per-token price and similar token usage per task but different acceptance rates, the one with higher acceptance wins.

What to measure

  • Fanout multiplier distribution (calls per task)

  • Token fanout multiplier distribution

  • Context growth curve per task (input tokens by call index)

  • Cache hit rate by call index within a task

  • Compaction events and post-compaction context size

  • Tool output tokens added to context per call

  • Repair rate and repair cost

  • Accepted task rate

  • Cost per accepted task

  • Task complexity distribution (simple, medium, complex)

  • Fanout multiplier by complexity tier

  • Test pass rate on first attempt

  • Repair loop count and repair success rate

  • Accepted task rate by complexity tier

  • Cost per accepted task by complexity tier

Where this breaks

This model breaks when agent tasks vary by three orders of magnitude in difficulty. A “rename this variable” task and a “refactor the authentication system” task are both coding-agent work, but their fanout, context growth, and failure modes are different enough to need separate workload identity rows. Split agentic workloads by task complexity tier if the variance is too high for a single cost model.

For coding agents, the model also breaks when tasks have no clear acceptance criterion. “Improve the codebase” is not a measurable task. Without a defined acceptance criterion (tests pass, linter clean, PR approved, issue closed), the denominator is undefined and cost per accepted task is meaningless. Define the acceptance criterion before optimizing.

For background agents with no human in the loop, the speed premium on sub-agent routing largely disappears. If the orchestrating agent runs overnight, the latency difference between a frontier model and a specialist matters less than the cost difference. The relevant metric shifts from time-to-first-token to cost per accepted result.

Sub-agent specialization requires the subtask to be deterministically verifiable. Most agent subtasks—planning, reasoning, ambiguous tool selection—are not. Treating this pattern as general-purpose will produce models that are confidently wrong on tasks outside the training distribution.

Calculator hook

Agentic variant: task-level aggregation, fanout multiplier, cache hit rate by turn, compaction cost, repair rate, accepted task rate. Coding-specific fields: sub-agent count, tool output volume, test execution count, repair loop count, compaction events, acceptance rate by complexity tier. Sensitivity: fanout multiplier, cache hit rate, repair rate.


Chapter 13: RAG And Document Extraction

The field problem

A retrieval-augmented generation pipeline retrieves 8-12 document chunks, constructs a prompt with the retrieved context, and generates an answer with citations. The pipeline works. Then the team adds longer documents, increases the chunk count for better recall, and extends the context window to 32K tokens. The latency increases. The cost increases. The quality does not improve proportionally because more retrieved context means more noise, more distractor passages, and more opportunities for the model to hallucinate a plausible-sounding answer from an irrelevant chunk.

The mechanism

RAG economics have three layers that interact:

1. Retrieval cost. Embedding the query, searching the vector store, reranking candidates. This is usually cheap per query but scales with corpus size, index freshness, and reranking depth. Cohere’s rerank pricing uses search units (one query with up to 100 documents), not tokens—a different billing grammar entirely.

2. Generation cost. The retrieved chunks become input tokens. More chunks mean more input tokens. The relationship is direct: double the retrieved context, roughly double the input cost. But the generation quality curve is not linear. Past a saturation point, additional retrieved context adds noise without improving answer quality.

3. Acceptance cost. RAG answers need grounding checks. Did the answer use the retrieved passages? Did it hallucinate facts not in the retrieved set? Did it cite the correct passages? Schema validation, field-level matching, and grounding audits have their own cost.

In an enterprise RAG system, data residency and multi-tenancy requirements eliminated the cheapest and fastest vector database options before any performance comparison. The latency target was 10-15 seconds end-to-end, not because the team was slow, but because retrieval, generation, and compliance checks compound. Security constraints shrink the feasible set before optimization begins.

The naive answer

“Use a bigger context window and retrieve more chunks.”

More chunks cost more tokens. More tokens do not guarantee better answers. And the latency penalty of longer prefill grows quadratically with sequence length for dense attention models, though many modern architectures use approximations that soften this. The economics of context length were covered in Part 2 Chapter 6.

The better model

Optimize the retrieval-generation boundary, not just the model.

The retrieval quality gate matters more than the generation cost. A pipeline that retrieves the right 4 chunks and generates from 4K context tokens outperforms a pipeline that retrieves 12 chunks with 3 irrelevant distractors and generates from 16K context tokens—at lower cost and lower latency.

In an enterprise RAG deployment, framework orchestration abstractions accounted for 60% of agent response latency. Replacing the hot path dropped response time from 8 seconds to 3 seconds. Framework overhead is not free, and it compounds across every request. Measure it before assuming it is negligible.

Document extraction variant

Document extraction is RAG’s sibling with different economics. Instead of retrieving from a corpus and generating an answer, extraction takes a single document (or document set) and produces structured output: fields, tables, classifications, summaries.

The economics differ because:

  • Input is one document, not retrieved chunks. The document length is the input length.

  • Output is structured. Schema validation is cheap and deterministic.

  • Batch processing is usually feasible. No human is waiting.

  • Quality is measured by field-level accuracy, not answer groundedness.

Extraction workloads are often batch-eligible. The roughly 50% batch discount available from OpenAI, Anthropic, Google, and several serverless open-model providers (see pricing snapshot 2026-05-12) is real savings when latency is not a constraint. Eligibility, stacking with other discounts, and supported models vary by provider.

The other extraction cost driver is retry, and retries hide. A mid-stage legal-tech company ran a backfill: 2.1 million historical contract documents, average 8,400 tokens each, English with a long tail of Spanish-language Latin-American jurisdiction documents. Project plan was 6 weeks at $11-13K. Week 9, $34K spent, 58% complete. The program manager flagged it.

The debug path was a punch-list, not a hypothesis chain. The team started with the obvious lever—eval pass cost—and ruled it out at about 12% of spend. They scanned the long-document tail next; outliers existed but were too rare to dominate. They sanity-checked model selection (a Llama-class 70B priced reasonably) and confirmed it was not mispriced for the workload. Only then did they pull request volume by week. Volume had doubled in week 4 with no traffic-mix change, and about half of it was flagged retry. Grouping retry reasons surfaced the actual driver: 47% were JSON-validation failures on a single field, governing_law, where the model emitted jurisdiction strings inconsistently for Spanish-language documents.

The schema required governing_law as a controlled vocabulary of about 80 jurisdictions. For common US/UK jurisdictions the model handled it cleanly. For Latin-American documents it emitted "México" / "Mexico" / "MEXICO" / "Mexican Federal" / "United Mexican States" interchangeably, all valid variants of the same jurisdiction. JSON validation rejected anything not in the vocabulary. Each retry re-paid the full input cost—roughly 8,400 input tokens plus ~500 output tokens at $0.60/$1.70 per MTok, about $0.006 per retry—for what was effectively a sub-cent string normalization problem at scale.
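The per-retry figure can be reproduced from the numbers in the paragraph above. The function name is an assumption. The token counts, rates, and document count are the ones stated in this case.

```python
def call_cost(input_tokens, output_tokens, input_rate_per_mtok, output_rate_per_mtok):
    """Cost in USD of one model call, with rates in USD per million tokens."""
    return (input_tokens * input_rate_per_mtok
            + output_tokens * output_rate_per_mtok) / 1_000_000


per_call = call_cost(8_400, 500, 0.60, 1.70)
print(f"per call or retry: ${per_call:.5f}")  # $0.00589, about $0.006

# The same unit cost across 2.1 million documents with no retries at all:
print(f"no-retry corpus cost: ${2_100_000 * per_call:,.0f}")
```

The second line of output gives a no-retry baseline for the whole corpus, which is the figure every retry is added on top of.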

The fix was a deterministic post-process: a string-to-canonical jurisdiction lookup running before the JSON validation gate. Retries dropped from 47% to 4.1%. Throughput tripled. The remaining 42% of corpus completed in 11 days at about $4,800. Total backfill cost landed at $39K against the $13K projection, but the steady-state run rate was 70% cheaper than the worst-week run rate. The regex-plus-locale-aware-tokenizer layer cost on the order of microseconds per document and orders of magnitude less than the LLM call it gated.
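A minimal sketch of a normalize-then-validate step, assuming a small alias table keyed on accent-stripped, case-folded strings. The variant spellings are the ones quoted above. The canonical label "Mexico", the vocabulary contents, and all function names are hypothetical, and the team's actual lookup is not shown in the source.

```python
import unicodedata


def fold(text):
    """Strip accents, case, and extra whitespace: 'México' -> 'mexico'."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.casefold().split())


# Hypothetical controlled vocabulary (the real one had about 80 jurisdictions).
CONTROLLED_VOCAB = {"Mexico", "New York", "England and Wales"}

# Hypothetical alias table: folded variant -> canonical vocabulary entry.
ALIASES = {fold(variant): canonical for variant, canonical in {
    "Mexico": "Mexico",
    "Mexican Federal": "Mexico",
    "United Mexican States": "Mexico",
}.items()}


def normalize_governing_law(raw):
    """Map known variants to the canonical entry; unknown strings pass through."""
    return ALIASES.get(fold(raw), raw)


def validate(record):
    """Stand-in for the JSON validation gate on this one field."""
    return record.get("governing_law") in CONTROLLED_VOCAB


def normalize_then_validate(record):
    cleaned = dict(record)
    cleaned["governing_law"] = normalize_governing_law(record.get("governing_law", ""))
    return cleaned, validate(cleaned)


for raw in ["México", "Mexico", "MEXICO", "Mexican Federal", "United Mexican States"]:
    print(raw, "->", normalize_then_validate({"governing_law": raw}))
```

Unknown strings are passed through unchanged so they still fail validation and reach the retry or review path, instead of being silently coerced to a wrong jurisdiction.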

The side-finding the team did not initially see: about 6% of the retry traffic was not jurisdiction normalization at all. It was a different schema failure on effective_date for documents using non-Gregorian calendar references. That failure was carved out as a known-issue for a later remediation. The retrospective initially framed the incident as "we lost a month to schema rigidity." The side-finding reframed it: "every controlled-vocabulary schema must be paired with deterministic normalization before validation gates." That generalization mattered more than the specific fix.

Decision rule

For RAG: optimize retrieval precision before increasing context length. Measure cost per grounded answer, not cost per generation. For extraction: test batch eligibility first—if the workload tolerates async processing, the batch discount is the largest single cost lever.
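The difference between cost per generation and cost per grounded answer is a change of denominator. A minimal sketch, assuming the grounding check yields a simple pass rate; every number in the example is hypothetical:

```python
def cost_per_grounded_answer(total_cost_usd: float, generations: int,
                             grounding_pass_rate: float) -> float:
    """Total pipeline cost divided by the answers that pass the grounding check."""
    grounded = generations * grounding_pass_rate
    if grounded <= 0:
        raise ValueError("no grounded answers to divide by")
    return total_cost_usd / grounded


# Hypothetical month: $1,000 total (retrieval + generation + grounding checks),
# 100,000 generations, 80% of which pass the grounding check.
per_generation = 1_000 / 100_000
per_grounded = cost_per_grounded_answer(1_000, 100_000, 0.80)
print(f"per generation: ${per_generation:.4f}, per grounded answer: ${per_grounded:.4f}")
```

Under these assumed numbers the grounded-answer cost is 25% higher than the per-generation cost, and the gap widens as the pass rate falls.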

What to measure

  • Retrieval precision and recall at the chunk level

  • Retrieved context token volume vs generation quality

  • Grounding check pass rate

  • End-to-end latency decomposition (retrieval, generation, validation)

  • Framework overhead as a share of end-to-end latency

  • Batch-eligible share of extraction volume

  • Cost per grounded answer (RAG) or per validated extraction (extraction)

Where this breaks

RAG economics change when the pipeline becomes agentic. A multi-hop RAG system that retrieves, reasons, retrieves again, and synthesizes is an agent, not a single-turn RAG pipeline. Apply the agentic economics model from Chapter 12 once the retrieval loop exceeds one round trip.

Calculator hook

RAG variant: retrieval cost (embedding + search + rerank), generation cost (retrieved tokens as input), grounding check cost, batch discount for extraction. Sensitivity: chunk count, retrieval precision, batch eligibility.


Chapter 14: Offline, Voice, and Specialized Workloads

The field problem

A real-time conversational system (a voice agent in customer service, a clinical scheduling line, an in-vehicle assistant) runs against a sub-2-second end-to-end response budget measured from the moment the speaker stops talking. The budget gets carved up across components before any inference begins: roughly 200-400ms to ASR, around 50ms to intent or routing, 200-400ms to TTS, and 100-200ms to network and orchestration. What is left over, typically 300-800ms, is the LLM budget. There is no room for retry, fallback, or queue delay inside that budget. Either the path through the stack stays inside, or the system misses the user's expected response window.

The mechanism

Voice and real-time workloads have economics that differ from text in three ways:

1. Billing units change. Some voice APIs bill by audio minute, not by token. OpenAI’s Realtime API bills audio input at $32/MTok and audio output at $64/MTok, while their Whisper model bills at $0.017/minute. Google’s TTS bills audio output at $20/MTok and states 25 tokens per second for audio. Together’s Whisper billing is $0.0015/audio minute for standard and $0.27/minute for streaming. The billing grammar from the provider pricing research applies: know whether you are paying per token, per minute, or per session.

2. Latency is a hard constraint, not a target. In text chat, a slow response is annoying. In voice, a slow response breaks the conversation. Users abandon or talk over the system. The latency SLO is a ceiling, and the inference budget is whatever is left after ASR, TTS, network, and orchestration consume their share.

3. Fallback changes the economic structure. When the LLM path fails or exceeds the latency budget, the system falls back to rule-based responses for common intents.

In a production voice system, fallback rate was below 1% under normal load and reached 15-20% during incidents. The fallback path—rule-based responses instead of LLM generation—had lower inference cost but higher customer escalation risk. Fallback is an economic mechanism: it shifts cost from inference to support burden. The trace must capture both paths.
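To compare the billing units in item 1, token-billed audio has to be converted to a per-minute rate. A minimal sketch using the Google TTS figures quoted above ($20/MTok at 25 tokens per second); the function name is an assumption. The conversion only works when the provider states a tokens-per-second rate, so it is not applied here to prices where the text gives none.

```python
def token_billed_cost_per_minute(price_per_mtok: float, tokens_per_second: float) -> float:
    """Convert a per-million-token audio price into dollars per audio minute."""
    tokens_per_minute = tokens_per_second * 60
    return tokens_per_minute * price_per_mtok / 1_000_000


# Figures from the text: $20/MTok audio output at a stated 25 tokens per second.
tts_per_minute = token_billed_cost_per_minute(20.0, 25)
print(f"token-billed TTS: ${tts_per_minute:.3f} per audio minute")
```

Once every audio line is expressed per minute, per-token and per-minute services can sit in the same cost table, though they may still cover different functions (speech-to-text versus text-to-speech).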

The naive answer

“Use a faster model.”

A faster model helps if the model is the bottleneck. If the bottleneck is ASR, TTS, network, or orchestration, a faster model does not recover the latency budget. Decompose the end-to-end latency before optimizing any single component.

The better model

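Model the latency budget as a waterfall: start from the end-to-end response budget and subtract ASR, intent or routing, TTS, and network and orchestration in turn.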

The LLM budget is the residual. It determines which models are feasible, which quantization levels are required, and whether dedicated capacity is needed to guarantee TTFT.
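A minimal sketch of the residual calculation, using the component ranges from the field problem. The 1,350 ms end-to-end target is an assumption: the text only says sub-2-second, and 1,350 ms is chosen because it reproduces the 300-800ms LLM budget quoted there.

```python
# Component budgets in milliseconds as (low, high), taken from the ranges in the text.
COMPONENTS_MS = {
    "asr": (200, 400),
    "intent_routing": (50, 50),
    "tts": (200, 400),
    "network_orchestration": (100, 200),
}


def llm_residual_ms(end_to_end_ms: int,
                    components: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Return (worst_case, best_case) LLM budget after the other components are paid."""
    low_total = sum(low for low, _ in components.values())
    high_total = sum(high for _, high in components.values())
    return end_to_end_ms - high_total, end_to_end_ms - low_total


# ASSUMPTION: 1,350 ms end-to-end target (see the note above the block).
worst, best = llm_residual_ms(1_350, COMPONENTS_MS)
print(f"LLM budget: {worst} ms worst case, {best} ms best case")
```

Planning against the worst-case residual is the conservative choice: a model that only fits the best-case budget misses the window whenever ASR or TTS runs slow.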

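The cost model for voice includes both paths: the LLM path, and the rule-based fallback path with its lower inference cost and higher escalation risk.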
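One way to express a two-path cost model is as an expected cost per interaction, weighted by fallback rate. This is a sketch under stated assumptions: the dollar amounts and escalation rates are hypothetical, and only the fallback rates (about 1% under normal load, up to 20% during incidents) come from the text.

```python
from dataclasses import dataclass


@dataclass
class Path:
    inference_cost_usd: float  # model or rule-engine cost per interaction
    escalation_rate: float     # share of interactions on this path that escalate


def expected_cost_per_interaction(llm: Path, fallback: Path, fallback_rate: float,
                                  escalation_cost_usd: float) -> float:
    """Blend both paths, charging each its inference cost plus expected escalation cost."""
    def path_cost(path: Path) -> float:
        return path.inference_cost_usd + path.escalation_rate * escalation_cost_usd

    return (1 - fallback_rate) * path_cost(llm) + fallback_rate * path_cost(fallback)


# Hypothetical inputs: the fallback path is free at the API but escalates more often.
llm_path = Path(inference_cost_usd=0.02, escalation_rate=0.05)
fallback_path = Path(inference_cost_usd=0.00, escalation_rate=0.20)
ESCALATION_COST_USD = 5.00  # hypothetical cost of one human-handled escalation

normal = expected_cost_per_interaction(llm_path, fallback_path, 0.01, ESCALATION_COST_USD)
incident = expected_cost_per_interaction(llm_path, fallback_path, 0.20, ESCALATION_COST_USD)
print(f"normal load: ${normal:.4f}, incident: ${incident:.4f}")
```

With these assumed inputs the fallback path has zero inference cost yet raises the blended cost during incidents, because escalation cost outweighs the inference savings.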

Embeddings, Batch, Evals, and Offline Workloads

Three offline inference workloads share the same standard API: embedding generation for a RAG corpus (2M documents), nightly evaluation of model quality (5,000 test cases), and weekly batch extraction from customer support transcripts (50,000 documents). All three pay real-time prices with real-time latency guarantees they do not need.

Offline workloads are defined by one property: no human is waiting for the response. This changes everything about the economics:

1. Batch APIs offer ~50% discounts. OpenAI, Anthropic, and Google offer batch processing at approximately 50% of standard rates for select models. Fireworks and Groq offer similar batch discounts on supported models. The trade is latency for cost: batch jobs complete within hours to days rather than in real time. Completion windows, eligible models, and discount stacking rules vary by provider—check the current pricing page before assuming 50% applies to your model.

2. Embedding billing is different from generation billing. Embedding models bill per input token with no output token charge (the output is a vector, not text). Fireworks prices embedding by base model parameter count: up to 150M parameters at $0.008/MTok, 150-350M at $0.016/MTok. Together prices at $0.02/MTok. The billing unit is input-only.

3. Eval workloads have compounding cost. A quality eval that runs 5,000 test cases through a model-based grader is itself an inference workload. If the grader is the same model being evaluated, the eval cost can approach the production cost. Deterministic checks (schema validation, exact match, regex) cost nothing at the model API level. Use them first.

4. Rate limits may not bind. Some batch APIs do not count against standard rate limits (Groq is one example). This means offline workloads may coexist with production traffic without competing for quota. Check provider documentation—not all batch surfaces have separate rate limits.

The naive answer

“Offline work is cheap—just run it.”

Offline work is cheaper per token, but volume can dominate. Embedding 2M documents at 500 tokens per document is 1 billion input tokens. At $0.02/MTok, that is $20. At $2/MTok on a frontier generation model, that would be $2,000. The billing unit and model choice matter more than the batch discount for high-volume offline work.
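The arithmetic above can be reproduced in a few lines. The corpus size, document length, and per-MTok prices come from the text; the function name is an assumption, and the 0.5 batch multiplier assumes the approximate 50% discount applies to the model in question.

```python
def token_cost_usd(documents: int, tokens_per_document: int,
                   price_per_mtok: float, batch_multiplier: float = 1.0) -> float:
    """Input-token cost for a corpus, with an optional batch price multiplier."""
    total_tokens = documents * tokens_per_document
    return total_tokens / 1_000_000 * price_per_mtok * batch_multiplier


# 2M documents at 500 tokens each is 1 billion input tokens.
embedding = token_cost_usd(2_000_000, 500, 0.02)
frontier = token_cost_usd(2_000_000, 500, 2.00)
# ASSUMPTION: the ~50% batch discount applies to the frontier model.
frontier_batched = token_cost_usd(2_000_000, 500, 2.00, batch_multiplier=0.5)

print(f"embedding model: ${embedding:,.0f}")
print(f"frontier model: ${frontier:,.0f}")
print(f"frontier model with batch discount: ${frontier_batched:,.0f}")
```

Even with the batch discount applied, the frontier-model line stays far above the embedding-model line, which is the sense in which model choice outweighs the discount at this volume.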

The better model

Separate offline workloads by billing grammar:

| Workload | Billing unit | Batch eligible | Volume driver | Cost lever |
| --- | --- | --- | --- | --- |
| Embedding generation | Input tokens only | Yes | Corpus size × avg doc tokens | Model size, quantization |
| Eval grading | Input + output tokens | Yes | Test cases × (input + grader output) | Deterministic checks first |
| Batch extraction | Input + output tokens | Yes | Document count × avg doc tokens | Batch discount, output control |
| Backfill/migration | Input + output tokens | Yes | Historical records × schema | Batch discount, parallelism |
| Reranking | Search units (query + docs) | Depends on provider | Query count × docs per query | Rerank depth, doc chunking |

For eval workloads specifically, the cost model should separate three tiers: deterministic checks, model-based grading, and human review.

Run deterministic checks first. Only send to model grading what passes deterministic gates. Only send to human review what the model grader flags as ambiguous or low-confidence.
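The tiering above can be sketched as a funnel in which each stage only pays for the cases the previous stage lets through. The 5,000-case count matches the nightly eval in the field problem; the pass rate, ambiguity rate, and per-case costs are hypothetical.

```python
def eval_cost_usd(test_cases: int, deterministic_pass_rate: float,
                  grader_cost_per_case: float, ambiguous_rate: float,
                  human_cost_per_case: float) -> dict[str, float]:
    """Cost per tier when each tier only sees what the previous tier passes on."""
    to_grader = test_cases * deterministic_pass_rate  # survivors of deterministic gates
    to_human = to_grader * ambiguous_rate             # cases the grader cannot settle
    return {
        "deterministic": 0.0,  # no model API cost for schema, exact-match, regex checks
        "model_grading": to_grader * grader_cost_per_case,
        "human_review": to_human * human_cost_per_case,
    }


# Hypothetical rates and unit costs for a 5,000-case nightly eval.
tiers = eval_cost_usd(5_000, deterministic_pass_rate=0.60,
                      grader_cost_per_case=0.01, ambiguous_rate=0.05,
                      human_cost_per_case=2.00)
print(tiers, "total:", round(sum(tiers.values()), 2))
```

Under these assumptions human review is the largest line despite touching the fewest cases, so the ambiguity rate of the model grader is worth tracking alongside its token cost.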

Decision rule

For voice workloads, the serving physics constraint (Part 2) is primary. Decompose end-to-end latency into component budgets. The model choice is constrained by the residual LLM budget, not by token price alone. Include fallback rate and escalation cost in the economic model.

For offline workloads, use batch APIs when available and the completion window is acceptable. Use embedding-specific models for embedding work—do not generate embeddings with frontier generation models. Run deterministic eval checks before model-based grading. Separate offline workload cost from production workload cost in reporting.

What to measure

  • End-to-end latency waterfall (ASR, intent, LLM, TTS, network, orchestration)

  • LLM budget utilization (actual vs allocated)

  • TTFT and TPS under production concurrency

  • Fallback activation rate and escalation rate from fallback path

  • Cost per completed interaction (both LLM and fallback paths)

  • Thermal throttling events (if edge-deployed)

  • Batch vs real-time share of total inference spend

  • Batch completion time distribution

  • Embedding corpus size and refresh frequency

  • Eval cost as a percentage of production inference cost

  • Deterministic check coverage (share of evals that do not need model grading)

Where this breaks

  • Voice economics change when the system becomes multi-modal. An agent that handles voice input, visual menus, receipt images, and audio responses has billing units across text, audio, and image tokens simultaneously.

  • Batch economics break when “offline” becomes “near-real-time.” A system that needs extraction results within 15 minutes is not batch-eligible for a 24-hour batch API. Check the provider’s batch completion window against your actual latency requirement.

  • Batch and cache discounts may not stack. Groq explicitly states that prompt caching does not stack with batch discount. Check per-provider stacking rules.

Calculator hook

Voice variant: latency budget waterfall, LLM residual budget, fallback rate, escalation cost, multi-modal billing lines. Batch variant: batch multiplier, embedding-only billing, eval cost model (deterministic + model + human), volume-based costing. Sensitivity: fallback rate, LLM budget, batch eligibility share, corpus size.


Part 3 Summary

Core vocabulary from Part 3:

| Concept | Where defined | What it does |
| --- | --- | --- |
| Workload class | Chapter 10 | Separates workloads that need different treatment |
| Workload identity schema | Chapter 10 | Canonical fields that describe a workload’s needs |
| Session cost | Chapter 11 | Models context accumulation and quality gate cost |
| Task lifecycle cost | Chapter 12 | Models agentic fanout, repair, and acceptance |
| Fanout multiplier | Chapter 12 | Exposes hidden model calls per user task |
| Retrieval-generation boundary | Chapter 13 | Optimizes RAG cost at the right layer |
| Latency budget waterfall | Chapter 14 | Decomposes real-time constraints by component |
| Billing grammar | Chapter 14 | Matches billing unit to workload type |


Evidence Notes for Part 3

| # | Claim | Label | Source | Chapter |
| --- | --- | --- | --- | --- |
| 1 | Provider batch discounts are approximately 50% | PUBLIC | OpenAI, Anthropic, Google, Fireworks, Groq official pricing pages, accessed 2026-05-12 | 14 |
| 2 | Cohere rerank uses search units, not tokens | PUBLIC | Cohere pricing page, accessed 2026-05-12 | 13 |
| 3 | Enterprise RAG data residency eliminates vector DB options | REPORTED | Operator Proof Box 7, sohailmo.ai/rag-infrastructure-pgvector/ | 13 |
| 4 | Framework orchestration was 60% of RAG latency | REPORTED | Operator Proof Box 11, sohailmo.ai/ray-production-lessons/ | 13 |
| 5 | Voice fallback rate 1% normal, 15-20% incidents | REPORTED | Operator Proof Box 12, sohailmo.ai/vllm-production-scale-lessons/ | 14 |
| 6 | Agents use ~4x chat tokens, multi-agent ~15x | REPORTED | Anthropic engineering post on multi-agent research system, 2025-06-13 | 12 |
| 7 | Token usage explained 80% of BrowseComp performance variance | REPORTED | Anthropic engineering post on multi-agent research system, 2025-06-13 | 12 |
| 8 | Claude Code built around prompt caching, prefix stability matters | PUBLIC | Anthropic engineering post, 2026-04-30 | 12 |
| 9 | OpenAI Realtime audio pricing: $32/MTok input, $64/MTok output | PUBLIC | OpenAI pricing page, accessed 2026-05-12 | 14 |
| 10 | OpenAI Whisper: $0.017/audio minute | PUBLIC | OpenAI pricing page, accessed 2026-05-12 | 14 |
| 11 | Google TTS: $20/MTok audio output, 25 tokens/sec | PUBLIC | Google Gemini pricing page, accessed 2026-05-12 | 14 |
| 12 | Together Whisper: $0.0015/min standard, $0.27/min streaming | PUBLIC | Together AI pricing page, accessed 2026-05-12 | 14 |
| 13 | Fireworks embedding pricing by parameter tier | PUBLIC | Fireworks pricing page, accessed 2026-05-12 | 14 |
| 14 | Together embedding pricing: $0.02/MTok | PUBLIC | Together AI pricing page, accessed 2026-05-12 | 14 |
| 15 | Groq: prompt caching does not stack with batch discount | PUBLIC | Groq prompt caching docs, accessed 2026-05-12 | 14 |
| 16 | Groq batch: no impact to standard rate limits | PUBLIC | Groq pricing page, accessed 2026-05-12 | 14 |
| 17 | Workload class definitions and splits | DERIVED | Standard workload classification applied to inference economics | 10 |
| 18 | Session cost model for support | DERIVED | From Part 1 LCPR, Part 2 cache break-even | 11 |
| 19 | Task lifecycle cost formula | DERIVED | From agentic economics research, report 04 | 12 |
| 20 | Fanout multiplier definition | DERIVED | From agentic economics research, report 04 | 12 |
| 21 | RAG cost decomposition | DERIVED | From Parts 1-2 concepts applied to retrieval-generation pipeline | 13 |
| 22 | Latency budget waterfall | DERIVED | Standard latency decomposition applied to voice systems | 14 |
| 23 | All worked example numbers | SYNTHETIC | Illustrative, shaped by real billing semantics | 11, 12 |
| 24 | Agentic tasks consume ~1,000x more tokens than single-turn chat | MEASURED | arXiv 2604.22750, “How Do AI Agents Spend Your Money?”, 2026-04 | 12 |
| 25 | Answer-vs-agentic inference constraint framing | OPINION | Thompson, “The Inference Shift,” Stratechery, 2026-05-11 | 10 |