Opener

The Deterministic Gate

A mid-market regional bank runs an internal answer-drafting workload for its loan officers. The model takes a loan application packet, retrieves the matching policy excerpts and the customer's account context, and drafts a one-paragraph response the officer reviews before sending. About 30,247 queries flow through it on a typical weekday. Inputs run roughly 3,200 tokens (the policy boilerplate plus the application context). Outputs run roughly 280 tokens.

The workload sits behind a three-gate validation pipeline: a PII redaction pass, a policy-rubric eval, and a compliance regex sweep that catches anything resembling unhedged numeric advice. The model is a frontier closed-API model priced at $3.00 per million input tokens and $15.00 per million output tokens. The regulatory boilerplate is stable across queries, so the team designed the prompt with the boilerplate at the top and the application context after it. Cache hits on the boilerplate prefix run around 75%.

In Q3, finance flagged that inference spend had grown 22.4% while query volume grew only 8.1%. Quality pass rate was up slightly. P95 latency was flat. The team's first instinct was to blame traffic mix.

This chapter is the autopsy. It shows what the cost-per-token view missed, why a deterministic pre-validation gate ended up cutting inference spend 14% without touching the model, and the side-finding the CFO ended up caring about more than the savings: a defensible audit trail for the bank's compliance review.

The Trace

Below are thirteen consecutive requests pulled from the answer-drafting service on a Wednesday in Q3. The numbers are anonymized; the field shapes, billing grammar, and cache semantics come from real provider documentation, and the failure modes come from production. This slice is built to make the mechanism legible, not to document a specific deployment. The headline 10x loaded-to-naive ratio it produces is unusually wide on purpose. Production answer-drafting workloads more commonly land at a 2-4x loaded ratio; the wide gap here is rhetorical, but the shape is structural.

#	Type	Input tok	Cached	Output tok	TTFT ms	E2E ms	Eval	Notes
1	First attempt	3,184	2,388	264	412	2,180	Pass	Cache hit on policy prefix
2	First attempt	3,247	0	308	932	3,841	Pass	Cache miss, new application packet
3	First attempt	3,108	2,331	282	388	2,418	Fail	Compliance regex flagged unhedged numeric phrasing
4	First attempt	3,402	0	396	1,162	4,728	Fail	TTFT exceeded 850ms SLO
5	First attempt	3,072	2,304	247	376	2,062	Pass	—
6	First attempt	3,816	0	374	1,408	4,892	Pass	Long retrieval context, cache miss
7	First attempt	3,196	2,397	271	392	2,234	Pass	—
8	First attempt	3,288	0	442	1,047	5,316	Fail	Output too long, E2E violated 5s SLO
9	Retry of #4	3,402	0	287	854	3,247	Pass	Same input, no cache (TTL expired)
10	Retry of #8	3,288	0	312	736	3,108	Pass	Constrained decoding capped output length
11	Eval grader	1,247	0	87	218	814	—	Grading requests 3 and 8
12	Repair of #3	3,847	0	318	672	3,247	Pass	Regenerated with hedged phrasing
13	First attempt	3,144	2,358	258	384	2,124	Pass	Cache hit

Thirteen requests. Nine first attempts, two retries, one eval grader call, one repair. The team expected nine requests to serve nine officer queries; the system generated thirteen. The nine first attempts shown here are representative of the daily mix. Scaled to roughly 30,247 first-attempt queries per workday with similar shapes, the daily inference bill works out to about $503.
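
The $503 figure can be reproduced from the trace itself. The sketch below is one way to do that, assuming cached input tokens are billed at 10% of the input price (a 90% cache-read discount) and that cache writes carry no surcharge. The names TRACE, request_cost, and daily_bill are illustrative and do not belong to any provider SDK.

```python
INPUT_PER_M = 3.00           # dollars per million uncached input tokens
OUTPUT_PER_M = 15.00         # dollars per million output tokens
CACHE_READ_DISCOUNT = 0.90   # assumption: cached tokens cost 10% of input price

# (request number, input tokens, cached tokens, output tokens), from the trace
TRACE = [
    (1, 3184, 2388, 264), (2, 3247, 0, 308), (3, 3108, 2331, 282),
    (4, 3402, 0, 396), (5, 3072, 2304, 247), (6, 3816, 0, 374),
    (7, 3196, 2397, 271), (8, 3288, 0, 442), (9, 3402, 0, 287),
    (10, 3288, 0, 312), (11, 1247, 0, 87), (12, 3847, 0, 318),
    (13, 3144, 2358, 258),
]

def request_cost(input_tok: int, cached_tok: int, output_tok: int) -> float:
    """Dollar cost of one request under the pricing assumptions above."""
    uncached_tok = input_tok - cached_tok
    cached_price = INPUT_PER_M * (1 - CACHE_READ_DISCOUNT)
    micro_dollars = (
        uncached_tok * INPUT_PER_M
        + cached_tok * cached_price
        + output_tok * OUTPUT_PER_M
    )
    return micro_dollars / 1_000_000

# All thirteen requests are billed, but they served only nine officer queries.
trace_cost = sum(request_cost(i, c, o) for _, i, c, o in TRACE)
FIRST_ATTEMPTS_IN_TRACE = 9
DAILY_QUERIES = 30_247

cost_per_query = trace_cost / FIRST_ATTEMPTS_IN_TRACE
daily_bill = cost_per_query * DAILY_QUERIES
```

Under those assumptions the thirteen requests cost about $0.1496, or roughly $0.0166 per officer query, which scales to about $503 per day.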

What The Token Price Missed

Cache hit rate was worse than the cost model assumed

The team designed the prompt with the regulatory boilerplate and tool definitions at the top (roughly 2,400 tokens of cache-eligible prefix). The cost model assumed an 88% hit rate based on the boilerplate's stability. In production, the measured hit rate was 74.6%. Three reasons the team did not anticipate:

The provider's cache TTL is 5 minutes per replica. Branch officers cluster their drafting work into morning and afternoon runs, so the inter-request gap during off-peak windows exceeds the TTL.
The retrieval context (matched policy excerpts, account snapshot) changes per query. The team had originally placed it between the system prompt and the user turn, breaking the prefix match for every new document set.
The provider's load balancer distributes across more replicas during traffic spikes. Cache is per-replica. Fewer queries land on the same replica within the TTL window.
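
A toy simulation makes the TTL and routing effects concrete. It assumes round-robin routing, one independent cache per replica, and a TTL that is refreshed on every read. None of these details come from a specific provider, and cache_hit_rate is a hypothetical helper.

```python
def cache_hit_rate(arrival_times_s, ttl_s, n_replicas):
    """Fraction of requests that find the shared prefix still cached.

    Assumptions (illustrative only): requests are routed round-robin,
    each replica keeps its own cache, and both a miss (write) and a
    hit (read) restart the TTL clock on that replica.
    """
    last_touch = {}  # replica index -> time the prefix was last written or read
    hits = 0
    for i, t in enumerate(arrival_times_s):
        replica = i % n_replicas
        previous = last_touch.get(replica)
        if previous is not None and t - previous <= ttl_s:
            hits += 1
        last_touch[replica] = t
    return hits / len(arrival_times_s)

# One request per minute for twelve minutes, 5-minute TTL.
arrivals = [60 * i for i in range(12)]
single_replica = cache_hit_rate(arrivals, ttl_s=300, n_replicas=1)
six_replicas = cache_hit_rate(arrivals, ttl_s=300, n_replicas=6)
```

With one replica, 11 of the 12 requests hit. Spreading the same traffic over six replicas stretches each replica's inter-request gap to six minutes, which is past the TTL, and the hit rate falls to zero.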

The pricing page advertised cache reads at 90% off uncached input. That discount matters only when cache hits actually happen. At 74.6% hit rate instead of the modeled 88%, the effective input cost is 23% higher than the spreadsheet assumed.
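
The size of that gap can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a 3,200-token input with a 2,400-token cacheable prefix, a 90% discount on cached tokens, and no cache-write surcharge; effective_input_tokens is a hypothetical helper name.

```python
def effective_input_tokens(hit_rate, total_tok=3200, prefix_tok=2400, discount=0.90):
    """Expected billable input per request, in full-price token equivalents."""
    # On a hit, only the prefix is discounted; the rest is billed in full.
    hit_cost = (total_tok - prefix_tok) + prefix_tok * (1 - discount)
    miss_cost = total_tok
    return hit_rate * hit_cost + (1 - hit_rate) * miss_cost

modeled = effective_input_tokens(0.88)    # hit rate the spreadsheet assumed
measured = effective_input_tokens(0.746)  # hit rate seen in production
increase = measured / modeled - 1
```

With these inputs the increase comes out at about 22%, close to the figure above. The exact value depends on the prefix share and on whether cache writes are billed at a premium.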

Output tokens drift up under compliance pressure

The compliance regex sweep flags any sentence that asserts a numeric claim without a hedge ("approximately," "subject to underwriter review," "see disclosure"). When the model gets the hedge wrong on a first attempt, the repair pass tends to produce longer outputs because hedged paragraphs run longer than direct ones. Requests 4, 6, and 8 in the trace show output lengths of 396, 374, and 442 tokens, all well above the 280-token average the team modeled. Output tokens are the expensive side of the bill. At $15 per million versus $3 for input, the ~44% drift in output length on the requests that hit the compliance pass erases a 90% input cache discount on roughly two-thirds of requests.

This is not a bug. The compliance regex was tuned to be sensitive, and hedged language is verbose. Prompt-and-policy interaction is not free.

Retries and repairs are invisible on the pricing page

Two of nine first attempts failed their SLO (requests 4 and 8). Both were retried. One first attempt failed the compliance gate (request 3) and was repaired with hedged phrasing. The eval grader call (request 11) scored the failed outputs against the rubric. Across the day, 14.1% of first attempts triggered a retry and 8.4% triggered a repair. These four additional requests in the trace (retries, eval, repair) are not in the pricing comparison. They are in the bill. Each one consumed input tokens, output tokens, compute time, and latency budget. The recovery requests added 27.4% to the daily inference total.

Quality failures have downstream costs

Request 3 failed the compliance gate; its draft asserted an APR without the required hedging language. The repair call (request 12) fixed the phrasing, but the loan officer waited an extra 3.2 seconds for the corrected draft. When the repair also fails, the workflow escalates to a senior compliance reviewer.

About 5.2% of queries escalate to a human compliance reviewer. At roughly $2.40 in reviewer time per escalation, 1,573 daily escalations cost about $3,775, roughly 7.5x the $503 daily inference bill. The pricing page does not have a column for human escalation cost per failed answer.

The SLO gate changes the denominator

The team's SLO is: TTFT under 850ms, E2E under 5 seconds, eval score above 0.85, and a compliance regex pass. Of the nine first attempts in this trace, six cleared all four gates on the first try. Three failed at least one gate; two were recovered by retry and one by repair.
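
Expressed as code, the four-part SLO is a simple predicate. The sketch below is an assumption about how such a check might be written; the Attempt record, its field names, and failed_gates are hypothetical, and the thresholds are the ones stated above.

```python
from dataclasses import dataclass

TTFT_SLO_MS = 850       # time to first token must be under this
E2E_SLO_MS = 5_000      # end-to-end latency must be under this
EVAL_MIN_SCORE = 0.85   # rubric score must be above this

@dataclass
class Attempt:
    ttft_ms: float
    e2e_ms: float
    eval_score: float
    compliance_ok: bool  # True when the compliance regex sweep found nothing

def failed_gates(attempt: Attempt) -> list[str]:
    """Return the names of every gate the attempt failed (empty if accepted)."""
    failures = []
    if attempt.ttft_ms >= TTFT_SLO_MS:
        failures.append("ttft")
    if attempt.e2e_ms >= E2E_SLO_MS:
        failures.append("e2e")
    if attempt.eval_score <= EVAL_MIN_SCORE:
        failures.append("eval")
    if not attempt.compliance_ok:
        failures.append("compliance")
    return failures

def is_accepted(attempt: Attempt) -> bool:
    """An answer counts toward the denominator only if it clears all four gates."""
    return not failed_gates(attempt)
```

Returning the list of failed gates, rather than a bare boolean, keeps the reason for each rejection available for routing to a retry or a repair.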

The denominator is not "requests sent to the provider." It is "accepted answers that passed quality, latency, and compliance gates." Nine first-attempt queries produced nine accepted answers, but six were accepted on the first try and three needed recovery. The inference cost per accepted answer is higher than the cost per request, and the gap is exactly what the pricing comparison missed.

The Naive Calculation

The procurement spreadsheet did this: $503 daily inference / 30,247 queries = $0.0166 per query. The calculation treats every request as equally valuable and ignores retries, repairs, eval grader calls, human escalation, and operational overhead.

The Correct Calculation

Cost component	Daily amount
Trace-derived inference cost	$503
Invoice delta (rounding, timing, batch pricing)	$13
Eval grader cost (LLM-as-judge calls scoring failed first-attempts)	$18
Human compliance escalation (1,573 cases × $2.40)	$3,775
Operations and audit overhead allocation	$978
Total loaded cost	$5,287

(Provider invoice for the same period: $516. The $13 delta reconciles trace-derived cost to invoice. See Part 1, Chapter 3 for the reconciliation method.)

Denominator	Value
Total queries attempted	30,247
First-attempt pass	23,341
Recovered by retry or repair	5,333
Accepted answers	28,674
Escalated to human compliance reviewer	1,573

Loaded cost per accepted result: $5,287 / 28,674 = $0.184. That is roughly 10x the naive per-query inference cost. Inference is $503 per day. The denominator shrank from 30,247 to 28,674, the numerator picked up retries, repairs, eval, compliance escalation, and operations overhead, and the non-inference costs ended up dominating the loaded number.
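
The same arithmetic, written out so each term can be swapped for your own numbers. The figures are copied from the two tables above; the variable names are illustrative.

```python
# Daily cost components in dollars, from the cost table.
components = {
    "trace_derived_inference": 503,
    "invoice_delta": 13,
    "eval_grader": 18,
    "human_compliance_escalation": 3775,
    "operations_and_audit_overhead": 978,
}

# Daily counts, from the denominator table.
queries_attempted = 30_247
first_attempt_pass = 23_341
recovered = 5_333
escalated = 1_573

accepted = first_attempt_pass + recovered
# Every query is either accepted or escalated to a human reviewer.
assert accepted + escalated == queries_attempted

loaded_cost = sum(components.values())
naive_cost_per_query = components["trace_derived_inference"] / queries_attempted
lcpr = loaded_cost / accepted            # loaded cost per accepted result
loaded_to_naive = lcpr / naive_cost_per_query
```

Run as written, this gives a loaded cost of $5,287, a naive cost of about $0.0166 per query, and a loaded cost per accepted result of about $0.184. The unrounded ratio is close to 11, which the text rounds to roughly 10x.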

The trace and totals above describe a single day in early Q3, three weeks after the deterministic pre-validation gate shipped. Before the gate, the same workload ran at a loaded cost per accepted result (LCPR) of $0.213, about 16% higher than the post-gate $0.184, driven by a higher first-attempt compliance-fail rate. The numbers above are the post-gate steady state; the gap to the pricing page is what they look like even after a successful intervention.

That is the deterministic-gate finding in miniature. The inference bill is $503. The loaded cost of producing accepted work is $5,287. The gap, made of retries, repairs, quality gates, human escalation, and operational overhead, is the part that does not show up on the pricing page. The token-price comparison captured about 10% of the actual cost structure.

The gate itself did one thing. It ran the compliance regex on the prompt's input context, before the model call, and rewrote any obviously unhedged numeric assertion in the user-supplied draft language. That single change pulled the first-attempt compliance-fail rate from 14.1% down to 7.8%, which cut inference cost 14% and produced the LCPR drop from $0.213 to $0.184 shown above. The CFO signed off, then asked a different question: "Can you generate a regulatory attestation showing every query passed the same pre-validation step?" The audit trail became the thing the bank cared about. The cost savings funded the project; the deterministic gate justified it to the compliance committee.
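
A deterministic gate of this kind can be small. The sketch below is an illustration only: the numeric pattern, the sentence splitter, and the rewrite rule (appending a hedge) are assumptions, and the hedge list reuses the three phrases quoted earlier in the chapter rather than the bank's actual rubric.

```python
import re

# Hedge phrases quoted in the chapter; a production rubric would be longer.
HEDGES = ("approximately", "subject to underwriter review", "see disclosure")

# Hypothetical pattern: a number, optionally with $, thousands commas,
# a decimal part, or a trailing percent sign.
NUMERIC = re.compile(r"\$?\d[\d,]*(?:\.\d+)?%?")

# Split after sentence-ending punctuation that is followed by whitespace,
# so decimals such as "6.5" are not split.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")

def is_unhedged(sentence: str) -> bool:
    """True when the sentence contains a number and none of the hedges."""
    lowered = sentence.lower()
    return bool(NUMERIC.search(sentence)) and not any(h in lowered for h in HEDGES)

def pre_validate(text: str) -> tuple[str, int]:
    """Rewrite unhedged numeric sentences before the model call.

    Returns the rewritten text and the number of sentences changed.
    No model is involved, so the same input always yields the same output.
    """
    rewritten = []
    fixes = 0
    for sentence in SENTENCE_SPLIT.split(text.strip()):
        if is_unhedged(sentence):
            body = sentence.rstrip(".!?")
            sentence = f"{body}, subject to underwriter review."
            fixes += 1
        rewritten.append(sentence)
    return " ".join(rewritten), fixes

draft = "The APR is 6.5% for this product. Please review the attached form."
checked, changed = pre_validate(draft)
```

Because the pass is a pure function of its input, running it a second time on its own output changes nothing. That repeatability is the property an attestation can rest on: every query goes through the same step and the step can be replayed.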

Why This Happens

The bank's procurement spreadsheet wasn't unusually careless. It was the standard tool, used the standard way. The same pattern shows up whenever a team compares inference options using token price alone:

Token price captures only the inference provider bill. It misses eval cost, retry cost, repair cost, human escalation, operational overhead, and engineering time. On workloads where quality is variable — compliance answer-drafting, contract extraction, anything with a regulatory checkpoint — the non-token costs can dominate.

Dividing by total requests, total tokens, or total tickets conflates attempted work with accepted work. If 18% of attempts fail quality, latency, or compliance gates, the cost per accepted unit is 22% higher than the cost per attempt. The bank's loaded cost climbed not because each query got more expensive but because the gate threw a higher share of them back.
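
The 18%-to-22% step follows from dividing by the surviving fraction. A minimal sketch, assuming every attempt costs the same and failed attempts are discarded rather than recovered; cost_per_accepted is a hypothetical helper.

```python
def cost_per_accepted(cost_per_attempt: float, fail_rate: float) -> float:
    """Cost per accepted unit when a share of attempts is thrown away."""
    if not 0 <= fail_rate < 1:
        raise ValueError("fail_rate must be in [0, 1)")
    # Only (1 - fail_rate) of the attempts produce accepted work,
    # so the full spend is spread over that smaller count.
    return cost_per_attempt / (1 - fail_rate)

uplift = cost_per_accepted(1.0, 0.18) - 1.0
```

Here uplift is about 0.22. The relationship is nonlinear: a 50% fail rate doubles the cost per accepted unit.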

Caching is a third trap. Prompt-cache economics depend on prefix stability, TTL, inter-request timing, and routing topology. A cache discount on the pricing page is not a cache discount in production unless the cache hits — and the cache rarely hits when the prefix carries a customer-specific ID, a timestamp, or a regulatory clause that varies across requests.

Output length compounds it. Different models produce different output distributions for the same prompt. Output tokens are typically 2-6× more expensive than input tokens. A model that produces 40% more output can cost more despite a lower per-token price — and longer outputs also raise the chance of a compliance-fail somewhere in the answer, which raises the retry rate, which raises the bill again.
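
A worked comparison shows how the crossover happens. The baseline uses this chapter's prices and token counts; the second model and its prices are invented for illustration and do not describe any real provider. Caching is ignored to keep the arithmetic visible.

```python
def cost_per_request(input_tok, output_tok, input_per_m, output_per_m):
    """Dollar cost of one uncached request at the given per-million prices."""
    return (input_tok * input_per_m + output_tok * output_per_m) / 1_000_000

INPUT_TOK = 3200
BASELINE_OUTPUT_TOK = 280
VERBOSE_OUTPUT_TOK = round(BASELINE_OUTPUT_TOK * 1.4)  # 40% more output

# Baseline: the chapter's model at $3.00 in / $15.00 out.
baseline = cost_per_request(INPUT_TOK, BASELINE_OUTPUT_TOK, 3.00, 15.00)

# Hypothetical alternative: lower on both price lines, but more verbose.
alternative = cost_per_request(INPUT_TOK, VERBOSE_OUTPUT_TOK, 2.80, 14.00)

alternative_is_more_expensive = alternative > baseline
```

With these illustrative prices the alternative costs about $0.0144 per request against $0.0138 for the baseline, even though both of its per-token prices are lower.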

The last trap is quality itself. A model with a lower first-attempt pass rate generates more retries, repairs, and human escalations. Each retry is an inference call that does not appear in the pricing comparison. Each escalation is a cost that does not appear on the inference bill at all.

The Book’s Promise

This manual teaches one skill: how to measure, model, and operate production inference decisions so you can choose the cheapest reliable architecture that still meets quality, latency, reliability, and customer requirements.

The economic unit is not the token. It is not the request. It is not the GPU-hour. It is the accepted work unit: one successful task completion that passes your quality gate, meets your latency SLO, complies with your data constraints, and is worth paying for.

Every chapter in this book follows the same rhythm:

A wrong conclusion that a competent team might reach from a naive view.
The mechanism that makes the conclusion wrong.
A trace or worked example that shows the gap.
A formula with units.
A decision rule with caveats.
A calculator view or artifact the reader can use in a meeting.
The conditions under which the rule breaks.

If you already have clean traces, good evals, and invoice reconciliation, start at Part 2 (Serving Physics) or Part 3 (Workload Economics). If you do not yet have these, start with Part 1 (The Economic Unit) and build them.

The next chapter defines the loaded cost per result—LCPR—and explains why every number in this book is normalized to accepted output, not raw tokens.

The regional bank in this chapter is anonymized; the workload shape, failure mode, and resolution are drawn from production answer-drafting deployments. The numeric trace and per-query rates use real provider billing semantics and real cache implementation behavior. No numbers should be attributed to any specific employer, customer, or deployment.