Parts 1 through 4 chose a route. Part 5 keeps it honest. The migration happened, the endpoint is live, traffic is flowing. The decision can drift in three ways: it can stop being correct, you can fail to detect that it stopped, or you can detect it but be unable to explain the drift to the engineer, product manager, and finance partner who all have to act on it.
Operating inference is measurement, not faith. The workload shape shifts, the provider's pricing changes, the model's post-update behavior moves, and the quality floor drifts in the background. The eval suite covers last quarter's failure modes; this quarter has new ones. Finance sees an invoice moving and asks why, and the engineering team, which measures tokens, cannot answer in dollars.
Part 5 supplies the operating cadence: measurement, evaluation, honest benchmarking, trace-to-invoice reconciliation, incident response, review discipline, and forecasting.
Chapter 23: Baselines And Evals
The field problem
Two weeks after migrating a support chat workload to a new model, the product manager asks: “Is it better?” The engineer shows a chart of p99 latency. The product manager asks about quality. The engineer shows the eval pass rate. The finance partner asks about cost per resolution. The engineer shows the inference bill.
Nobody can answer the question because the team did not establish a baseline before the migration. They have measurements on the new route but nothing rigorous on the old route to compare against. Every comparison is against memory: “it felt faster,” “the old model was better at edge cases,” “the bill was about $30K.”
The mechanism
A baseline is a measured snapshot of LCPR, quality, latency, and operational health for a workload at a specific point in time, on a specific route, under specific traffic. It is not a benchmark. It is a photograph.
The minimum viable baseline:
| Dimension | Metric | How to capture |
|---|---|---|
| Cost | LCPR per accepted output | Trace-derived, joined to invoice |
| Quality | Eval pass rate on production-representative set | Run evals against production traffic sample |
| Latency | p50, p95, p99 TTFT and TPOT | From serving traces |
| Reliability | Error rate, retry rate, timeout rate | From serving traces |
| Cache | Cache hit rate, cache-eligible prefix share | From provider usage fields |
| Acceptance | Quality gate pass rate, human escalation rate | From eval and support systems |
| Volume | Requests per day, accepted outputs per day | From traces |
The naive answer
“We know how the system performs. We’ve been running it for months.”
Knowing is not measuring. Informal knowledge decays, confabulates, and cannot be presented to a finance partner or used to evaluate a migration. The baseline must be written down, dated, and tied to the traffic shape and model version that produced it.
The better model
Capture the baseline before any change. The change can be a migration, a model update, a prompt revision, a cache configuration change, or a traffic shape shift. Without a pre-change baseline, every post-change measurement is against an imagined prior state.
The baseline should include the traffic shape that produced it. A baseline captured during Black Friday traffic is not comparable to a baseline during normal traffic. Record the traffic volume, prompt length distribution, output length distribution, and cache hit rate alongside the cost and quality metrics.
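One way to make "written down, dated, and tied to the traffic shape" concrete is a small record type. The sketch below is illustrative only: the field names, the `compare` helper, the 25% volume tolerance, and all the sample values are assumptions, not part of the book's calculator.

```python
from dataclasses import dataclass, replace
from datetime import date


@dataclass(frozen=True)
class BaselineSnapshot:
    """Dated, versioned record of one workload on one route (hypothetical schema)."""
    captured_on: date
    workload: str
    route: str
    model_version: str
    prompt_version: str
    eval_suite_version: str
    # Traffic shape that produced the metrics below.
    requests_per_day: int
    prompt_tokens_p50: int
    output_tokens_p50: int
    cache_hit_rate: float
    # Headline metrics.
    lcpr_usd: float
    eval_pass_rate: float
    ttft_p95_ms: float
    note: str = ""  # e.g. "captured during a seasonal peak"


def compare(before, after, max_volume_shift=0.25):
    """Return metric deltas plus warnings about non-comparable conditions."""
    warnings = []
    if before.eval_suite_version != after.eval_suite_version:
        warnings.append("eval suite changed between snapshots")
    volume_shift = after.requests_per_day / before.requests_per_day - 1
    if abs(volume_shift) > max_volume_shift:
        warnings.append("traffic volume differs by more than the tolerance")
    return {
        "lcpr_change": after.lcpr_usd / before.lcpr_usd - 1,
        "pass_rate_change_pp": (after.eval_pass_rate - before.eval_pass_rate) * 100,
        "ttft_p95_change_ms": after.ttft_p95_ms - before.ttft_p95_ms,
        "warnings": warnings,
    }


# Hypothetical pre-migration and post-migration snapshots.
before = BaselineSnapshot(
    captured_on=date(2026, 3, 1), workload="support-chat", route="provider-a",
    model_version="m-1", prompt_version="p-12", eval_suite_version="v7",
    requests_per_day=10_000, prompt_tokens_p50=1_800, output_tokens_p50=220,
    cache_hit_rate=0.62, lcpr_usd=0.170, eval_pass_rate=0.91, ttft_p95_ms=420.0,
)
after = replace(
    before, captured_on=date(2026, 4, 1), route="provider-b", model_version="m-2",
    eval_suite_version="v8", requests_per_day=10_400,
    lcpr_usd=0.153, eval_pass_rate=0.90, ttft_p95_ms=380.0,
)
result = compare(before, after)
```

In this made-up comparison the cost delta looks favorable, but the warning list flags that the eval suite changed between snapshots, so the quality delta should not be trusted until both sides are graded with the same suite.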
Evals as an economic instrument
The baseline depends on the eval to determine its denominator. An eval set built during initial development tends to over-represent the failure modes the team imagined at design time and under-represent the failure modes production traffic actually produces. A typical development-era eval draws roughly three-quarters of its cases from design-imagined scenarios, a smaller fraction from incident postmortems, and the remainder from synthetic adversarial cases. The top production failure mode by frequency, often a tool-call output format the original prompt assumed but a newer model emits differently, appears in only a handful of eval cases. The eval pass rate stays comfortably high while the production complaint rate tells a different story. The fix is not to recalibrate the eval; it is to redistribute eval mass to match production failure-mode frequency. Rebuilt evals that weight cases by production frequency typically drop the headline pass rate by ten to twenty points, which is what an honest denominator looks like.
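The effect of reweighting can be sketched in a few lines. The bucket names, case counts, and production shares below are hypothetical, chosen only to mirror the shape described above (a development-era eval dominated by design-imagined cases, with the top production failure mode barely represented).

```python
def make_cases(mode, n, n_pass):
    """Build n eval cases for one failure-mode bucket, n_pass of which pass."""
    return [{"mode": mode, "passed": i < n_pass} for i in range(n)]


def headline_pass_rate(cases):
    """Unweighted pass rate: every eval case counts once."""
    return sum(c["passed"] for c in cases) / len(cases)


def production_weighted_pass_rate(cases, production_share):
    """Weight each bucket's pass rate by its share of production failures."""
    total = 0.0
    for mode, share in production_share.items():
        bucket = [c for c in cases if c["mode"] == mode]
        if not bucket:
            raise ValueError(f"no eval cases cover production mode {mode!r}")
        total += share * sum(c["passed"] for c in bucket) / len(bucket)
    return total


# Hypothetical development-era eval set: 100 cases, 91 passing.
cases = (
    make_cases("design_imagined", 75, 72)
    + make_cases("postmortem", 15, 13)
    + make_cases("adversarial", 6, 5)
    + make_cases("tool_call_format", 4, 1)
)

# Hypothetical share of production failure-mode frequency (sums to 1.0).
production_share = {
    "design_imagined": 0.50,
    "postmortem": 0.15,
    "adversarial": 0.05,
    "tool_call_format": 0.30,
}

headline = headline_pass_rate(cases)
weighted = production_weighted_pass_rate(cases, production_share)
```

With these invented numbers the headline rate is 91% and the production-weighted rate is about 73%, a drop of roughly eighteen points. The helper raises an error when a production failure mode has no eval cases at all, which is itself a coverage finding.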
Evals are not a quality score. They are a measurement instrument. For economics, the eval serves two functions:
1. Denominator control. The accepted-output count in LCPR depends on the quality gate. If the eval is weak, the denominator is inflated. The system appears cheaper because it “accepts” outputs that should be rejected.
2. Cost-quality tradeoff measurement. When comparing routes or models, the eval determines whether a cheaper model meets the quality floor.
A strong eval set has four properties: coverage (represents current production traffic), adversarial depth (includes known hard cases), grading accuracy (agrees with expert judgment), and refresh cadence (updated from production failures).
Eval economics form a three-layer stack:
Layer 1: Deterministic checks. Schema validation, format compliance, policy filters. Cheap. Fast. Run on 100% of traffic.
Layer 2: Model-graded evaluation. An LLM grades the output against a rubric. Run on a sample or on all traffic. Requires human calibration.
Layer 3: Human review. Domain experts review outputs. Expensive. Essential for calibrating Layers 1 and 2 and discovering new failure modes.
The cost of each layer is part of the LCPR numerator. The quality gate from evals determines the denominator.
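A minimal sketch of that relationship, assuming hypothetical monthly figures: eval spend for all three layers sits in the numerator, while the strictness of the gate sets the denominator. Other loaded costs are passed in as a single `other_usd` placeholder, and the function name `lcpr` is illustrative.

```python
def lcpr(inference_usd, eval_layer_usd, accepted_outputs, other_usd=0.0):
    """Loaded cost per accepted output; eval spend joins the numerator."""
    if accepted_outputs <= 0:
        raise ValueError("no accepted outputs in the period")
    numerator = inference_usd + sum(eval_layer_usd.values()) + other_usd
    return numerator / accepted_outputs


# Hypothetical month: 100,000 requests, $5,000 of inference.
requests = 100_000
inference_usd = 5_000.0
eval_layer_usd = {
    "deterministic_checks": 20.0,   # Layer 1, run on 100% of traffic
    "model_graded": 400.0,          # Layer 2, run on a sample
    "human_review": 900.0,          # Layer 3, calibration sample
}

# Same spend, two gates: a weak gate accepts 97%, a calibrated one accepts 85%.
weak_gate = lcpr(inference_usd, eval_layer_usd, int(requests * 0.97))
strict_gate = lcpr(inference_usd, eval_layer_usd, int(requests * 0.85))
```

The spend is identical in both calls. Only the denominator moves, so the weak gate reports a lower LCPR without the system having become any cheaper.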
Before any route change, model update, or significant prompt revision: capture a baseline. After the change, capture the same metrics and compare. Treat eval coverage as an operational metric. Review it monthly.
What to measure
Baseline metrics: LCPR, quality pass rate, latency percentiles, cache hit rate, volume
Model version, prompt version, eval suite version, traffic shape at baseline time
Eval coverage: share of production failure modes in the eval set
Grader agreement: model-grader vs human-expert
Eval cost per workload: grader calls, model tokens, human review minutes
False positive and false negative rates of the quality gate
Baselines break when captured during an atypical period without noting the anomaly, or when the eval suite changes between baseline and comparison. Evals become theater when the team optimizes the pass rate instead of quality. Regularly adding production cases and rotating adversarial cases resists overfitting.
Calculator hook
The baseline view stores a dated snapshot of all cost components. The comparison view shows delta between baseline and current state. The eval cost layer: deterministic check cost, model-grader cost, human review cost per sampled request.
Chapter 24: Benchmarks And Goodput
The field problem
“Model X achieves 2,500 tokens/sec on A100.” The published benchmark looks decisive. In production, the same model on the same hardware delivers 600 tokens/sec. The support ticket comes back with an explanation: the benchmark used batch 256, synthetic 128-token prompts, uniform 50-token outputs, no quality gate, and closed-loop arrivals. Production runs batch 8-16, variable prompts (500-8,000 tokens), variable outputs (50-2,000 tokens), a quality gate that rejects 15% of outputs, and Poisson arrivals. The benchmark result is not wrong. It is irrelevant.
The mechanism
Benchmarks measure performance under controlled conditions. The conditions determine the result. The gap between benchmark conditions and production conditions is the benchmark-to-production transfer gap.
Common benchmark conditions that do not transfer:
| Benchmark condition | Production reality | Effect on gap |
|---|---|---|
| Fixed batch size (256) | Dynamic batch (4-32) | Throughput 3-8x lower |
| Uniform short prompts | Variable-length prompts | TTFT variance increases |
| Uniform short outputs | Variable-length outputs | Cost per request varies widely |
| No quality gate | 10-20% rejection rate | Goodput is lower than throughput |
| Closed-loop arrivals | Poisson arrivals | Queue effects at high load |
| No prefix caching | Mixed cache hit rates | Cost per input token varies |
| No concurrent workloads | Shared infrastructure | Interference effects |
| No SLO constraints | TTFT < 500ms, TPOT < 50ms | Cannot operate at peak batch |
The naive answer
“The benchmark says it’s faster. Let’s switch.”
The benchmark says it is faster under benchmark conditions. The correct question: is it faster, cheaper, and good enough under my conditions? That requires running the benchmark with your traffic shape, your SLO constraints, and your quality gate.
The better model
The goodput frontier test is the book's answer to benchmark hygiene. It replaces single-point throughput numbers with a multi-dimensional evaluation:
1. Define your SLO. TTFT threshold, TPOT threshold, E2E threshold, quality pass rate threshold. These are inputs, not outputs.
2. Run with your traffic. Use your prompt/output length distribution. Use Poisson arrivals at your expected request rate. Include your system prompt, tools, and retrieval context.
3. Sweep the load. Increase request rate until goodput stops improving. The point where goodput plateaus or drops is your capacity ceiling under SLO.
4. Measure goodput, not throughput. Count only requests that meet all SLO thresholds and pass the quality gate. Report cost per accepted result at the operating point, not cost per total request at peak throughput.
5. Compare routes. Run the same test on each candidate route. The route with the lowest cost per accepted result at your operating point wins. Not the route with the highest throughput or the lowest token price.
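Steps 3 and 4 can be sketched as a small post-processing pass over load-test results. This is a minimal illustration, not a load generator: the record fields, the end-to-end threshold, the per-request costs, and the sweep values are all hypothetical, and the ceiling is taken simply as the offered rate with the highest measured goodput.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    ttft_ms: float
    tpot_ms: float
    e2e_ms: float
    passed_quality: bool
    cost_usd: float


@dataclass
class Slo:
    ttft_ms: float
    tpot_ms: float
    e2e_ms: float


def is_good(r, slo):
    """A request counts toward goodput only if it meets every SLO and the gate."""
    return (r.ttft_ms <= slo.ttft_ms and r.tpot_ms <= slo.tpot_ms
            and r.e2e_ms <= slo.e2e_ms and r.passed_quality)


def goodput_point(results, slo, duration_s):
    """Summarize one load level: throughput, goodput, cost per accepted result."""
    good = sum(is_good(r, slo) for r in results)
    total_cost = sum(r.cost_usd for r in results)  # rejected requests still cost money
    return {
        "throughput_rps": len(results) / duration_s,
        "goodput_rps": good / duration_s,
        "cost_per_accepted": total_cost / good if good else float("inf"),
    }


def capacity_ceiling(sweep):
    """sweep: (offered_rps, goodput_rps) pairs. Lowest rate wins on ties."""
    return max(sweep, key=lambda p: (p[1], -p[0]))[0]


slo = Slo(ttft_ms=500, tpot_ms=50, e2e_ms=5_000)
results = [
    RequestResult(300, 40, 2_000, True, 0.01),   # good
    RequestResult(700, 40, 2_400, True, 0.01),   # misses TTFT
    RequestResult(300, 40, 2_000, False, 0.01),  # fails quality gate
    RequestResult(350, 45, 2_100, True, 0.01),   # good
]
point = goodput_point(results, slo, duration_s=2.0)

# Hypothetical sweep: goodput plateaus, then drops as queues build.
sweep = [(5, 4.6), (10, 9.1), (15, 12.8), (20, 12.9), (25, 10.2)]
ceiling = capacity_ceiling(sweep)
```

In the toy data, throughput is twice the goodput, and the cost per accepted result is twice the cost per request, because half the requests were paid for but not accepted.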
Never compare routes using vendor benchmarks alone. Run a goodput-bounded load test with your traffic shape, SLO, and quality gate. The results will differ from the vendor numbers; that difference is the benchmark-to-production transfer gap, and it determines whether the vendor result is useful or misleading.
What to measure
Goodput (accepted results/sec) at your target request rate
Cost per accepted result at the operating point
The request rate where goodput stops improving (capacity ceiling)
Benchmark-to-production transfer gap (vendor claimed vs measured)
Quality pass rate under load (does quality degrade at high throughput?)
Goodput testing requires a quality gate, which requires evals. If the eval suite is weak (Chapter 23), the goodput number is inflated. Run goodput tests only with a calibrated eval suite.
Goodput testing also requires representative traffic. If the test traffic does not match production (different prompt lengths, different output lengths, different tool call patterns), the results will not predict production performance.
Calculator hook
The calculator view takes per-request latency, quality, and cost data from a load test. It outputs: goodput curve by request rate, cost per accepted result curve, capacity ceiling under SLO, and comparison across routes.
Chapter 25: Observability And Trace-to-Loaded-Cost
The field problem
The dashboards show p99 latency, error rate, GPU utilization, and request volume. All the standard signals. The finance partner asks: “Why did the inference bill increase 22% this month?” Latency: stable. Errors: stable. Utilization: stable. Volume: up 8%. An 8% volume increase does not explain a 22% cost increase.
The missing signal: average output length increased 35% because a prompt change encouraged more detailed answers. Output tokens typically cost 2-6x input tokens on major API providers (see pricing snapshot 2026-05-12). The volume increase plus the output length increase, compounded by lower cache hit rate on the new prompt, explains the bill movement. But the observability system did not track output length distribution or cache hit rate.
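The arithmetic behind this kind of bill movement can be sketched by applying the three changes one at a time. Everything numeric below is an assumption made for illustration: the per-million-token prices (output at 5x input, cached input at a tenth of the input price), the token counts, and the cache hit rates. Cache hit rate is treated here as the share of input tokens served from cache.

```python
def cost_per_request(input_tokens, output_tokens, cache_hit_rate,
                     input_price=3.00, cached_price=0.30, output_price=15.00):
    """USD per request; prices are hypothetical USD per million tokens."""
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    return (uncached * input_price + cached * cached_price
            + output_tokens * output_price) / 1_000_000


def monthly_cost(requests, **shape):
    return requests * cost_per_request(**shape)


last_month = dict(input_tokens=6_000, output_tokens=200, cache_hit_rate=0.60)
base = monthly_cost(1_000_000, **last_month)

# Apply the changes cumulatively to see how each one moves the bill.
volume_only = monthly_cost(1_080_000, **last_month)
plus_output = monthly_cost(1_080_000, **{**last_month, "output_tokens": 270})
plus_cache = monthly_cost(
    1_080_000, **{**last_month, "output_tokens": 270, "cache_hit_rate": 0.575}
)

steps = {
    "volume +8%": volume_only / base - 1,
    "+ output length +35%": plus_output / base - 1,
    "+ cache hit rate 60% -> 57.5%": plus_cache / base - 1,
}
```

Under these assumptions the cumulative increase is 8% from volume alone, about 18% once the longer outputs are included, and about 22% once the cache regression is added. A different token mix or price ratio would shift the split, which is why the per-request fields have to be measured rather than assumed.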
The mechanism
Inference observability for economics requires different signals than inference observability for reliability. Reliability asks: “Is the system up and responsive?” Economics asks: “Why did the cost change, and is the spend producing accepted work?”
The economic signal set:
| Signal | Why it matters for economics |
|---|---|
| Input tokens per request (p50, p95) | Input cost and prefill time driver |
| Output tokens per request (p50, p95) | Output cost (typically 2-6x input) and decode time driver |
| Cache hit rate | Input cost lever; regression = cost spike |
| Cache-eligible prefix share | Upper bound on cache savings |
| Quality gate pass rate | Denominator in LCPR |
| Retry rate by cause | Numerator inflation (retries cost money but don’t produce output) |
| Human escalation rate | Often the largest LCPR component |
| Model version | Model changes can change cost, quality, and latency simultaneously |
| Prompt version | Prompt changes affect output length, cache hit rate, and quality |
| Provider/route | Multi-source routing changes cost profile |
The naive answer
“We have Datadog / Grafana / CloudWatch. Observability is covered.”
Standard APM covers latency and errors. It does not cover output length distribution, cache hit rate, quality gate pass rate, or cost per accepted output. These are inference-specific economic signals that require inference-specific instrumentation.
The better model
Layer economic signals on top of standard observability:
Layer 1: Per-request fields. Every request trace should include: input tokens, output tokens, cached input tokens, cache creation tokens, model version, prompt version, quality gate result, cost estimate. These come from the provider response and the eval system.
Layer 2: Aggregated dashboards. Daily and weekly views of: LCPR by workload, cost breakdown by component (inference, cache, quality, human, ops), output length distribution, cache hit rate trend, quality pass rate trend.
Layer 3: Anomaly detection. Alert when: LCPR increases by more than 10% week-over-week, cache hit rate drops by more than 15 percentage points, output length p95 increases by more than 25%, quality pass rate drops below the quality floor, or human escalation rate spikes.
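The Layer 3 thresholds translate directly into a comparison of two weekly aggregates. The sketch below is a minimal illustration: the dictionary keys and alert names are hypothetical, and because the text gives no number for an escalation "spike", the 1.5x ratio used here is an assumption.

```python
def economic_alerts(prev, curr, quality_floor, escalation_spike_ratio=1.5):
    """Compare two weekly aggregates and return the names of triggered alerts."""
    alerts = []
    if curr["lcpr"] > prev["lcpr"] * 1.10:
        alerts.append("lcpr_up_more_than_10pct")
    if prev["cache_hit_rate"] - curr["cache_hit_rate"] > 0.15:
        alerts.append("cache_hit_rate_down_more_than_15pp")
    if curr["output_tokens_p95"] > prev["output_tokens_p95"] * 1.25:
        alerts.append("output_p95_up_more_than_25pct")
    if curr["quality_pass_rate"] < quality_floor:
        alerts.append("quality_below_floor")
    # Assumed threshold: the source does not define what counts as a spike.
    if curr["escalation_rate"] > prev["escalation_rate"] * escalation_spike_ratio:
        alerts.append("human_escalation_spike")
    return alerts


# Hypothetical weekly aggregates for one workload.
prev_week = {"lcpr": 0.168, "cache_hit_rate": 0.62, "output_tokens_p95": 900,
             "quality_pass_rate": 0.93, "escalation_rate": 0.04}
curr_week = {"lcpr": 0.190, "cache_hit_rate": 0.44, "output_tokens_p95": 1_000,
             "quality_pass_rate": 0.92, "escalation_rate": 0.05}

triggered = economic_alerts(prev_week, curr_week, quality_floor=0.90)
```

Here the LCPR and cache alerts fire together, which points the investigation at a cache regression before anyone opens the invoice.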
Trace-to-loaded-cost reconciliation
Observability answers "why did cost change?" Trace-to-loaded-cost reconciliation answers the harder question: "what is the true loaded cost, and does it match what the vendor billed?"
Trace-to-loaded-cost reconciliation joins four data sources. The trace provides per-request cost estimates summed across the period. The invoice shows what the provider actually charged; the delta reveals timing skew, credits, dropped traces, or rate-card drift. The eval supplies the quality gate that determines the accepted denominator. The contract pins the unit price the vendor agreed to charge.
Reconciliation runs in five steps: compute trace-to-invoice delta and investigate anything beyond 5%, add non-inference costs (eval grading, human escalation, ops allocation), calculate the accepted-output denominator, compute LCPR, and roll the loaded cost up against contracted unit rates.
Worked example: mid-cap pharmacy, clinical-prior-auth drafting
A mid-cap pharmacy chain runs a clinical-prior-auth answer-drafting workload. The LLM drafts the medical-necessity narrative that gets attached to payer prior-authorization submissions. Volume runs at roughly 8,431 queries per day. Trace-derived inference cost came to $14,127 for May against a provider invoice of $15,341; the loaded cost when the ops lead reconciled came in at $19,283. A leading driver of the gap was schema drift on payer-specific JSON contracts: each major payer demands a slightly different field shape on the prior-auth response, and three of the seven payers had quietly updated their schemas inside a quarter without versioned notice.
| Component | Value | Source |
|---|---|---|
| Trace-derived inference cost | $14,127 | Sum of per-request token cost from serving traces |
| Provider invoice | $15,341 | Vendor billing system, May statement |
| Delta vs. trace | $1,214 (8.6%, above 5% threshold) | Investigated; see contract finding below |
| Eval grader cost | $847 | Model-graded eval over 12% sample, grader model invoice |
| Schema-repair retries (payer-specific JSON) | $1,386 | Trace tag: 14.3% of requests required a repair pass on schema-drift days |
| Human escalation (pharmacist sign-off on rejected drafts) | $1,162 | EMR audit log; 581 cases at $2/case fully-loaded |
| Ops overhead allocation | $1,761 | Finance allocation rule, prorated by workload share |
| Total loaded cost | $19,283 | Reconciled across trace, invoice, eval, and contract |
| Total requests (May) | 261,361 | Serving trace count |
| Accepted outputs (eval-pass + pharmacist-sign-off) | 218,749 | Joined eval-pass and human-review systems on request_id |
| LCPR | $0.0882 per accepted prior-auth draft | Total loaded cost ÷ accepted outputs |
The 8.6% trace-to-invoice delta was the thread that unraveled everything. The ops lead pulled the contract rate card and matched it against the invoice line items. Two of the line items were billed at the wrong rate: a tier the pharmacy had never been moved off of after a contract amendment three months earlier. The overcharge worked out to $1,214 per month and had run for two billing cycles before anyone noticed. The reconciliation surfaced a contract-rate mistake on the vendor's side; the credit, when issued, paid for the next month of schema-repair retries.
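The arithmetic in the worked example can be checked in a few lines. The figures are the ones from the table; the variable names are illustrative. Note that the loaded total is built on the trace-derived inference cost rather than the invoice, which is how the table's components add up.

```python
# May figures from the worked example (USD).
trace_inference = 14_127
invoice = 15_341
non_inference = {
    "eval_grader": 847,
    "schema_repair_retries": 1_386,
    "human_escalation": 1_162,   # 581 cases at $2 per case
    "ops_overhead": 1_761,
}
accepted_outputs = 218_749
DELTA_THRESHOLD = 0.05  # investigate anything beyond 5%

# Step 1: trace-to-invoice delta.
delta = invoice - trace_inference
delta_share = delta / trace_inference
needs_investigation = abs(delta_share) > DELTA_THRESHOLD

# Steps 2-4: add non-inference costs, then divide by the accepted denominator.
loaded_cost = trace_inference + sum(non_inference.values())
lcpr = loaded_cost / accepted_outputs

print(f"delta: ${delta:,} ({delta_share:.1%}), investigate: {needs_investigation}")
print(f"loaded cost: ${loaded_cost:,}, LCPR: ${lcpr:.4f}")
```

The fifth step, rolling the loaded cost up against contracted unit rates, needs the rate card itself and is not sketched here.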
Use the reconciliation when: the invoice has moved more than 5% without an obvious volume or model change, finance is questioning the unit economics of a workload, or a contract amendment has just gone into effect. Skip it when the invoice has been stable for three consecutive months and the workload shape has not shifted.
Regulatory observability for explanation LLMs
The pharmacy example sat in a regulated workload but stopped short of the deepest cost driver in regulated inference: the audit and model-risk apparatus that wraps an LLM the moment its output becomes part of a decision a regulator can ask you to defend.
When an LLM drafts an explanation that accompanies a regulated decision (a denied loan, a denied claim, a prior-auth narrative, an adverse-action notice), that explanation LLM is no longer just an inference workload. Under the Federal Reserve's SR 11-7 guidance on model risk management, any quantitative method that informs a business decision is a "model" and falls under the bank or insurer's model-risk-management program. That means model inventory, independent validation, ongoing performance monitoring, change-control logs, and periodic re-validation, typically a two-to-six-month exercise per model version (Federal Reserve, SR Letter 11-7, "Guidance on Model Risk Management," April 2011). When the model is an LLM whose weights and prompt and tool-use scaffolding all sit on a third-party provider, the inventory entry has to capture each of those layers and the validation has to cover the joint behavior.
The Equal Credit Opportunity Act (ECOA, 15 U.S.C. §1691) layers on top. Under ECOA and Regulation B, a creditor that takes an adverse action against an applicant must provide a statement of "specific reasons" for the action. The CFPB's 2023 circular (Consumer Financial Protection Circular 2023-03) clarifies that creditors using complex algorithms, explicitly including generative AI, cannot fall back on generic templated reasons and remain compliant. If the explanation LLM hallucinates a reason that did not factor into the underlying decision, the creditor has issued a non-compliant adverse-action notice. The hallucination is not just a quality bug; it is a regulatory finding.
OCC Bulletin 2026-13 is the gray-zone wrinkle. The bulletin explicitly excludes general-purpose generative AI tools from the OCC's formal model-risk-management scope. The intent was to avoid sweeping every chat interface into an SR 11-7 review cycle. In practice, banks have read the carve-out conservatively and applied SR 11-7 anyway to any LLM whose output enters a regulated workflow. The legal posture is "the regulator did not require it, but the regulator did not bless skipping it either, and the cost of an examination finding exceeds the cost of validation." The result is that the gray zone collapses into the same operational burden as the in-scope case.
What this does to inference observability is concrete. Audit-log retention runs seven years for most banking workloads and ten years for insurance claims; each request needs the prompt, the model version, the system-prompt version, the tool-call graph, the raw model output, the post-processing, the eval result, and the final downstream decision, all joined on a request ID and immutable. The audit-log infrastructure (the storage, the indexing, the access controls, the retrieval API for a regulator request) typically runs two to three times the pure inference cost in regulated workloads when ops teams honestly account for it. At two to three times the $14,127 trace-derived inference cost, the pharmacy LCPR of $0.0882 per accepted draft would land closer to $0.22 to $0.28 with the regulated audit stack rolled in.
Use the regulated-workload framing when: the LLM output enters a decision a regulator can ask you to defend (adverse-action notices, denied claims, prior-auth narratives, clinical decision support, suitability assessments). The deterministic gate (Part 1) is not optional in these workloads; it is the only artifact that survives a model swap intact. Skip the framing when the workload is purely internal: engineering productivity, customer support that does not issue binding decisions, content drafting that a human always reviews.
If you cannot explain a cost change from your observability system, you are missing economic signals. The minimum viable economic observability is: per-request token counts (input, output, cached), cache hit rate, quality gate result, and cost estimate.
Present LCPR and the loaded-cost roll-up to engineering, product, and finance. The trace-to-loaded-cost reconciliation is the artifact that connects all three teams' views. Run it monthly. Investigate deltas above 5%.
What to measure
Economic signal set: input/output tokens, cache hit rate, quality gate pass rate, retry rate, model/prompt version
Cost attribution by workload (not just total)
Cache hit rate by prefix and by prompt version
Trace-to-invoice delta by workload
Cost breakdown: inference, cache, eval, human, ops
LCPR by workload and by account
Loaded cost vs invoice by account
Economic observability requires joining data from multiple systems: the serving engine, the provider, the eval system, and the business system. If these cannot be joined on a common request ID, the economic picture is incomplete.
Reconciliation breaks when traces are incomplete (sampling, dropped events), the invoice covers a different period, credits distort the invoice, human costs are not attributed to workloads, or the revenue model is bundled.
Calculator hook
The observability signals feed the calculator. The calculator view takes traces, invoice, eval results, human costs, ops overhead, and contracted unit rates. It outputs: LCPR by workload, loaded-cost roll-up by account, delta analysis, and variance waterfall.
Chapter 26: Incidents, Review, And Forecast
The field problem
A model provider ships an update. The model version changes silently. Output format shifts subtly. JSON keys that were lowercase are now camelCase. The parsing layer handles most of it, but 8% of outputs fail validation. The quality gate catches it. The repair loop kicks in: each failed output is retried with an explicit format instruction, doubling the token cost for those requests. Human escalation rate rises because some repaired outputs are still wrong.
The incident lasts 6 hours before the team identifies the root cause. During those 6 hours, LCPR increases 40%. The total incident cost: $2,400 in additional inference, $800 in human escalation, and 12 engineer-hours in investigation and mitigation.
The mechanism
Inference incidents differ from traditional software incidents. The system is often technically “up.” Requests are served, latency is normal, error rate is low. But quality degrades. The degradation is silent until the quality gate catches it (if the quality gate covers the failure mode) or until users report it (if it does not).
Common inference incident categories:
| Category | Trigger | Detection method |
|---|---|---|
| Model behavior change | Provider update, version change, weight refresh | Quality gate regression, output distribution shift |
| Prompt regression | Prompt edit, tool schema change, context mutation | Cache hit rate drop, quality gate regression |
| Cache regression | TTL change, routing change, prefix mutation | Cache hit rate drop, TTFT increase, cost spike |
| Capacity incident | Traffic spike, GPU failure, provider outage | Latency spike, error rate increase, queue depth |
| Quality drift | Gradual model degradation, eval set staleness | Slow quality decline over weeks, user complaint trend |
The fallback tax
Fallback paths have their own economics. When the primary route fails, the fallback may be: a different model, a rule-based response, a cached answer, or human escalation. Each has a cost:
Model fallback: Different model may be more expensive (larger model as safety net) or cheaper (smaller model with lower quality).
Rule-based fallback: No inference cost, but lower quality and higher escalation risk.
Human fallback: Highest cost per resolution, but guaranteed quality for the cases that reach it.
The fallback tax is the additional cost of operating the fallback path, including the detection time, the switching cost, and the quality delta during the incident.
The naive answer
“We have auto-rollback. Incidents resolve themselves.”
Auto-rollback helps for capacity and availability incidents. It does not help for quality incidents where the system is technically up. Quality degradation requires quality detection, which requires evals that cover the failure mode. If the eval suite does not cover the failure mode, detection depends on user complaints, which are slow, noisy, and incomplete.
The better model
Inference incident response has three phases:
1. Detection. Quality gate catches the regression (fast) or users report it (slow). The detection time determines how much damage accumulates.
2. Mitigation. Roll back the change, switch to fallback route, or apply a workaround. The mitigation time determines the blast radius.
3. Resolution. Fix the root cause: update the prompt, pin the model version, fix the parsing, expand the eval suite. The resolution time determines whether the incident recurs.
For economics, track: incident cost (additional inference + human + engineering time), blast radius (requests affected during detection + mitigation window), and root cause category (model, prompt, cache, capacity, quality drift).
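A minimal incident record for those three quantities might look like the sketch below. The dollar figures, the six-hour window, the 8% failure share, and the 12 engineer-hours come from the field problem at the start of this chapter; the request rate and the engineer hourly rate are hypothetical, and the class itself is illustrative.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    category: str              # model, prompt, cache, capacity, or quality_drift
    window_hours: float        # detection plus mitigation window
    requests_per_hour: float
    affected_share: float      # share of requests in the window that were affected
    extra_inference_usd: float
    human_escalation_usd: float
    engineer_hours: float

    def blast_radius(self):
        """Requests affected during the detection and mitigation window."""
        return round(self.window_hours * self.requests_per_hour * self.affected_share)

    def total_cost(self, engineer_hourly_usd):
        """Additional inference plus human escalation plus engineering time."""
        return (self.extra_inference_usd + self.human_escalation_usd
                + self.engineer_hours * engineer_hourly_usd)


incident = Incident(
    category="model",
    window_hours=6,
    requests_per_hour=5_000,     # hypothetical traffic level
    affected_share=0.08,         # 8% of outputs failed validation
    extra_inference_usd=2_400,
    human_escalation_usd=800,
    engineer_hours=12,
)

cost = incident.total_cost(engineer_hourly_usd=150)  # hypothetical loaded rate
radius = incident.blast_radius()
```

Pricing the engineering hours matters: at the assumed rate they add $1,800 to the $3,200 of direct spend, more than a third of the total.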
The inference review pack
The operational response to incidents and ongoing cost management needs a structured cadence. The inference review pack is a monthly artifact that connects engineering metrics to business decisions in a 30-minute review with engineering, product, and finance.
The pack opens with an LCPR summary: LCPR by workload, this month vs last month, with a delta explanation for each line. The delta explanation is the load-bearing column. It answers "why did cost change?" rather than "how much did it change?"
| Workload | LCPR (this month) | LCPR (last month) | Delta | Primary driver |
|---|---|---|---|---|
| Support chat | $0.172 | $0.168 | +2.4% | Output length +12% from prompt change |
| RAG extraction | $0.045 | $0.051 | -11.8% | Cache hit rate improved after prefix reorder |
| Coding agent | $1.24 | $1.18 | +5.1% | Repair rate +3pp from model version change |
From there the pack moves through quality, cost composition, capacity, incidents, and recommended actions. The quality view shows eval pass rate by workload, the trend over recent months, and any notable failure modes. The cost breakdown decomposes inference, cache, eval, human, and ops spend by workload, with a month-over-month waterfall that points to the component that drove the movement.
Capacity reporting splits by deployment shape. Dedicated workloads get a utilization-versus-latency view and a headroom assessment; serverless workloads get a volume trend and a rate-limit proximity check. The incidents section closes the operational loop: count, total cost, root causes, and the preventive actions that fell out of the post-mortem. The pack ends with two or three specific recommendations, each with an expected impact and an owner.
The review pack forces three disciplines: explanation over observation (every delta needs a proposed cause), multi-audience communication (LCPR is the shared language between engineering, product, and finance), and action orientation (every review produces 2-3 actions with owners).
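The first discipline can be enforced mechanically when the pack is assembled. The sketch below recomputes the delta column from the LCPR summary table and refuses to build a row that has no proposed cause; the function name and row format are illustrative.

```python
def lcpr_summary(rows):
    """Compute month-over-month deltas; reject any row without an explanation."""
    summary = []
    for row in rows:
        if not row.get("driver"):
            raise ValueError(f"{row['workload']}: delta has no proposed cause")
        delta_pct = (row["this_month"] / row["last_month"] - 1) * 100
        summary.append({**row, "delta_pct": round(delta_pct, 1)})
    return summary


rows = [
    {"workload": "Support chat", "this_month": 0.172, "last_month": 0.168,
     "driver": "Output length +12% from prompt change"},
    {"workload": "RAG extraction", "this_month": 0.045, "last_month": 0.051,
     "driver": "Cache hit rate improved after prefix reorder"},
    {"workload": "Coding agent", "this_month": 1.24, "last_month": 1.18,
     "driver": "Repair rate +3pp from model version change"},
]

summary = lcpr_summary(rows)
for row in summary:
    print(f"{row['workload']}: {row['delta_pct']:+.1f}%  {row['driver']}")
```

Failing loudly on a missing driver is a design choice: a pack that cannot be generated without an explanation makes the explanation part of the monthly work rather than an afterthought.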
Forecasting and planning
A finance partner asks: “What will inference cost next quarter?” The answer is scenario modeling, not trend extrapolation. The cost is a function of multiple variables that change independently: volume, token mix, cache hit rate, model, route, quality gate strictness, and operational overhead.
Build forecasts from the LCPR components, not from the total. Each component gets its own forecast:
| Component | What drives it | How to forecast |
|---|---|---|
| Volume | Product roadmap, seasonal patterns, user growth | Product team input + historical seasonality |
| Inference cost per request | Model, token mix, cache hit rate, provider pricing | Current LCPR × expected changes |
| Eval cost per request | Eval sample rate, grader model, grader complexity | Current rate × planned eval changes |
| Human cost per request | Escalation rate, labor rate, automation improvements | Current rate × quality improvement trajectory |
| Acceptance rate | Quality gate, model quality, prompt quality | Current rate × expected quality changes |
Scenario table
| Scenario | Volume change | Model change | Cache change | LCPR impact | Quarterly cost |
|---|---|---|---|---|---|
| Base case | +5% | None | Stable | +5% | $127,500 |
| New workload | +5% existing + 20K new | None | New workload cold | +12% | $136,000 |
| Model migration | +5% | Cheaper model, -30% token cost | Rebuild cache | -18% (after ramp) | $106,000 |
| Prompt change | +5% | None | Hit rate drops 20pp | +22% | $148,000 |
| All planned changes | +5% + 20K new | Migration | Rebuild + new workload | Range: -5% to +15% | $115,000 - $140,000 |
The forecast is a range, not a point. Present the range with the assumptions that drive each scenario. Let finance choose the scenario for budget planning.
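A component-based forecast can be sketched as a small model that is varied per scenario. The inputs below are hypothetical and do not reproduce the scenario table: the per-request costs, the acceptance rate, and the scenario multipliers are assumptions, and ops overhead is left out for brevity.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Components:
    requests: int                  # quarterly volume
    inference_per_request: float   # USD
    eval_per_request: float        # USD
    human_per_request: float       # USD
    acceptance_rate: float


def quarterly_cost(c):
    per_request = c.inference_per_request + c.eval_per_request + c.human_per_request
    return c.requests * per_request


def lcpr(c):
    """Loaded cost divided by accepted outputs."""
    return quarterly_cost(c) / (c.requests * c.acceptance_rate)


current = Components(1_000_000, 0.08, 0.01, 0.03, 0.88)  # hypothetical actuals
base = replace(current, requests=round(current.requests * 1.05))

scenarios = {
    "base case": base,
    # Assumed: cheaper model cuts inference cost per request by 30%.
    "model migration": replace(
        base, inference_per_request=base.inference_per_request * 0.70),
    # Assumed: prompt change raises inference cost per request by 25%.
    "prompt change": replace(
        base, inference_per_request=base.inference_per_request * 1.25),
}

forecast = {name: quarterly_cost(c) for name, c in scenarios.items()}
low, high = min(forecast.values()), max(forecast.values())
print(f"range: ${low:,.0f} to ${high:,.0f}")
```

Because each scenario changes only the components it touches, the assumption behind every number stays visible, and updating one component with actuals does not require rebuilding the whole forecast.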
Treat quality incidents with the same severity as availability incidents. Invest in quality detection proportionally to the cost of undetected quality degradation. Produce the inference review pack monthly. Forecast by component and by scenario. Present ranges, not points.
What to measure
Mean time to detect quality incidents
Incident cost by category (inference, human, engineering)
Blast radius (requests affected before mitigation)
Fallback activation rate and fallback cost
Root cause distribution (model, prompt, cache, capacity, drift)
LCPR by workload, month-over-month
Cost breakdown by component, month-over-month
Action completion rate from previous reviews
Forecast accuracy: predicted vs actual by component
Assumption tracking: which assumptions were wrong, by how much
Quality incident detection requires evals that cover the failure mode. If the failure mode is new, the eval suite will not catch it until the cases are added. Continuous log review of actual production outputs is essential as a complement to automated evals.
The review pack requires data from traces, invoices, evals, and support systems. Start with what you have. A review pack with LCPR from traces and invoice total (without full reconciliation) is better than no review pack.
Forecasting breaks when the team does not know about planned changes, model pricing changes without notice, or the forecast model is not updated with actuals. The biggest forecasting failure is presenting a single number without assumptions.
Calculator hook
The incident cost view tracks additional inference cost during the incident, human escalation cost, engineering time, and blast radius. The review-pack view is a calculator output template with the six sections above. The forecast view takes current LCPR by workload and the list of planned changes, and outputs quarterly cost by scenario, with an assumption register that tags each input as measured, estimated, or assumed.
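A minimal sketch of the forecast view, assuming a simple multiplicative model: quarterly cost is LCPR times accepted outputs, and each scenario carries a low and a high LCPR change. The class names, the `quarterly_range` helper, and every number are hypothetical; the baseline is synthetic and does not reproduce the scenario table above.

```python
from dataclasses import dataclass

@dataclass
class Assumption:
    name: str
    value: float
    status: str  # "measured", "estimated", or "assumed"

@dataclass
class Scenario:
    name: str
    lcpr_change_low: float   # e.g. -0.10 for -10%
    lcpr_change_high: float  # e.g. +0.20 for +20%

def quarterly_range(lcpr: Assumption, volume: Assumption, scenario: Scenario) -> tuple[float, float]:
    """Return the (low, high) quarterly cost for one scenario."""
    base = lcpr.value * volume.value
    return base * (1 + scenario.lcpr_change_low), base * (1 + scenario.lcpr_change_high)

# Synthetic inputs; each one is tagged so the reader can see what the range rests on.
lcpr = Assumption("LCPR per accepted output ($)", 0.12, "measured")
volume = Assumption("accepted outputs per quarter", 1_000_000, "estimated")
register = [lcpr, volume]

scenarios = [
    Scenario("no change", 0.0, 0.0),
    Scenario("all planned changes", -0.10, 0.20),
]

for s in scenarios:
    low, high = quarterly_range(lcpr, volume, s)
    print(f"{s.name}: ${low:,.0f} - ${high:,.0f}")

# Inputs that are not yet measured are the ones finance should question first.
unmeasured = [a.name for a in register if a.status != "measured"]
print("Not yet measured:", unmeasured)
```

Keeping the register next to the forecast means the range never travels without the assumptions that produced it.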
Chapter 27: Cost Attribution and Chargeback
The field problem
A single inference platform serves four product teams. The provider invoice arrives once a month as one line item. Finance asks engineering to allocate the spend. Engineering pulls usage from the platform logs and discovers that the four teams' usage cannot be reconciled to the invoice within 6%. Two of the teams insist their workloads are smaller than what the logs show. One team is using a fine-tuned LoRA that costs no more than the base model on the invoice but should be allocated differently for internal margin. The fourth team's traffic spikes are absorbed by spare capacity that exists only because team one's workload is bursty in the opposite direction.
The platform owner has three weeks to produce a chargeback model, and the teams have to agree that the model is fair before the next budget cycle. None of the standard cost-allocation frameworks handles workloads that share capacity, caches, fine-tunes, or a provider account.
Why attribution is not the same as measurement
Measurement asks how much was spent. Attribution asks who pays. Two requests can be measured identically and attributed differently. A 4,000-token completion served at 6am on a dedicated endpoint costs the platform the same as the same completion served at 6pm, but the 6pm request occupied capacity that another team would have used; the 6am request did not. Attribution that ignores this asymmetry rewards teams whose workload shape happens to align with the platform's underused windows and punishes teams whose workload shape forces capacity additions.
The decision is not which model is mathematically correct. It is which model survives a negotiation with four teams that all want the lowest number. The attribution framework that minimizes negotiation cost across renewal cycles is usually the one that maps cleanly to observable per-team behavior, even if it loses some accuracy at the margins.
Four allocation modes
Most production attribution settles into one of four shapes:
Uniform. Total spend divided by team count, or by seat count, or by some other coarse denominator. Easiest to compute. Easiest to game. Defensible only when usage is genuinely similar across teams; almost never true once one team has a high-fanout agentic workload and another has a chat workload. The right default for the first 90 days of a platform when nobody has the data to argue with it.
Proportional. Each team's share of measured usage (tokens, requests, accepted outputs) multiplied by the invoice total. Reconciles to the invoice exactly. Loses the asymmetry: a team with bursty traffic that requires capacity additions pays the same per-token as a team with smooth traffic that fits in the gaps. Right when traffic shape is roughly uniform across teams or when capacity is genuinely elastic (serverless) and the invoice reflects only consumed work.
Weighted by SLA and shape. Proportional, but each team's share is multiplied by a factor that reflects how much capacity their workload requires beyond average use. A coefficient over 1 for high-burst or tight-SLA workloads; under 1 for batch-tolerant or off-peak workloads. The coefficients are usually inherited from the SLA tier the team purchased internally (tier-1 latency = 1.4x; tier-3 batch = 0.7x). Reconciles to the invoice only if the weighted shares are renormalized to sum to 1; otherwise the allocations over- or under-collect. Survives the negotiation because the coefficients are tied to a tier choice the team already made.
Hybrid. A floor (uniform or per-seat) covering the platform's fixed cost (engineering time, the eval system, monitoring, the on-call burden) plus a usage-weighted variable component for inference cost itself. The floor solves the "team one moved their workload off the platform and the platform still costs the same to run" problem. The variable component handles the usage asymmetry. This is what most mature internal platforms end up running.
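The four modes can be compared side by side on the same inputs. The sketch below is one possible reading of them, not a reference implementation: the `Team` fields, the function names, the three teams, and all dollar and token figures are hypothetical. It assumes the uniform, proportional, and weighted modes allocate total spend (inference plus fixed cost), while the hybrid mode splits the fixed cost by seat and the inference cost by tokens.

```python
from dataclasses import dataclass

@dataclass
class Team:
    name: str
    tokens: int        # measured usage for the period
    seats: int
    sla_weight: float  # e.g. 1.4 for tier-1 latency, 0.7 for tier-3 batch

def uniform(teams, total):
    return {t.name: total / len(teams) for t in teams}

def proportional(teams, total):
    usage = sum(t.tokens for t in teams)
    return {t.name: total * t.tokens / usage for t in teams}

def weighted(teams, total):
    # Dividing by the weighted total renormalizes the shares so they sum to the invoice.
    weighted_usage = sum(t.tokens * t.sla_weight for t in teams)
    return {t.name: total * t.tokens * t.sla_weight / weighted_usage for t in teams}

def hybrid(teams, inference_cost, fixed_cost):
    # Floor split by seat count; inference cost split by measured usage.
    seats = sum(t.seats for t in teams)
    variable = proportional(teams, inference_cost)
    return {t.name: fixed_cost * t.seats / seats + variable[t.name] for t in teams}

# Synthetic inputs for illustration.
teams = [
    Team("chat", 600_000_000, 10, 1.4),
    Team("batch", 300_000_000, 30, 0.7),
    Team("agent", 100_000_000, 10, 1.0),
]
inference_cost, fixed_cost = 42_000.0, 14_000.0
total = inference_cost + fixed_cost

modes = {
    "uniform": uniform(teams, total),
    "proportional": proportional(teams, total),
    "weighted": weighted(teams, total),
    "hybrid": hybrid(teams, inference_cost, fixed_cost),
}

for t in teams:
    amounts = [modes[m][t.name] for m in modes]
    spread = max(amounts) - min(amounts)  # how much of the argument is mode-driven
    print(f"{t.name}: " + ", ".join(f"{m}=${modes[m][t.name]:,.0f}" for m in modes)
          + f" (spread ${spread:,.0f})")
```

Every mode sums to the same total; only the split moves. The per-team spread is a quick way to see which teams have the most at stake in the choice of mode.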
The shared cache and fine-tune problem
Two specific allocation cases break naive proportional models.
The first is prompt caching. A system prompt shared across teams is paid for at the full rate the first time it loads, then runs at a 50-90% discount for every subsequent team that hits it. The team that warmed the cache subsidizes the teams that benefit. Three options: pretend the discount does not exist and bill each team the un-cached rate (the platform pockets the discount, teams underconsume the cache); pass the cache discount through evenly by measured token (the warming team subsidizes the followers); attribute the cache to the first team that used it in each TTL window (penalizes the warming team in proportion to the cache TTL). The third is fairest in principle and impossible to audit in practice. Most platforms settle on the second.
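The difference between the first two options is easy to see with numbers. The sketch below is a simplification under stated assumptions: a single TTL window, one cold load billed at the un-cached rate, no cache-write premium, and hypothetical prices (a 90% cached-read discount), prefix length, and request counts.

```python
UNCACHED_PER_MTOK = 3.00  # hypothetical price, $ per million input tokens
CACHED_PER_MTOK = 0.30    # hypothetical cached-read price (90% discount)
PREFIX_TOKENS = 2_000     # length of the shared system prompt

# Requests per team that hit the shared prefix in one TTL window (synthetic).
requests = {"team_a": 1_000, "team_b": 3_000, "team_c": 6_000}
total_requests = sum(requests.values())

# What the provider charges: one cold load at the full rate, the rest as cached reads.
actual_cost = (
    PREFIX_TOKENS * UNCACHED_PER_MTOK
    + (total_requests - 1) * PREFIX_TOKENS * CACHED_PER_MTOK
) / 1e6

# Option 1: bill every team the un-cached rate; the platform keeps the difference.
option1 = {team: n * PREFIX_TOKENS * UNCACHED_PER_MTOK / 1e6 for team, n in requests.items()}
platform_margin = sum(option1.values()) - actual_cost

# Option 2: pass the discount through evenly by measured token.
blended_rate = actual_cost / (total_requests * PREFIX_TOKENS)  # $ per token
option2 = {team: n * PREFIX_TOKENS * blended_rate for team, n in requests.items()}

for team in requests:
    print(f"{team}: option 1 ${option1[team]:.2f}, option 2 ${option2[team]:.2f}")
print(f"platform margin under option 1: ${platform_margin:.2f}")
```

Under option 2 the cold load is blended into everyone's rate, so no team can be identified as the one that warmed the cache. That is why it is auditable and why it hides the subsidy.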
The second is fine-tunes and LoRA adapters. A LoRA trained by one team that improves accuracy for another team's workload is a transfer payment the proportional model misses. If the LoRA training cost is allocated only to the team that ran the training, that team subsidizes the consumers. If it is allocated to anyone who calls the LoRA, the team that trained it captures revenue inside its budget that did not flow through any contract. Most platforms either (a) charge the training cost to the training team and let the benefit flow to consumers (encourages investment, punishes the investor at budget time), or (b) treat the LoRA as a shared asset with a one-time amortization across all expected callers (cleaner economically, harder to negotiate).
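Options (a) and (b) differ only in who carries the training cost. A minimal sketch, assuming the cost is amortized in proportion to expected call volume over the amortization period; the `amortize` helper, team names, training cost, and call volumes are all hypothetical.

```python
def amortize(training_cost: float, expected_calls: dict[str, int]) -> dict[str, float]:
    """Spread a one-time training cost across teams by expected call volume."""
    total = sum(expected_calls.values())
    if total == 0:
        raise ValueError("no expected callers to amortize across")
    return {team: training_cost * calls / total for team, calls in expected_calls.items()}

training_cost = 12_000.0  # synthetic one-time LoRA training cost, in dollars
expected_calls = {"trainer": 200_000, "consumer_a": 500_000, "consumer_b": 300_000}

# Option (a): the training team carries the whole cost.
option_a = {team: (training_cost if team == "trainer" else 0.0) for team in expected_calls}

# Option (b): shared asset, amortized across all expected callers.
option_b = amortize(training_cost, expected_calls)

for team in expected_calls:
    print(f"{team}: (a) ${option_a[team]:,.0f}  (b) ${option_b[team]:,.0f}")
```

Option (b) depends on a call-volume forecast, which is the part teams negotiate over: a consumer that expects to call the LoRA heavily has an incentive to understate that expectation.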
What goes in the internal chargeback report
The monthly chargeback report a platform owner sends to product teams should contain, at minimum:
Total spend by team, broken into inference, eval grader, human escalation, and operational allocation.
LCPR by team, with the denominator (accepted outputs) defined per team's quality gate. A team running answer-drafting and a team running batch enrichment should not share an LCPR definition.
Variance versus prior month and versus quarterly forecast, with the largest three deltas explained.
Cache and fine-tune transfers (which team warmed which cache; which team's fine-tune is callable by which other team).
Capacity attribution: how much of the dedicated capacity each team's workload pinned, and what fraction of that capacity went unused.
One paragraph of trend analysis. Not "team X is up 12%" but "team X moved from chat to agentic and the per-accepted-output cost reflects the fanout."
The report is for the budget conversation, not the operational one. Treat it as a quarterly negotiation artifact rather than a real-time dashboard.
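The variance line of the report, with its three largest deltas, can be generated straight from two months of allocated spend. The sketch below assumes spend is keyed by team and category; the `top_deltas` helper, the team names, and the figures are hypothetical.

```python
def top_deltas(current: dict, prior: dict, n: int = 3) -> list:
    """Return the n line items with the largest absolute month-over-month change."""
    keys = set(current) | set(prior)
    deltas = {k: current.get(k, 0.0) - prior.get(k, 0.0) for k in keys}
    return sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)[:n]

# Synthetic allocated spend in dollars, keyed by (team, category).
prior = {
    ("search", "inference"): 16_000.0,
    ("search", "eval_grader"): 1_400.0,
    ("support", "inference"): 16_500.0,
    ("support", "human_escalation"): 2_600.0,
    ("support", "operational"): 3_000.0,
}
current = {
    ("search", "inference"): 21_000.0,
    ("search", "eval_grader"): 1_500.0,
    ("support", "inference"): 15_500.0,
    ("support", "human_escalation"): 4_200.0,
    ("support", "operational"): 3_000.0,
}

largest = top_deltas(current, prior)
for (team, category), delta in largest:
    print(f"{team} / {category}: {delta:+,.0f}")
```

The code finds the deltas; the explanation for each one still has to be written by someone who knows what the team changed.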
An anonymized case: a mid-size public-sector platform
A consolidated state-agency platform served seven program offices on one inference contract. The initial chargeback model was proportional by token. After two quarters, three offices stopped using the platform and stood up their own contracts; the remaining four absorbed the fixed cost and complained loudly. The platform owner switched to a hybrid model: a $14K monthly floor split by seat count covering the contract minimum and the platform engineering team, plus a per-token usage charge above the floor at the rate that reconciled to the invoice. Two of the three departed offices came back within a quarter when their standalone contracts ran into minimum commits they could not fill. The third never returned. The lesson the owner reported: the proportional model was technically accurate and politically dead. The hybrid model was technically inferior on margin attribution and survived three renewal cycles.
The side-finding: the floor exposed how much of the platform cost was actually platform overhead rather than inference. The seven offices, taken together, were paying about $42K/month in inference and $14K/month in fixed costs. The first reaction inside finance was to drive the fixed cost down. The second reaction, after the platform team explained what the floor covered, was to consolidate two adjacent platforms onto the same floor. The marginal cost of supporting one more office was small, and the existing offices were happy to dilute their floor share with new entrants.
Attribution fails when teams share a single LLM call across two product surfaces (the call cannot be split without re-running it), when the eval system is centralized and the cost of running evals is not allocated by which team's outputs were graded, or when a fine-tune fails in production and the cost of the rollback (re-running on the previous model) is borne by the team that requested the fine-tune rather than the team that approved it.
Attribution also breaks during incidents. A two-hour outage costs the platform the same in dedicated capacity as a normal two-hour window, but no team accepts being billed for capacity that produced no accepted outputs. The usual remedy is to credit incident windows against the affected teams' usage, paid out of the platform's reserve. The reserve has to be funded by something, usually a small percentage of every team's usage charge, which is itself an attribution decision that teams will eventually challenge.
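The reserve mechanics reduce to a surcharge going in and credits coming out. A minimal sketch, assuming a flat reserve rate on usage charges; the 2% rate, the `monthly_bill` helper, the team names, and the figures are all hypothetical.

```python
RESERVE_RATE = 0.02  # hypothetical: 2% of every usage charge funds the reserve

def monthly_bill(usage_charge: float, incident_credit: float = 0.0) -> float:
    """Usage charge plus reserve contribution, minus any incident credit."""
    return usage_charge + usage_charge * RESERVE_RATE - incident_credit

# Synthetic month: team_a was affected by an outage window and receives a credit.
usage = {"team_a": 20_000.0, "team_b": 10_000.0}
credits = {"team_a": 150.0}

bills = {team: monthly_bill(charge, credits.get(team, 0.0)) for team, charge in usage.items()}

reserve_in = sum(charge * RESERVE_RATE for charge in usage.values())
reserve_out = sum(credits.values())
reserve_change = reserve_in - reserve_out  # negative means the reserve is being drawn down

for team, bill in bills.items():
    print(f"{team}: ${bill:,.2f}")
print(f"reserve change this month: ${reserve_change:+,.2f}")
```

Note that team_b funds part of team_a's credit through the surcharge. That cross-subsidy is the attribution decision teams will eventually challenge.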
Calculator hook
The cost-attribution view takes per-team trace volume, per-team accepted output counts, each team's SLA tier coefficient, the cache and fine-tune transfer table, and the platform fixed cost. It outputs allocated spend by team under each of the four allocation modes, along with the variance between modes so the platform owner can see how much of the negotiation is mode-driven rather than usage-driven.
Use the hybrid mode when the platform has more than two teams, runs a dedicated capacity floor, or has fixed engineering overhead that does not vary with usage. Use proportional when the platform is serverless and every dollar of cost is directly traceable to a team's request. Use weighted when teams have explicit SLA tiers. Use uniform only for the first quarter of a platform, before there is enough data to argue.
Part 5 Summary
Part 5 established the operating cadence for inference economics and the attribution mechanics that decide who pays for what:
| Concept | Where defined | What it does |
|---|---|---|
| Baseline and eval instrument | Chapter 23 | Snapshot before change; three-layer eval stack controls LCPR denominator |
| Benchmark hygiene and goodput | Chapter 24 | Goodput frontier test replaces vendor throughput numbers |
| Observability and trace-to-loaded-cost | Chapter 25 | Economic signals explaining cost movement; joins trace, invoice, eval, and contract into a reconciled loaded cost |
| Incidents, review, and forecast | Chapter 26 | Quality incidents, review pack cadence, scenario-based LCPR forecasting |
| Cost attribution and chargeback | Chapter 27 | Allocation modes (uniform, proportional, weighted, hybrid) that divide a shared inference bill across teams |
The operating cadence forms a loop.
Each step feeds the next. The baseline enables measurement. Measurement feeds evaluation. Evaluation enables reconciliation. Reconciliation informs the review. The review drives actions. Actions change the system. The forecast predicts the impact of changes. And the cycle starts again with a new baseline.
The book opened with the deterministic gate as the cheapest unit of inference economics. Part 5 closes the loop: measure LCPR, evaluate quality, reconcile traces to invoices, review monthly, forecast by scenario, and act on evidence. The gate keeps the cheapest workloads cheap. The cadence keeps the rest honest.
Evidence Notes for Part 5
| # | Claim | Label | Source | Chapter |
|---|---|---|---|---|
| 1 | Goodput under SLO formula and worked example | DERIVED | Derivation 5, from SLO definition and standard metrics | 24 |
| 2 | Trace-to-loaded-cost reconciliation formula | DERIVED | Derivation 6, from LCPR extended to invoice and revenue | 25 |
| 3 | Benchmark-to-production transfer gap exists | DERIVED | From roofline analysis (Part 2) applied to benchmark conditions | 24 |
| 4 | Output tokens typically cost 2-6x input tokens on major API providers | PUBLIC | Provider pricing pages, accessed 2026-05-12 | 25 |
| 5 | Eval design needs task, trial, grader, assertions, transcript | PUBLIC | Anthropic agent eval guidance, OpenAI eval docs | 23 |
| 6 | SWE-bench Verified flawed tests and contamination | PUBLIC | OpenAI benchmark analysis post, 2026-03 | 24 |
| 7 | Cost optimization can reduce quality (Claude Code postmortem) | REPORTED | Anthropic engineering postmortem, 2026-04-23 | 26 |
| 8 | Three-layer eval stack (deterministic, model-graded, human) | DERIVED | From Anthropic, OpenAI, and Google eval guidance synthesized | 23 |
| 9 | Inference review pack structure | OPINION | Author synthesis of operating review practices | 26 |
| 10 | Scenario-based forecasting with LCPR components | OPINION | Author synthesis of financial planning applied to inference | 26 |
| 11 | Quality incidents differ from availability incidents | DERIVED | From inference serving characteristics vs traditional services | 26 |
| 12 | Cache hit rate drop causes cost spike (prompt change example) | DERIVED | From cache break-even analysis (Part 2) applied to operations | 25, 26 |
| 13 | Minimum viable baseline metrics | OPINION | Author synthesis from production observability | 23 |
| 14 | Delta investigation threshold 5% | OPINION | Author recommendation, calibrated from Part 1 | 25 |
| 15 | All worked example numbers | SYNTHETIC | Illustrative, shaped by real billing semantics | 25, 26 |