Chapter 1: Token Price Is Not Cost
Field Problem
Three inference workloads share one provider account: support chat, document extraction, and a coding agent. The provider bills by input tokens, output tokens, cached input tokens, and cache write tokens. Finance reports cost per million tokens. Engineering reports cost per request. Product reports cost per feature.
None of these numbers agree. None of them answer the question that matters: what does it cost to produce one unit of accepted work?
The Billing Grammar
Every provider invoice speaks a billing grammar. The components vary, but the structure is consistent:
| Component | What it covers | Who charges it |
|---|---|---|
| Input tokens | Tokens sent to the model (prompt, system, tools, context) | All major API providers |
| Output tokens | Tokens generated by the model | All major API providers |
| Cached input tokens | Input tokens served from prefix cache instead of recomputed | Most providers (discount varies) |
| Cache write tokens | Input tokens written to cache (can cost more than uncached input) | Anthropic, Google |
| Cache storage | Hourly/daily rent for cached state | Google (explicit), others (implicit) |
| Batch tokens | Tokens processed in asynchronous batch mode (typically 50% discount) | OpenAI, Anthropic, Google |
| Tool use | Additional tokens for tool definitions and results | Some providers (implicit in token count) |
| Image/audio tokens | Tokens derived from non-text input | Multimodal providers |
| Fine-tuning | Training cost for custom model weights | Providers offering fine-tuning |
| Dedicated capacity | Hourly/monthly fee for reserved compute | AWS, Azure, Baseten, CoreWeave, DeepInfra, Fireworks, Google, Lambda, Replicate, Together |
| Minimum billing | Minimum input tokens per request (e.g., 128 or 1,024) | Some providers |
The billing grammar is the first thing to understand because it defines what the provider measures. But what the provider measures is not what the product needs to measure.
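As a minimal sketch of how these components combine into a per-request charge, the snippet below prices one request from its token counts. The prices in `PRICES_PER_MTOK`, the field names, and the `min_input_tokens` and `batch_discount` parameters are hypothetical placeholders, not any provider's actual rates or API. Providers also differ on whether cached tokens are reported inside or alongside the input count, so the separation used here is an assumption.

```python
from dataclasses import dataclass

# Hypothetical price sheet in USD per million tokens (not real provider rates).
PRICES_PER_MTOK = {
    "input": 3.00,
    "output": 15.00,
    "cached_input": 0.30,
    "cache_write": 3.75,
}


@dataclass
class Usage:
    input_tokens: int              # uncached input tokens
    output_tokens: int
    cached_input_tokens: int = 0   # input served from the prefix cache
    cache_write_tokens: int = 0    # input written to the cache


def request_cost(usage: Usage, min_input_tokens: int = 0,
                 batch_discount: float = 0.0) -> float:
    """Price one request in USD from its billed token components."""
    # Minimum billing: short prompts are charged as if they had the minimum length.
    billed_input = max(usage.input_tokens, min_input_tokens)
    cost = (
        billed_input * PRICES_PER_MTOK["input"]
        + usage.output_tokens * PRICES_PER_MTOK["output"]
        + usage.cached_input_tokens * PRICES_PER_MTOK["cached_input"]
        + usage.cache_write_tokens * PRICES_PER_MTOK["cache_write"]
    ) / 1_000_000
    # Batch mode is modeled as a flat fractional discount on the whole request.
    return cost * (1.0 - batch_discount)


example = Usage(input_tokens=2_000, output_tokens=500, cached_input_tokens=6_000)
example_cost = request_cost(example)
```

Even in this toy form, the same request has several defensible prices depending on cache state, batch mode, and minimums, which is one reason the three teams' numbers do not agree.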
What The Provider Measures vs What You Need
The provider measures token throughput and bills accordingly. Your product measures task success. The gap between these two measurements is where cost models go wrong.
| Provider sees | Product needs |
|---|---|
| Input tokens consumed | Relevant input tokens (how much was retrieval waste?) |
| Output tokens generated | Accepted output tokens (did the answer pass quality?) |
| Cache hits | Cache hits on stable prefixes (were the hits meaningful?) |
| Requests served | Requests that completed within SLO |
| Uptime | Reliability at the tail (p99, not mean) |
| Total usage | Usage attributable to a workload, customer, or account |
A provider can report 100% uptime while your workload experiences p99 latency violations, cache miss storms, and quality regressions. The metrics live in different systems and answer different questions.
What Is Not On The Invoice
Token price captures only one line of the LCPR statement (loaded cost per result, defined in Chapter 2): the provider invoice. Five other lines are typically larger combined.
Retries add 4-8% on most workloads, more on workloads with strict structured-output gates. The retry is invisible if you divide total spend by total requests rather than by accepted work.
Eval and grader calls are themselves inference, sized to the eval cadence and the grader model. A quality-sensitive workload running LLM-graded eval on 20% of traffic with a same-class grader runs roughly 5-15% of primary inference in grader cost. Deterministic checks and human review run a different cost surface but are still off the inference invoice.
Repairs (retry with corrected input: better context, longer window, adjusted prompt) concentrate on the failure tail. 5-12% of traffic is typical for RAG workloads where retrieval quality is uneven.
Human escalation usually dominates the loaded cost on quality-sensitive interactive workloads. Per-escalation cost runs 50-300x the per-request inference cost, depending on the support staffing model.
Operational overhead covers on-call, deployment, prompt and eval upkeep. It is a fixed cost amortized into per-request economics only if you allocate it. Most teams don't.
A workload where retries plus eval plus repair plus escalation plus ops are under 20% of inference is unusual. A workload where they are under 50% is the case where token price is a useful proxy for cost. The rest of the book is about the workloads where they are not.
The Unit Problem
The most common mistake in inference economics is choosing the wrong denominator.
| Denominator | What it hides |
|---|---|
| Per million tokens | Whether those tokens produced useful work |
| Per request | Whether the request succeeded |
| Per API call | Retries, repairs, and multi-step workflows |
| Per user | Variance across user behavior and workload mix |
| Per GPU-hour | Whether the GPU produced accepted work or wasted cycles |
The correct denominator depends on the workload. For a support answer system, it is “one accepted answer that passed quality, latency, and compliance gates.” For a coding agent, it is “one merged fix that passes tests and review.” For a document extraction pipeline, it is “one correctly extracted record set.”
The denominator must be defined before the cost model is built. If you cannot define what an accepted work unit is for your workload, your cost model is measuring activity, not value.
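To make the denominator effect concrete, the sketch below divides one month of spend by four different denominators. All counts are hypothetical and chosen only for illustration; the point is that one bill produces four different "unit costs".

```python
# Hypothetical one-month totals for a single workload (illustrative only).
spend_usd = 12_000.0
total_tokens = 4_000_000_000
api_calls = 900_000        # includes retries and repair calls
requests = 800_000         # user-visible requests
accepted_units = 600_000   # requests that passed every gate

cost_views = {
    "per_million_tokens": spend_usd / (total_tokens / 1_000_000),
    "per_api_call": spend_usd / api_calls,
    "per_request": spend_usd / requests,
    "per_accepted_unit": spend_usd / accepted_units,
}

for name, value in cost_views.items():
    print(f"{name}: ${value:.4f}")
```

Under these assumed counts, the per-accepted-unit figure is a third higher than the per-request figure and half again as high as the per-API-call figure, with no change in spend.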
Do not compare inference options by token price. Compare by loaded cost per accepted work unit.
Token price is one input to the cost model. It is not the cost model.
This rule has three exceptions:
Very simple workloads (single-turn, low-stakes, no quality gate) where token cost genuinely dominates and retries are rare. For these workloads, token price is a reasonable proxy.
Batch workloads where the 50% batch discount is the dominant economic lever and quality is checked downstream.
Workloads where the provider bill is a rounding error relative to the product’s overall cost structure. If inference is 0.1% of operating cost, the loaded-cost exercise is not the binding constraint.
Calculator Hook
Every calculator view in this manual normalizes to accepted work units, not raw tokens. The view takes raw traces and produces loaded cost per accepted request. The sensitivity analysis shows which input dominates the result.
If you do not yet have the trace coverage to drive the calculator, jump to the twenty-trace exercise in Chapter 3. It is the minimum viable starting point and the prerequisite for everything else in Part 1.
Chapter 2: LCPR — Loaded Cost Per Result
Field Problem
Two teams at the same company both report “inference cost per request.” Team A divides the monthly provider bill by total API calls. Team B divides total loaded cost (inference + eval + human review + ops) by accepted work units. Team A reports $0.014. Team B reports $0.172 for a similar workload. Finance asks which number is correct.
Both are calculated correctly. They answer different questions. Team A’s number tells you the average inference spend per API call. Team B’s number tells you what it costs to produce one unit of accepted work. Only Team B’s number is useful for margin analysis, pricing decisions, migration comparisons, or capacity planning.
The Formula
| Symbol | Unit | Meaning |
|---|---|---|
| C_inference | USD | Trace-derived inference cost (all calls: first attempt + retry + repair) |
| C_eval | USD | Eval grader cost (LLM grader calls, rubric checks) |
| C_human | USD | Human review, escalation, and repair labor |
| C_ops | USD | Amortized operational overhead allocated to this workload |
| delta | USD | Discrepancy between trace-derived cost and provider invoice |
| A | count | Accepted work units (tasks passing quality, latency, and compliance gates) |
The numerator includes all costs incurred in producing the output, including the cost of failed attempts. The denominator includes only accepted output.
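Written out from the symbols defined above, the definition can be stated as follows. This is a reconstruction from the symbol list and the worked example's arithmetic. Treating delta as an additive correction (invoice minus trace-derived cost) is an assumption about sign convention; the worked example in this chapter lists no delta term, which corresponds to delta = 0.

```latex
\mathrm{LCPR} = \frac{C_{\text{inference}} + C_{\text{eval}} + C_{\text{human}} + C_{\text{ops}} + \delta}{A}
```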
This is a definition, not a simplification. LCPR is what it costs to produce accepted work. The alternative is dividing inference spend by total requests, which produces a number that is accurate but not useful. Pricing snapshots, provider semantics, and cache behavior change. The formula is durable; the inputs are dated. See the assumption register in Chapter 3.
Worked Example
Anonymized: a mid-market property and casualty insurance carrier runs an internal-only model that drafts claims-adjuster guidance answers. Adjusters open a case, ask a natural-language question ("does this water-damage claim fall under the burst-pipe rider on policy form HF-3?"), and the model produces a draft response with citations to the underlying policy language. Adjusters edit and approve before any text reaches the policyholder. Roughly 11,800 queries per weekday. Three-gate validation pipeline: PII redaction, policy-rubric eval, and a compliance-regex pass that flags any draft containing boilerplate language tied to coverage decisions the model is not authorized to make.
In a six-week window the inference bill climbed 31% while case volume grew 7%. Quality pass rate was flat. Latency was stable. The team's first guess was traffic-mix shift toward longer-context cases, then a provider price change on the underlying model. Both were wrong. The trace-by-rejection-reason breakdown was the disambiguating measurement: compliance-regex rejections had climbed from 6.3% of traffic in the prior quarter to 22% in the current one. A risk-management team had widened the boilerplate regex to flag any draft mentioning "covered," "not covered," "subject to," or "pursuant to" outside of a citation block. Most flagged drafts were false positives. Each rejection paid the full input cost again on the regenerate cycle.
The denominator collapse showed up two weeks later. Adjusters with experience on the prior tool noticed that AI drafts were coming back to the queue more often than not. They started routing around the tool: typing answers manually and skipping the suggestion panel entirely. A traffic audit found that once the false-positive rate exceeded 18%, daily accepted drafts dropped from roughly 9,400 to roughly 5,100. The numerator (gross inference spend) was up 31%. The denominator (accepted drafts) had collapsed to 54% of its prior level. LCPR more than doubled, and most of the increase came from the denominator side.
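The split between numerator and denominator can be checked against the figures in this paragraph. As a rough sketch that treats gross inference spend N as the whole numerator (an assumption, since the other loaded components are not reported for the prior period):

```latex
\frac{\mathrm{LCPR}_{\text{after}}}{\mathrm{LCPR}_{\text{before}}}
= \frac{N_{\text{after}} / N_{\text{before}}}{A_{\text{after}} / A_{\text{before}}}
\approx \frac{1.31}{5{,}100 / 9{,}400}
\approx \frac{1.31}{0.54}
\approx 2.4
```

Under that assumption the denominator alone contributes a factor of about 1.84 and the numerator about 1.31.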
Computing LCPR over a representative week, with the numerator including all failed regenerate-cycles and the denominator counting only drafts that cleared all three gates and reached an adjuster who accepted them:
C_inference (trace-derived, all attempts): roughly $2,840
C_eval (policy-rubric grader pass): roughly $173
C_human (review queue allocated to drafts that bypassed adjuster acceptance): roughly $1,290
C_ops (pipeline maintenance, allocated): roughly $164
A (accepted drafts that passed all three gates and reached an adjuster who accepted them): roughly 27,300 over the week
LCPR: roughly $0.163 per accepted draft
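The arithmetic behind that figure can be reproduced directly. The sketch below uses the rounded weekly amounts listed above; the function name `lcpr` and the optional `delta` argument are illustrative choices, with `delta` defaulting to zero because the example reports no trace-to-invoice discrepancy.

```python
# Weekly figures from the worked example (USD, rounded in the text).
costs = {
    "inference": 2_840.0,
    "eval": 173.0,
    "human": 1_290.0,
    "ops": 164.0,
}
accepted_drafts = 27_300


def lcpr(components: dict[str, float], accepted: int, delta: float = 0.0) -> float:
    """Loaded cost per accepted result: all costs over accepted units only."""
    if accepted <= 0:
        raise ValueError("accepted work units must be positive")
    return (sum(components.values()) + delta) / accepted


total = sum(costs.values())
shares = {name: amount / total for name, amount in costs.items()}
weekly_lcpr = lcpr(costs, accepted_drafts)

print(f"numerator: ${total:,.0f}  LCPR: ${weekly_lcpr:.4f}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Because the inputs are rounded, the result agrees with the quoted LCPR only to about three decimal places.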
What Each Component Contributes
| Component | Amount | % of LCPR |
|---|---|---|
| Inference (trace) | $2,840 | 63.6% |
| Eval grader | $173 | 3.9% |
| Human review queue | $1,290 | 28.9% |
| Ops overhead | $164 | 3.7% |
| Total numerator | $4,467 | 100% |
The team's first reflex on seeing a 31% inference-bill jump was to negotiate a token-price discount. The breakdown pointed elsewhere: a deterministic pre-validation pass that flags drafts likely to fail the regex and routes them straight to a manual-edit path instead of regenerating, at 9 ms per draft and zero LLM tokens. After deployment, gross inference cost dropped 17% in two weeks. The accepted-draft count began recovering as adjusters returned to the suggestion panel; within four weeks A had climbed back toward its prior level.
The side-finding: the deterministic pre-validation surfaced a class of drafts that passed the regex but contained coverage-determination language that the original regex authors had not anticipated. Roughly 3.4% of historically accepted drafts were re-classified as needing risk-management review. The compliance team had been treating the regex as the gate. It was missing a category of language entirely. The remediation took two more quarters of regex and rubric work and was a much larger project than the cost overrun that started the investigation.
The LCPR pointed at the wrong lever first (token price). The numerator-and-denominator breakdown pointed at the right one (the deterministic gate and adjuster trust). The gate itself turned out to be incomplete in the other direction. Inference is 64% of LCPR on this workload, but the lever that moved LCPR most was not the inference cost. It was the false-positive rate driving the denominator.
The Accepted Work Unit
The denominator deserves its own definition. An accepted work unit is:
One task completion that passes the workload’s quality gate, meets its latency SLO, complies with its data-handling constraints, and does not require human intervention.
Defining the accepted work unit requires answers to four questions:
What is the task? Not “an API call” or “a request.” A task is the smallest unit of work that produces value: an accepted support answer, a correctly extracted record, a merged code fix, a graded exam response.
What is the quality gate? An eval score threshold, a deterministic check, a human review pass, or a downstream system acceptance. Without a quality gate, every completion counts as accepted, and the denominator is meaningless.
What is the latency SLO? Interactive workloads have TTFT and E2E constraints. Batch workloads have completion-window constraints. If a response is correct but arrives after the deadline, it may not count as accepted work.
What are the compliance constraints? Data residency, content logging restrictions, PII handling, retention policy. A response generated on the wrong infrastructure or stored in violation of policy is not accepted work regardless of quality.
If any of these four questions cannot be answered for a workload, the LCPR calculation will produce a number, but the number will not be useful for decisions. Define the accepted work unit before building the cost model.
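One way to keep the definition honest is to encode it as a single predicate that every task outcome must pass. The sketch below is a minimal illustration; the `TaskOutcome` fields and the 5,000 ms SLO are hypothetical, and a real workload would derive them from its own eval, trace, and compliance records.

```python
from dataclasses import dataclass


@dataclass
class TaskOutcome:
    passed_quality_gate: bool   # eval score, deterministic check, or review pass
    e2e_ms: float               # end-to-end latency of the task
    compliant: bool             # residency, logging, PII, retention all satisfied
    needed_human: bool          # escalated or manually repaired


def is_accepted(outcome: TaskOutcome, latency_slo_ms: float) -> bool:
    """A task counts only if it clears every gate in the definition."""
    return (
        outcome.passed_quality_gate
        and outcome.e2e_ms <= latency_slo_ms
        and outcome.compliant
        and not outcome.needed_human
    )


outcomes = [
    TaskOutcome(True, 1_800, True, False),   # accepted
    TaskOutcome(True, 9_500, True, False),   # correct but too slow
    TaskOutcome(False, 1_200, True, False),  # failed the quality gate
    TaskOutcome(True, 1_500, True, True),    # needed escalation
]
accepted_count = sum(is_accepted(o, latency_slo_ms=5_000) for o in outcomes)
```

Four completions, one accepted work unit: the count that goes into the LCPR denominator is `accepted_count`, not `len(outcomes)`.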
Report inference economics as LCPR per accepted work unit. Show the component breakdown. Make the dominant cost lever visible.
If inference dominates LCPR, optimize serving. If human escalation dominates, optimize quality. If ops overhead dominates, automate or consolidate. The lever depends on the breakdown, not on the token price.
LCPR is harder to compute cleanly in four situations:
Shared infrastructure. When multiple workloads share a dedicated endpoint, allocating C_ops and the GPU cost to individual workloads requires an allocation model. Allocation models are always wrong; the question is whether they are useful.
Amortized costs. Prompt engineering, fine-tuning, eval set construction, and model evaluation are upfront investments. Amortizing them into LCPR requires a time horizon and volume assumption.
Revenue attribution. LCPR tells you cost per accepted unit. Margin requires revenue per accepted unit. Revenue attribution for AI features embedded in larger products is often ambiguous.
Very low volume. At 10 requests per day, the ops overhead per request is large and unstable. LCPR is most useful when volume is high enough for the per-unit allocation to be meaningful.
Calculator Hook
The calculator view takes as input: trace data, invoice data, eval results, human review costs, and ops allocation. It produces: LCPR per accepted unit, component breakdown, sensitivity to each input, and trend over time.
Chapter 3: Trace, Invoice, Eval, and Contract
Field Problem
An engineering lead asks: “What is our inference cost?” Four people give four answers:
The engineer checks the provider dashboard: $14,200 per month.
The finance team checks the invoice: $14,850 per month.
The ML engineer checks the trace aggregator: $13,900 per month.
The product manager checks the customer contract: revenue of $52,000 per month against that spend.
Each person is looking at a different data source. Each source answers a different question. None of the four, alone, is sufficient for an inference economics decision.
The Four Data Sources
Inference economics requires joining four data sources. Each provides information the others lack. Each has failure modes the others correct.
Source 1: The Trace
The trace is the request-level event log. Each row represents one inference call: timestamps, token counts by type, latency percentiles, model, provider, route, cache state, retry count, tool calls, and outcome.
What the trace provides:
- Request-level cost attribution (which workload, which customer, which route)
- Latency breakdown (TTFT, TPOT, queue time, tool time)
- Cache behavior (hit/miss, prefix length, TTL)
- Failure classification (timeout, rate limit, quality fail, malformed output)
- Retry and repair chains (linking retries to their original request)
What the trace misses:
- Rounding and pricing adjustments applied by the provider
- Credits, commitments, and contract terms
- Costs outside the inference call (human review, ops overhead)
- Whether the provider’s token count matches your tokenizer’s count
Traces fail in characteristic ways: sampling or logging gaps drop calls (especially async tool calls and retry chains); the cost calculation uses a stale pricing snapshot from before a provider change; the trace's token count disagrees with the provider's tokenizer; and model-name aliases ("claude-sonnet-4" in the trace, "claude-sonnet-4-20260415" on the invoice) break the join. The model-name alias issue eats the most time in practice and is the one to instrument first.
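A first instrument for the alias problem can be very small. The sketch below assumes dated snapshots are written as the alias plus a trailing `-YYYYMMDD`, as in the example above; that convention is an assumption, and providers that use other schemes need an explicit mapping table instead. The function names are illustrative.

```python
import re

# Assumed convention: a dated snapshot is the alias plus "-YYYYMMDD".
_DATE_SUFFIX = re.compile(r"-\d{8}$")


def canonical_model(name: str) -> str:
    """Strip a trailing -YYYYMMDD snapshot suffix so trace and invoice names join."""
    return _DATE_SUFFIX.sub("", name.strip().lower())


def unmatched_models(trace_models: set[str], invoice_models: set[str]) -> set[str]:
    """Names that still fail to join after normalization; review these by hand."""
    trace = {canonical_model(m) for m in trace_models}
    invoice = {canonical_model(m) for m in invoice_models}
    return trace ^ invoice  # symmetric difference: present on one side only
```

Running `unmatched_models` on each billing period's distinct model names gives a short list to reconcile manually before any cost join is attempted.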
Source 2: The Invoice
The invoice is the provider’s billing record. It is the ground truth for what was charged, but not for why.
What the invoice provides:
- Actual charges after rounding, minimums, batch adjustments, and credits
- Billing period boundaries
- Committed spend tracking
- Tax and compliance line items
What the invoice misses:
- Request-level attribution (which workload, which customer)
- Why costs moved (was it volume, model mix, cache miss rate, or output length?)
- Latency, quality, or reliability information
- Costs from other providers in a multi-source architecture
Invoices fail in characteristic ways: end-of-month usage posts to the following invoice and breaks trace-invoice reconciliation across the boundary; aggregate line items combine workloads that traces separate; invoice model aliases do not match trace model names; and credits apply inconsistently across line items, so the charged amount displayed on the dashboard can be wrong by a credit cycle.
Source 3: The Eval
The eval is the quality measurement. It answers whether the inference output was accepted work or failed work.
What the eval provides:
- Pass/fail determination per output
- Quality score distribution
- Failure clustering (which failure modes are common)
- Repair rate and repair success rate
- The denominator for LCPR
What the eval misses:
- Cost information
- Latency information (unless combined with trace)
- Whether the eval itself is calibrated and reliable
- Long-term quality drift vs point-in-time score
Evals fail in characteristic ways: the LLM grader and a human reviewer disagree on the same outputs at rates above 10%; the eval set was built during initial development and no longer reflects production failure-mode frequency; the quality threshold is set low enough that everything passes (and real failures are invisible) or high enough that the workload triggers repairs and escalations it does not need; and eval coverage is partial, with a 5-10% sample on a long-tail workload missing the tail entirely.
Source 4: The Contract
The contract is the commercial agreement between your organization and the inference provider, or between your organization and your customer.
What the contract provides:
- Revenue per unit of work (for margin calculation)
- Committed spend and credit balances
- SLA terms and penalty triggers
- Data handling and residency requirements
- Rate limits and burst allowances
- Overage pricing
What the contract misses:
- Actual usage patterns
- Quality or latency performance
- Whether committed spend is being utilized efficiently
- Whether the customer is actually using the features that generate inference cost
Contracts fail in characteristic ways: committed spend does not match actual usage patterns, so over-commit burns credits and under-commit pays overage rates; credits mask the true run-rate cost on the dashboard; SLA terms do not match the product's actual reliability profile; and revenue recognition is out of sync with inference-cost timing, so margin reads correctly only at a quarterly aggregate.
Joining The Four Sources
None of the four sources is sufficient alone. Inference economics requires joining them.
The join is difficult because:
Traces and invoices use different time boundaries.
Traces have request-level granularity; invoices have aggregate line items.
Evals may cover a sample, not the full population.
Contracts define terms that change the economics but do not produce telemetry.
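A toy version of the join shows where each difficulty bites. The sketch below is a simplification under stated assumptions: the rows and field names are hypothetical, the trace-to-invoice gap is spread pro rata across requests (one possible allocation, not the only one), and only inference cost is included, so the per-unit figure is not a full LCPR.

```python
from collections import defaultdict

# Hypothetical rows; field names follow no particular tracing tool.
traces = [
    {"request_id": "r1", "workload_id": "support", "cost_usd": 0.012},
    {"request_id": "r2", "workload_id": "support", "cost_usd": 0.015},
    {"request_id": "r3", "workload_id": "extract", "cost_usd": 0.040},
    {"request_id": "r4", "workload_id": "support", "cost_usd": 0.011},
]
evals = {"r1": True, "r2": False, "r3": True}            # r4 was never evaluated
invoice_total_usd = 0.082                                # charge for the same period
revenue_per_unit = {"support": 0.05, "extract": 0.10}    # from the contract


def join_sources(traces, evals, invoice_total_usd, revenue_per_unit):
    """Join trace, eval, invoice, and contract into per-workload cost and margin."""
    trace_total = sum(t["cost_usd"] for t in traces)
    scale = invoice_total_usd / trace_total   # spread the invoice gap pro rata
    cost = defaultdict(float)
    accepted = defaultdict(int)
    unevaluated = defaultdict(int)
    for t in traces:
        w = t["workload_id"]
        cost[w] += t["cost_usd"] * scale
        result = evals.get(t["request_id"])
        if result is None:
            unevaluated[w] += 1               # sampled evals leave holes
        elif result:
            accepted[w] += 1
    report = {}
    for w in cost:
        unit_cost = cost[w] / accepted[w] if accepted[w] else None
        margin = None
        if unit_cost is not None:
            margin = 1 - unit_cost / revenue_per_unit[w]
        report[w] = {"cost_per_accepted": unit_cost, "margin": margin,
                     "unevaluated": unevaluated[w]}
    return report


report = join_sources(traces, evals, invoice_total_usd, revenue_per_unit)
```

Unevaluated requests are counted rather than silently treated as accepted or failed, so the size of the eval coverage gap stays visible in the output.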
The join produces a number. But every cost model depends on inputs, and not all inputs are equally trustworthy. A team builds a cost model, puts LCPR = $0.14 and margin = 45% into a quarterly review deck, and six months later margin has dropped to 28%, because three assumptions were wrong from the start: cache hit rate was 35-40% not 60%, output length p50 was 340 tokens not 250, and human escalation rate was 5.5% not 3%. Nobody tracked these assumptions. Nobody owned them. Nobody checked them.
The Assumption Register
Every LCPR calculation depends on inputs. Some inputs are measured. Some are assumed. The assumption register makes the distinction explicit.
| Input | Value | Confidence | Source | Owner | Last verified | Refresh trigger |
|---|---|---|---|---|---|---|
| Input token price (Route A) | $3.00/MTok | PUBLIC | Provider pricing page | Infra | 2026-05-12 | Provider price change |
| Cache hit rate | 60% | ASSUMED | Design target | ML Eng | Never measured | Monthly trace review |
| Output length p50 | 250 tok | ASSUMED | Prompt test (n=50) | ML Eng | 2026-04-01 | Monthly trace review |
| Eval pass rate | 85% | MEASURED | Eval pipeline output | ML Eng | 2026-05-10 | Weekly |
| Human escalation rate | 3% | ASSUMED | Industry estimate | Support | Never measured | Monthly |
| Ops overhead/day | $25 | ESTIMATED | Eng time allocation | Eng Mgr | 2026-03-15 | Quarterly |
The columns that matter most:
Confidence: MEASURED (from production telemetry), PUBLIC (from official source), CONTRACTED (from legal agreement), ESTIMATED (calculated from limited data), ASSUMED (no direct measurement), UNKNOWN (not yet investigated).
Last verified: The date someone checked this input against reality. “Never measured” is a valid and important entry.
Refresh trigger: What event should cause this input to be re-checked.
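A register like this can live in code as easily as in a spreadsheet. The sketch below is one possible shape, not a prescribed schema: `RegisterEntry` and its fields are hypothetical, and `max_age_days` is a simplified stand-in for the refresh trigger, since event-based triggers such as a provider price change cannot be expressed as an age.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

CONFIDENCE_LEVELS = ("MEASURED", "PUBLIC", "CONTRACTED", "ESTIMATED", "ASSUMED", "UNKNOWN")


@dataclass
class RegisterEntry:
    name: str
    value: str
    confidence: str
    owner: str
    last_verified: Optional[date]   # None means "Never measured"
    max_age_days: int               # simplified stand-in for the refresh trigger

    def __post_init__(self):
        if self.confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"unknown confidence level: {self.confidence}")

    def is_stale(self, today: date) -> bool:
        """Never-verified inputs are always stale; others expire after max_age_days."""
        if self.last_verified is None:
            return True
        return (today - self.last_verified).days > self.max_age_days


register = [
    RegisterEntry("Input token price (Route A)", "$3.00/MTok", "PUBLIC", "Infra", date(2026, 5, 12), 90),
    RegisterEntry("Cache hit rate", "60%", "ASSUMED", "ML Eng", None, 30),
    RegisterEntry("Eval pass rate", "85%", "MEASURED", "ML Eng", date(2026, 5, 10), 7),
]
today = date(2026, 5, 15)
stale = [entry.name for entry in register if entry.is_stale(today)]
```

Treating "Never measured" as always stale keeps unverified inputs at the top of the re-check list instead of letting them age quietly.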
Confidence Levels
Not all inputs are equally trustworthy. Confidence labels prevent a cost model from treating a pricing-page number and an industry-average guess with equal weight.
| Level | Meaning | Example | Risk |
|---|---|---|---|
| MEASURED | Observed from production telemetry, reproducible | Cache hit rate from 30 days of traces | Low if the measurement is correct |
| PUBLIC | Stated by provider in official documentation | Token price from pricing page | Low until the provider changes it |
| CONTRACTED | In a signed agreement | Committed spend, SLA penalties | Low unless the contract is renegotiated |
| ESTIMATED | Derived from partial data or short observation | Output length from a 50-request sample | Medium: sample may not represent production |
| ASSUMED | No direct measurement, based on belief or analogy | Human escalation rate from industry report | High: assumption may be wrong from day one |
| UNKNOWN | Not yet investigated | Cache-breaking event rate | Critical: the input could be anything |
A cost model with three ASSUMED inputs and two UNKNOWN inputs is not wrong. It is incomplete. The assumption register makes the incompleteness visible so the team knows where to invest in measurement.
Data-Quality Conditions
Before a cost model is used for a decision (migration, pricing change, capacity reservation, vendor negotiation), five conditions should hold.
Coverage. Trace coverage above 95%. Missing traces are not random; they correlate with timeouts, errors, and edge cases, so a 90%-coverage trace systematically under-reports the cost of the tail.
Freshness. Pricing snapshot, cache hit rate, and eval pass rate all dated within their refresh trigger. Stale inputs flow through unchanged and silently lie.
Reconciliation. Trace-derived cost within 5% of invoice for the same period. If the gap is wider, the trace is not trustworthy for cost modeling until the gap is explained.
Denominator. The accepted work unit defined and the quality gate running. Without an eval, the denominator is "requests" rather than "accepted work" and LCPR is understated.
Confidence floor. No more than two key inputs in ASSUMED or UNKNOWN status. Above that threshold, the model is a hypothesis. Label it that way in the deck and use it for scoping, not for commitments.
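The five conditions can be run as a pre-decision check. The sketch below is illustrative: `ModelHealth` and its fields are hypothetical, the coverage and reconciliation thresholds are the ones stated above, and the confidence-floor limit is passed in explicitly because it is a team policy rather than a constant. The example uses the trace and invoice totals from this chapter's Field Problem.

```python
from dataclasses import dataclass


@dataclass
class ModelHealth:
    trace_coverage: float        # fraction of calls with a trace, 0..1
    stale_inputs: int            # inputs past their refresh trigger
    trace_cost_usd: float
    invoice_cost_usd: float
    quality_gate_running: bool
    weak_inputs: int             # key inputs labeled ASSUMED or UNKNOWN


def failed_conditions(h: ModelHealth, max_weak_inputs: int) -> list[str]:
    """Return the names of the data-quality conditions that do not hold."""
    failures = []
    if h.trace_coverage <= 0.95:
        failures.append("coverage")
    if h.stale_inputs > 0:
        failures.append("freshness")
    gap = abs(h.invoice_cost_usd - h.trace_cost_usd) / h.invoice_cost_usd
    if gap > 0.05:
        failures.append("reconciliation")
    if not h.quality_gate_running:
        failures.append("denominator")
    if h.weak_inputs > max_weak_inputs:
        failures.append("confidence_floor")
    return failures


# Trace aggregator $13,900 vs invoice $14,850: a gap of about 6.4%.
health = ModelHealth(0.97, 0, 13_900.0, 14_850.0, True, 3)
problems = failed_conditions(health, max_weak_inputs=2)
```

An empty list means the model can back a commitment; anything else names the specific measurement to fix first.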
The Unknown Bucket
Every workload has traffic that cannot be attributed, classified, or evaluated. Requests with missing workload IDs. Calls from deprecated integrations. Test traffic mixed into production. Retry chains where the original request trace was lost.
The honest response is not to ignore this traffic. It is to create an unknown bucket and measure its size.
| Metric | Value | Action threshold |
|---|---|---|
| Requests with missing workload_id | 4% | Investigate if > 5% |
| Requests with missing eval result | 12% | Investigate if > 10% |
| Unattributed cost (trace vs invoice gap) | 3% | Investigate if > 5% |
| Traffic from deprecated routes | 1% | Clean up |
If the unknown bucket is small (under 5% of cost), it is noise. If it is large (over 10%), it is corrupting the cost model. Unknown traffic that correlates with high-cost or low-quality behavior is especially dangerous because it hides the workload’s real economics.
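Measuring the bucket takes only a few lines once traces are exported. The sketch below assumes each trace is a dict with hypothetical `workload_id`, `route`, `eval_result`, and `cost_usd` fields; it reports request shares plus the share of cost carried by unattributed requests, which is the number that indicates whether the bucket is noise or distortion.

```python
def unknown_bucket(traces: list[dict], known_routes: set[str]) -> dict[str, float]:
    """Share of requests and of cost that cannot be attributed or evaluated."""
    if not traces:
        raise ValueError("need at least one trace")
    n = len(traces)
    total_cost = sum(t["cost_usd"] for t in traces)
    no_workload = [t for t in traces if not t.get("workload_id")]
    no_eval = [t for t in traces if t.get("eval_result") is None]
    off_route = [t for t in traces if t.get("route") not in known_routes]
    return {
        "missing_workload_id": len(no_workload) / n,
        "missing_eval_result": len(no_eval) / n,
        "deprecated_route": len(off_route) / n,
        "missing_workload_cost_share": sum(t["cost_usd"] for t in no_workload) / total_cost,
    }


sample_traces = [
    {"workload_id": "support", "route": "v2", "eval_result": True, "cost_usd": 0.01},
    {"workload_id": None, "route": "v2", "eval_result": None, "cost_usd": 0.03},
    {"workload_id": "extract", "route": "v1", "eval_result": False, "cost_usd": 0.02},
    {"workload_id": "support", "route": "v2", "eval_result": None, "cost_usd": 0.04},
]
bucket = unknown_bucket(sample_traces, known_routes={"v2"})
```

In this toy sample a quarter of requests lack a workload ID but they carry about 30% of cost, the pattern the paragraph above warns about.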
The First Twenty Traces
When a team has no cost model, no eval, and no trace-to-invoice reconciliation, the starting point is not a dashboard or a calculator. It is twenty traces.
Pull twenty real requests from production. For each request, answer:
What workload is this? Can you name it?
What was the input? How many tokens? Was any of it cached?
What was the output? How many tokens? Did it pass quality?
What did it cost? Can you calculate the cost from the pricing snapshot?
Did it meet the latency SLO?
If it failed, what happened next? Retry? Repair? Escalation? Nothing?
Twenty traces will reveal:
Whether you can attribute requests to workloads
Whether your token counts match the provider’s
Whether your cache is hitting
Whether your quality gate is running
Whether retries and repairs are visible in the trace
If you cannot answer these questions for twenty requests, you cannot build a cost model. The twenty-trace exercise is the minimum viable data quality gate.
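The exercise is meant to be done by reading traces, but a small script can draw the sample and show which questions the data can even support. The sketch below only checks field presence; the field names in `QUESTION_FIELDS` are hypothetical and should be replaced with whatever your trace schema uses.

```python
import random

# One field (hypothetical names) per question in the exercise.
QUESTION_FIELDS = {
    "workload": "workload_id",
    "input": "input_tokens",
    "output": "output_tokens",
    "quality": "eval_result",
    "cost": "cost_usd",
    "latency": "e2e_ms",
    "next_step": "followup",   # retry, repair, escalation, or "none"
}


def twenty_trace_audit(traces: list[dict], sample_size: int = 20,
                       seed: int = 0) -> dict[str, float]:
    """Fraction of sampled traces for which each question can be answered."""
    if not traces:
        raise ValueError("need at least one trace")
    rng = random.Random(seed)  # fixed seed so the same sample can be re-read later
    sample = rng.sample(traces, min(sample_size, len(traces)))
    return {
        question: sum(t.get(field) is not None for t in sample) / len(sample)
        for question, field in QUESTION_FIELDS.items()
    }
```

A question that scores well below 1.0 marks an instrumentation gap to close before any cost model is attempted; a score of 1.0 only means the field exists, not that its value is right.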
What To Measure
For each workload, the minimum viable join requires:
| Field | Source | Purpose |
|---|---|---|
| request_id | Trace | Link requests to outcomes |
| workload_id | Trace | Attribute cost to workload |
| model + provider | Trace + Invoice | Reconcile model aliases |
| input_tokens, output_tokens, cached_tokens | Trace | Calculate trace-derived cost |
| ttft_ms, e2e_ms | Trace | SLO compliance |
| eval_result | Eval | Accepted vs failed |
| invoice_line_item | Invoice | Actual charge |
| revenue_per_unit | Contract | Margin calculation |
| credit_balance | Contract | True cost after credits |
Build the four-source join for one workload first. Pick the workload with the highest spend or the most quality variance. Join one day of traces to the corresponding invoice period. Calculate LCPR. Measure the delta between trace-derived cost and invoice. If the delta is under 5%, the trace is trustworthy for daily monitoring. If it exceeds 5%, investigate before relying on trace-based cost reporting.
Before using the cost model for a decision (migration, pricing change, capacity reservation, vendor negotiation), check the assumption register. If more than two key inputs are ASSUMED or UNKNOWN, the model is directional, not reliable. Label it as such in the deck. Use it for scoping, not for commitments.
Upgrade assumptions to measurements by investing in trace coverage, eval pipeline, and invoice reconciliation, in that order. Trace coverage is first because without traces, you cannot measure anything else.
The four-source join is harder in several situations:
Multi-provider architectures where traces span multiple providers and invoices come from separate billing systems.
Real-time streaming where token counts are approximate until the stream completes.
Shared tenancy where one provider account serves multiple internal workloads and the invoice does not break down by workload.
Rapidly changing prices where the pricing snapshot used for trace-derived cost is stale by the time the invoice arrives.
Startup-stage teams that do not have enough volume for statistical confidence. At 50 requests per day, weekly metrics are noisy. Monthly aggregation may be the minimum useful window.
Multi-model workflows where a single task uses multiple models in sequence (retrieval, generation, grading, repair). The trace needs to link these into a single task lifecycle, which requires trace correlation.
Rapidly changing workloads where the distribution shifts faster than the refresh cycle. A cache hit rate measured last month may be wrong this week if the retrieval context or prompt structure changed.
Organizational silos where traces live in engineering, invoices live in finance, evals live in ML, and contracts live in legal. The join requires cross-functional access that many organizations do not have.
Calculator Hook
The calculator view performs the four-source join. Input: trace export, invoice CSV, eval results, contract terms. Output: LCPR, margin, delta analysis, variance waterfall showing which input changed and how much it moved the result.
The calculator includes an assumption register tab. Each input has a confidence field. The sensitivity analysis highlights which ASSUMED inputs would change the LCPR by more than 10% if the real value differs from the assumption. Those are the inputs worth measuring first.
Part 1 Summary
Part 1 established three ideas:
Token price is not cost. The provider bill is one component. Retries, repairs, eval, human escalation, and operational overhead are the rest. On quality-sensitive workloads, non-token costs can dominate.
LCPR is the metric. Loaded cost per result, normalized to accepted work units. The formula is simple. The difficulty is in the inputs: trace coverage, eval calibration, invoice reconciliation, and overhead allocation.
Four sources, one join, and a confidence gate. Traces provide attribution. Invoices provide ground truth. Evals provide the denominator. Contracts provide the revenue and constraints. No single source is sufficient. Every input to the cost model has a confidence level. ASSUMED inputs are hypotheses, not measurements. The twenty-trace exercise is the minimum viable starting point.
The next part is Serving Physics: the mechanisms behind the numbers. Output tokens cost more than input tokens; batch size sets the cost-latency frontier; cache hit rate is an economic variable; context length is a memory product. The economics in Part 1 depend on the physics in Part 2.
Evidence Notes for Part 1
| Claim | Type | Source |
|---|---|---|
| Provider billing grammar structure | Public | Provider pricing pages and API documentation |
| LCPR formula and components | Derived | Constructed from standard cost accounting applied to inference |
| Worked example (insurance carrier claims-adjuster guidance, LCPR roughly $0.163) | Anonymized | Anonymized scenario; numbers fabricated but representative of the pattern |
| Human review queue cost | Illustrative | Order of magnitude; actual cost depends on review staffing model |
| Cache hit rate degradation pattern | Reported | Shaped by production observations of prefix-cache eviction under multi-tenant load |
| Invoice-trace delta patterns | Reported | Common reconciliation issues from production deployments |
| Confidence level taxonomy | Derived | Standard data quality framework applied to inference inputs |
| Output token length variance across models | Public | Observable from model documentation and evaluation |
The Chapter 2 worked example uses an anonymized scenario; numbers are fabricated but representative of the pattern. Other examples use synthetic numbers shaped by real provider semantics. No numbers should be attributed to any specific employer, customer, or deployment.