The Economic Unit

Chapters 1–3

Chapter 1: Token Price Is Not Cost

Field Problem

Three inference workloads share one provider account: support chat, document extraction, and a coding agent. The provider bills by input tokens, output tokens, cached input tokens, and cache write tokens. Finance reports cost per million tokens. Engineering reports cost per request. Product reports cost per feature.

None of these numbers agree. None of them answer the question that matters: what does it cost to produce one unit of accepted work?

The Billing Grammar

Every provider invoice speaks a billing grammar. The components vary, but the structure is consistent:

| Component | What it covers | Who charges it |
| --- | --- | --- |
| Input tokens | Tokens sent to the model (prompt, system, tools, context) | All major API providers |
| Output tokens | Tokens generated by the model | All major API providers |
| Cached input tokens | Input tokens served from prefix cache instead of recomputed | Most providers (discount varies) |
| Cache write tokens | Input tokens written to cache (can cost more than uncached input) | Anthropic, Google |
| Cache storage | Hourly/daily rent for cached state | Google (explicit), others (implicit) |
| Batch tokens | Tokens processed in asynchronous batch mode (typically 50% discount) | OpenAI, Anthropic, Google |
| Tool use | Additional tokens for tool definitions and results | Some providers (implicit in token count) |
| Image/audio tokens | Tokens derived from non-text input | Multimodal providers |
| Fine-tuning | Training cost for custom model weights | Providers offering fine-tuning |
| Dedicated capacity | Hourly/monthly fee for reserved compute | AWS, Azure, Baseten, CoreWeave, DeepInfra, Fireworks, Google, Lambda, Replicate, Together |
| Minimum billing | Minimum input tokens per request (e.g., 128 or 1,024) | Some providers |

The billing grammar is the first thing to understand because it defines what the provider measures. But what the provider measures is not what the product needs to measure.
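As a minimal sketch of how the grammar turns into a per-request charge, the function below prices one call from its token counts by type. The prices in `PRICES_PER_MTOK`, the `Usage` field names, and the assumption that `input_tokens` counts only uncached, non-write input are all hypothetical; providers differ on how they report these counts, so check the usage object you actually receive.

```python
from dataclasses import dataclass

# Hypothetical USD prices per million tokens; replace with a dated pricing snapshot.
PRICES_PER_MTOK = {
    "input": 3.00,
    "output": 15.00,
    "cached_input": 0.30,
    "cache_write": 3.75,
}

@dataclass
class Usage:
    input_tokens: int = 0         # uncached input tokens (assumed convention)
    output_tokens: int = 0
    cached_input_tokens: int = 0  # input served from prefix cache
    cache_write_tokens: int = 0   # input written to cache

def request_cost(usage: Usage, prices: dict = PRICES_PER_MTOK,
                 batch_discount: float = 0.0) -> float:
    """Return the USD charge for one call under the assumed price table."""
    raw = (
        usage.input_tokens * prices["input"]
        + usage.output_tokens * prices["output"]
        + usage.cached_input_tokens * prices["cached_input"]
        + usage.cache_write_tokens * prices["cache_write"]
    ) / 1_000_000
    return raw * (1.0 - batch_discount)

u = Usage(input_tokens=2_000, output_tokens=500, cached_input_tokens=8_000)
print(f"{request_cost(u):.4f}")                      # interactive price
print(f"{request_cost(u, batch_discount=0.5):.5f}")  # same call in batch mode
```

The sketch leaves out cache storage rent, minimum billing, and dedicated capacity, which are billed per hour or per request rather than per token.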

What The Provider Measures vs What You Need

The provider measures token throughput and bills accordingly. Your product measures task success. The gap between these two measurements is where cost models go wrong.

| Provider sees | Product needs |
| --- | --- |
| Input tokens consumed | Relevant input tokens (how much was retrieval waste?) |
| Output tokens generated | Accepted output tokens (did the answer pass quality?) |
| Cache hits | Cache hits on stable prefixes (were the hits meaningful?) |
| Requests served | Requests that completed within SLO |
| Uptime | Reliability at the tail (p99, not mean) |
| Total usage | Usage attributable to a workload, customer, or account |

A provider can report 100% uptime while your workload experiences p99 latency violations, cache miss storms, and quality regressions. The metrics live in different systems and answer different questions.

What Is Not On The Invoice

Token price captures one line of the loaded-cost statement behind LCPR (loaded cost per result, defined in Chapter 2): the provider invoice. Five other lines are typically larger combined.

  • Retries add 4-8% on most workloads, more on workloads with strict structured-output gates. The retry is invisible if you divide total spend by total requests rather than by accepted work.

  • Eval and grader calls are themselves inference, sized to the eval cadence and the grader model. A quality-sensitive workload running LLM-graded eval on 20% of traffic with a same-class grader runs roughly 5-15% of primary inference in grader cost. Deterministic checks and human review run a different cost surface but are still off the inference invoice.

  • Repairs (retry with corrected input: better context, longer window, adjusted prompt) concentrate on the failure tail. 5-12% of traffic is typical for RAG workloads where retrieval quality is uneven.

  • Human escalation usually dominates the loaded cost on quality-sensitive interactive workloads. Per-escalation cost runs 50-300x the per-request inference cost, depending on the support staffing model.

  • Operational overhead covers on-call, deployment, prompt and eval upkeep. It is a fixed cost amortized into per-request economics only if you allocate it. Most teams don't.

A workload where retries plus eval plus repair plus escalation plus ops are under 20% of inference is unusual. A workload where they are under 50% is the case where token price is a useful proxy for cost. The rest of the book is about the workloads where they are not.
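A rough arithmetic sketch shows why escalation tends to dominate. It expresses each line beyond first-attempt inference as a multiple of primary inference spend. The retry, grader, and repair rates are midpoints of the ranges quoted above; the 3% escalation rate and the assumption that a retry or repair costs about one extra request are illustrative, not measured. Ops overhead is left out because it is a fixed cost rather than a per-request rate.

```python
def overhead_ratio(retry_rate, grader_share, repair_rate,
                   escalation_rate, escalation_multiple):
    """Each extra cost line as a multiple of primary inference spend.

    Assumes each retry or repair costs about one extra request, and each
    escalation costs `escalation_multiple` times one request's inference cost.
    """
    return {
        "retries": retry_rate,
        "eval": grader_share,
        "repairs": repair_rate,
        "escalation": escalation_rate * escalation_multiple,
    }

# Midpoints of the quoted ranges; the 3% escalation rate is hypothetical,
# and 50x is the low end of the 50-300x range.
lines = overhead_ratio(
    retry_rate=0.06, grader_share=0.10, repair_rate=0.085,
    escalation_rate=0.03, escalation_multiple=50,
)
total = sum(lines.values())
for name, ratio in lines.items():
    print(f"{name:<11}{ratio:6.1%} of inference")
print(f"{'total':<11}{total:6.1%} of inference")
```

Under these assumptions escalation alone costs 1.5 times primary inference, even at the low end of the per-escalation range, while the three inference-shaped lines together stay under 25%.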

The Unit Problem

The most common mistake in inference economics is choosing the wrong denominator.

| Denominator | What it hides |
| --- | --- |
| Per million tokens | Whether those tokens produced useful work |
| Per request | Whether the request succeeded |
| Per API call | Retries, repairs, and multi-step workflows |
| Per user | Variance across user behavior and workload mix |
| Per GPU-hour | Whether the GPU produced accepted work or wasted cycles |

The correct denominator depends on the workload. For a support answer system, it is “one accepted answer that passed quality, latency, and compliance gates.” For a coding agent, it is “one merged fix that passes tests and review.” For a document extraction pipeline, it is “one correctly extracted record set.”

The denominator must be defined before the cost model is built. If you cannot define what an accepted work unit is for your workload, your cost model is measuring activity, not value.
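To make the denominator effect concrete, the sketch below divides the same spend three ways. The call records are synthetic and the tuple layout `(task_id, cost_usd, accepted)` is a hypothetical stand-in for whatever your trace schema provides.

```python
# Hypothetical call records: (task_id, cost_usd, accepted)
calls = [
    ("t1", 0.010, True),
    ("t2", 0.012, False),  # first attempt failed the quality gate
    ("t2", 0.015, True),   # retry of t2, accepted
    ("t3", 0.011, False),  # never accepted
    ("t4", 0.009, True),
]

total_cost = sum(cost for _, cost, _ in calls)
tasks = {task for task, _, _ in calls}
accepted_tasks = {task for task, _, ok in calls if ok}

per_call = total_cost / len(calls)               # hides retries
per_task = total_cost / len(tasks)               # hides failures
per_accepted = total_cost / len(accepted_tasks)  # cost of accepted work

print(f"per API call:      ${per_call:.4f}")
print(f"per task:          ${per_task:.4f}")
print(f"per accepted task: ${per_accepted:.4f}")
```

The spend is identical in all three lines; only the denominator changes, and the per-accepted figure is the highest because it absorbs the failed attempt and the retry.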

Decision rule

Do not compare inference options by token price. Compare by loaded cost per accepted work unit.

Token price is one input to the cost model. It is not the cost model.

Where this breaks
  • Very simple workloads (single-turn, low-stakes, no quality gate) where token cost genuinely dominates and retries are rare. For these workloads, token price is a reasonable proxy.

  • Batch workloads where the 50% batch discount is the dominant economic lever and quality is checked downstream.

  • Workloads where the provider bill is a rounding error relative to the product’s overall cost structure. If inference is 0.1% of operating cost, the loaded-cost exercise is not the binding constraint.

Calculator Hook

Every calculator view in this manual normalizes to accepted work units, not raw tokens. The view takes raw traces and produces loaded cost per accepted request. The sensitivity analysis shows which input dominates the result.

If you do not yet have the trace coverage to drive the calculator, jump to the twenty-trace exercise in Chapter 3. It is the minimum viable starting point and the prerequisite for everything else in Part 1.


Chapter 2: LCPR — Loaded Cost Per Result

Field Problem

Two teams at the same company both report “inference cost per request.” Team A divides the monthly provider bill by total API calls. Team B divides total loaded cost (inference + eval + human review + ops) by accepted work units. Team A reports $0.014. Team B reports $0.172 for a similar workload. Finance asks which number is correct.

Both are calculated correctly. They answer different questions. Team A’s number tells you the average inference spend per API call. Team B’s number tells you what it costs to produce one unit of accepted work. Only Team B’s number is useful for margin analysis, pricing decisions, migration comparisons, or capacity planning.

The Formula

LCPR = (C_inference + C_eval + C_human + C_ops + delta) / A

| Symbol | Unit | Meaning |
| --- | --- | --- |
| C_inference | USD | Trace-derived inference cost (all calls: first attempt + retry + repair) |
| C_eval | USD | Eval grader cost (LLM grader calls, rubric checks) |
| C_human | USD | Human review, escalation, and repair labor |
| C_ops | USD | Amortized operational overhead allocated to this workload |
| delta | USD | Discrepancy between trace-derived cost and provider invoice |
| A | count | Accepted work units (tasks passing quality, latency, and compliance gates) |

The numerator includes all costs incurred in producing the output, including the cost of failed attempts. The denominator includes only accepted output.

This is a definition, not a simplification. LCPR is what it costs to produce accepted work. The alternative is dividing inference spend by total requests, which produces a number that is accurate but not useful. Pricing snapshots, provider semantics, and cache behavior change. The formula is durable; the inputs are dated. See the assumption register in Chapter 3.

Worked Example

Anonymized: a mid-market property and casualty insurance carrier runs an internal-only model that drafts claims-adjuster guidance answers. Adjusters open a case, ask a natural-language question ("does this water-damage claim fall under the burst-pipe rider on policy form HF-3?"), and the model produces a draft response with citations to the underlying policy language. Adjusters edit and approve before any text reaches the policyholder. Roughly 11,800 queries per weekday. Three-gate validation pipeline: PII redaction, policy-rubric eval, and a compliance-regex pass that flags any draft containing boilerplate language tied to coverage decisions the model is not authorized to make.

In a six-week window the inference bill climbed 31% while case volume grew 7%. Quality pass rate was flat. Latency was stable. The team's first guess was traffic-mix shift toward longer-context cases, then a provider price change on the underlying model. Both were wrong. The trace-by-rejection-reason breakdown was the disambiguating measurement: compliance-regex rejections had climbed from 6.3% of traffic in the prior quarter to 22% in the current one. A risk-management team had widened the boilerplate regex to flag any draft mentioning "covered," "not covered," "subject to," or "pursuant to" outside of a citation block. Most flagged drafts were false positives. Each rejection paid the full input cost again on the regenerate cycle.

The denominator collapse showed up two weeks later. Adjusters with experience on the prior tool noticed that AI drafts were coming back to the queue rejected far more often than before. They started routing around the tool: typing answers manually, skipping the suggestion panel entirely. A traffic audit found that once the false-positive rate exceeded 18%, daily accepted drafts dropped from roughly 9,400 to roughly 5,100. The numerator (gross inference spend) was up 31%. The denominator (accepted drafts) had collapsed to 54% of its prior level. LCPR roughly doubled, and most of the increase came from the denominator side.
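As a rough check on that decomposition, the sketch below splits the change in inference spend per accepted draft into a numerator part and a denominator part. It uses only the figures quoted above and considers inference spend alone; how the full loaded LCPR moved also depends on the eval, human, and ops lines, whose prior-period values the example does not give. Log shares are one common way to split a multiplicative change, chosen here as an assumption rather than taken from the source.

```python
import math

numerator_ratio = 1.31             # gross inference spend, up 31%
denominator_ratio = 5_100 / 9_400  # daily accepted drafts, after / before

# Inference spend per accepted draft, after / before.
cost_per_accepted_ratio = numerator_ratio / denominator_ratio

# Log shares turn the multiplicative change into additive contributions.
log_total = math.log(cost_per_accepted_ratio)
numerator_share = math.log(numerator_ratio) / log_total
denominator_share = -math.log(denominator_ratio) / log_total

print(f"denominator ratio: {denominator_ratio:.2f}")
print(f"inference spend per accepted draft: {cost_per_accepted_ratio:.2f}x")
print(f"from numerator:   {numerator_share:.0%}")
print(f"from denominator: {denominator_share:.0%}")
```

On inference spend alone the per-draft figure rises by about 2.4x, with roughly two thirds of the change coming from the collapse in accepted drafts.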

Computing LCPR over a representative week, with the numerator including all failed regenerate-cycles and the denominator counting only drafts that cleared all three gates and reached an adjuster who accepted them:

  • C_inference (trace-derived, all attempts): roughly $2,840

  • C_eval (policy-rubric grader pass): roughly $173

  • C_human (review queue allocated to drafts that bypassed adjuster acceptance): roughly $1,290

  • C_ops (pipeline maintenance, allocated): roughly $164

  • A (accepted drafts that passed all three gates and reached an adjuster who accepted them): roughly 27,300 over the week

  • LCPR: roughly $0.163 per accepted draft
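The arithmetic behind these figures can be reproduced in a few lines. This is a minimal sketch: the function name `lcpr` and its parameter names are hypothetical, and `delta` is left at zero because the example does not report a trace-to-invoice discrepancy.

```python
def lcpr(c_inference, c_eval, c_human, c_ops, accepted, delta=0.0):
    """Loaded cost per accepted work unit, in USD."""
    if accepted <= 0:
        raise ValueError("accepted must be positive; define the accepted work unit first")
    numerator = c_inference + c_eval + c_human + c_ops + delta
    return numerator / accepted

# Weekly figures from the worked example (USD).
week = {"c_inference": 2_840, "c_eval": 173, "c_human": 1_290, "c_ops": 164}
accepted = 27_300

value = lcpr(**week, accepted=accepted)
numerator = sum(week.values())
shares = {name: amount / numerator for name, amount in week.items()}

print(f"numerator = ${numerator:,}")
print(f"LCPR = ${value:.4f} per accepted draft")
for name, share in shares.items():
    print(f"{name:<12}{share:6.1%}")
```

Dividing by accepted drafts rather than by the roughly 11,800 daily queries is what separates this figure from a per-request average.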

What Each Component Contributes

| Component | Amount | % of LCPR |
| --- | --- | --- |
| Inference (trace) | $2,840 | 63.6% |
| Eval grader | $173 | 3.9% |
| Human review queue | $1,290 | 28.9% |
| Ops overhead | $164 | 3.7% |
| Total numerator | $4,467 | 100% |

The team's first reflex on seeing a 31% inference-bill jump was to negotiate a token-price discount. The breakdown pointed to a different move: a deterministic pre-validation pass that flagged drafts likely to fail the regex and routed them straight to a manual-edit path instead of regenerating them (9 ms per draft, zero LLM tokens). After deployment, gross inference cost dropped 17% in two weeks. The accepted-draft count began recovering as adjusters returned to the suggestion panel; within four weeks A had climbed back toward its prior level.

The side-finding: the deterministic pre-validation surfaced a class of drafts that passed the regex but contained coverage-determination language that the original regex authors had not anticipated. Roughly 3.4% of historically accepted drafts were re-classified as needing risk-management review. The compliance team had been treating the regex as the gate. It was missing a category of language entirely. The remediation took two more quarters of regex and rubric work and was a much larger project than the cost overrun that started the investigation.

The LCPR pointed at the wrong lever first (token price). The numerator-and-denominator breakdown pointed at the right one (the deterministic gate and adjuster trust). The gate itself turned out to be incomplete in the other direction. Inference is 64% of LCPR on this workload, but the lever that moved LCPR most was not the inference cost. It was the false-positive rate driving the denominator.

The Accepted Work Unit

The denominator deserves its own definition. An accepted work unit is:

One task completion that passes the workload’s quality gate, meets its latency SLO, complies with its data-handling constraints, and does not require human intervention.

Defining the accepted work unit requires answers to four questions:

  1. What is the task? Not “an API call” or “a request.” A task is the smallest unit of work that produces value: an accepted support answer, a correctly extracted record, a merged code fix, a graded exam response.

  2. What is the quality gate? An eval score threshold, a deterministic check, a human review pass, or a downstream system acceptance. Without a quality gate, every completion counts as accepted, and the denominator is meaningless.

  3. What is the latency SLO? Interactive workloads have TTFT and E2E constraints. Batch workloads have completion-window constraints. If a response is correct but arrives after the deadline, it may not count as accepted work.

  4. What are the compliance constraints? Data residency, content logging restrictions, PII handling, retention policy. A response generated on the wrong infrastructure or stored in violation of policy is not accepted work regardless of quality.

If any of these four questions cannot be answered for a workload, the LCPR calculation will produce a number, but the number will not be useful for decisions. Define the accepted work unit before building the cost model.
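The four questions translate into a single predicate over a task outcome. The sketch below is one possible encoding; the `TaskOutcome` and `Gates` types, their field names, and the threshold values are hypothetical and would come from your own eval pipeline and SLO definitions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskOutcome:
    task_id: str
    eval_score: float       # output of the quality gate
    e2e_ms: int             # end-to-end latency
    compliant: bool         # data-handling constraints satisfied
    human_intervened: bool  # escalation or manual repair occurred

@dataclass(frozen=True)
class Gates:
    min_eval_score: float
    max_e2e_ms: int

def is_accepted(o: TaskOutcome, g: Gates) -> bool:
    """True only if the task clears every gate without human intervention."""
    return (
        o.eval_score >= g.min_eval_score
        and o.e2e_ms <= g.max_e2e_ms
        and o.compliant
        and not o.human_intervened
    )

gates = Gates(min_eval_score=0.8, max_e2e_ms=3_000)  # hypothetical thresholds
outcomes = [
    TaskOutcome("a", 0.90, 1_200, True, False),   # accepted
    TaskOutcome("b", 0.60, 1_000, True, False),   # fails quality
    TaskOutcome("c", 0.95, 9_000, True, False),   # misses latency SLO
    TaskOutcome("d", 0.92, 1_500, False, False),  # compliance violation
    TaskOutcome("e", 0.90, 1_000, True, True),    # needed a human
]
accepted_count = sum(is_accepted(o, gates) for o in outcomes)
print(f"A = {accepted_count} of {len(outcomes)} tasks")
```

Keeping the predicate in one place means the denominator cannot drift between dashboards that each apply a different subset of the gates.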

Decision rule

Report inference economics as LCPR per accepted work unit. Show the component breakdown. Make the dominant cost lever visible.

If inference dominates LCPR, optimize serving. If human escalation dominates, optimize quality. If ops overhead dominates, automate or consolidate. The lever depends on the breakdown, not on the token price.

Where this breaks
  • Shared infrastructure. When multiple workloads share a dedicated endpoint, allocating C_ops and the GPU cost to individual workloads requires an allocation model. Allocation models are always wrong; the question is whether they are useful.

  • Amortized costs. Prompt engineering, fine-tuning, eval set construction, and model evaluation are upfront investments. Amortizing them into LCPR requires a time horizon and volume assumption.

  • Revenue attribution. LCPR tells you cost per accepted unit. Margin requires revenue per accepted unit. Revenue attribution for AI features embedded in larger products is often ambiguous.

  • Very low volume. At 10 requests per day, the ops overhead per request is large and unstable. LCPR is most useful when volume is high enough for the per-unit allocation to be meaningful.

Calculator Hook

The calculator view takes as input: trace data, invoice data, eval results, human review costs, and ops allocation. It produces: LCPR per accepted unit, component breakdown, sensitivity to each input, and trend over time.


Chapter 3: Trace, Invoice, Eval, and Contract

Field Problem

An engineering lead asks: “What is our inference cost?” Four people give four answers:

  • The engineer checks the provider dashboard: $14,200 per month.

  • The finance team checks the invoice: $14,850 per month.

  • The ML engineer checks the trace aggregator: $13,900 per month.

  • The product manager checks the customer contract: revenue of $52,000 per month against that spend.

Each person is looking at a different data source. Each source answers a different question. None of the four, alone, is sufficient for an inference economics decision.

The Four Data Sources

Inference economics requires joining four data sources. Each provides information the others lack. Each has failure modes the others correct.

Source 1: The Trace

The trace is the request-level event log. Each row represents one inference call: timestamps, token counts by type, latency measurements, model, provider, route, cache state, retry count, tool calls, and outcome.

What the trace provides:

  • Request-level cost attribution (which workload, which customer, which route)

  • Latency breakdown (TTFT, TPOT, queue time, tool time)

  • Cache behavior (hit/miss, prefix length, TTL)

  • Failure classification (timeout, rate limit, quality fail, malformed output)

  • Retry and repair chains (linking retries to their original request)

What the trace misses:

  • Rounding and pricing adjustments applied by the provider

  • Credits, commitments, and contract terms

  • Costs outside the inference call (human review, ops overhead)

  • Whether the provider’s token count matches your tokenizer’s count

Traces fail in characteristic ways: sampling or logging gaps drop calls (especially async tool calls and retry chains); the cost calculation uses a stale pricing snapshot from before a provider change; the trace's token count disagrees with the provider's tokenizer; and model-name aliases ("claude-sonnet-4" in the trace, "claude-sonnet-4-20260415" on the invoice) break the join. The model-name alias issue eats the most time in practice and is the one to instrument first.
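One way to instrument the alias problem is to normalize model names before the join and report whatever still fails to match. The sketch below assumes that a dated snapshot name is the alias plus an eight-digit date suffix, as in the example above; that convention, the function names, and the `model-x` entry are hypothetical, and real providers may use other patterns that need their own rules.

```python
import re

# Assumed convention: dated snapshot = alias + "-YYYYMMDD".
_DATE_SUFFIX = re.compile(r"-(\d{8})$")

def canonical_model(name: str) -> str:
    """Strip a trailing date suffix and normalize case and whitespace."""
    return _DATE_SUFFIX.sub("", name.strip().lower())

def unmatched_models(trace_models, invoice_models):
    """Return canonical names that appear on only one side of the join."""
    t = {canonical_model(m) for m in trace_models}
    i = {canonical_model(m) for m in invoice_models}
    return {"trace_only": t - i, "invoice_only": i - t}

report = unmatched_models(
    trace_models=["claude-sonnet-4", "model-x"],   # "model-x" is a placeholder
    invoice_models=["claude-sonnet-4-20260415"],
)
print(report)
```

Anything left in `trace_only` or `invoice_only` is spend that will not reconcile until someone adds a mapping, which makes the report a useful alert.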

Source 2: The Invoice

The invoice is the provider’s billing record. It is the ground truth for what was charged, but not for why.

What the invoice provides:

  • Actual charges after rounding, minimums, batch adjustments, and credits

  • Billing period boundaries

  • Committed spend tracking

  • Tax and compliance line items

What the invoice misses:

  • Request-level attribution (which workload, which customer)

  • Why costs moved (was it volume, model mix, cache miss rate, or output length?)

  • Latency, quality, or reliability information

  • Costs from other providers in a multi-source architecture

Invoices fail in characteristic ways: end-of-month usage posts to the following invoice and breaks trace-invoice reconciliation across the boundary; aggregate line items combine workloads that traces separate; invoice model aliases do not match trace model names; and credits apply inconsistently across line items, so the charged-amount displayed on the dashboard can be wrong by a credit cycle.

Source 3: The Eval

The eval is the quality measurement. It answers whether the inference output was accepted work or failed work.

What the eval provides:

  • Pass/fail determination per output

  • Quality score distribution

  • Failure clustering (which failure modes are common)

  • Repair rate and repair success rate

  • The denominator for LCPR

What the eval misses:

  • Cost information

  • Latency information (unless combined with trace)

  • Whether the eval itself is calibrated and reliable

  • Long-term quality drift vs point-in-time score

Evals fail in characteristic ways: the LLM grader and a human reviewer disagree on the same outputs at rates above 10%; the eval set was built during initial development and no longer reflects production failure-mode frequency; the quality threshold is set low enough that everything passes (and real failures are invisible) or high enough that the workload triggers repairs and escalations it does not need; and eval coverage is partial, with a 5-10% sample on a long-tail workload missing the tail entirely.

Source 4: The Contract

The contract is the commercial agreement between your organization and the inference provider, or between your organization and your customer.

What the contract provides:

  • Revenue per unit of work (for margin calculation)

  • Committed spend and credit balances

  • SLA terms and penalty triggers

  • Data handling and residency requirements

  • Rate limits and burst allowances

  • Overage pricing

What the contract misses:

  • Actual usage patterns

  • Quality or latency performance

  • Whether committed spend is being utilized efficiently

  • Whether the customer is actually using the features that generate inference cost

Contracts fail in characteristic ways: committed spend does not match actual usage patterns, so over-commit burns credits and under-commit pays overage rates; credits mask the true run-rate cost on the dashboard; SLA terms do not match the product's actual reliability profile; and revenue recognition is out of sync with inference-cost timing, so margin reads correctly only at a quarterly aggregate.

Joining The Four Sources

None of the four sources is sufficient alone. Inference economics requires joining them.

The join is difficult because:

  • Traces and invoices use different time boundaries.

  • Traces have request-level granularity; invoices have aggregate line items.

  • Evals may cover a sample, not the full population.

  • Contracts define terms that change the economics but do not produce telemetry.

The join produces a number. But every cost model depends on inputs, and not all inputs are equally trustworthy. A team builds a cost model, puts LCPR = $0.14 and margin = 45% into a quarterly review deck, and six months later margin has dropped to 28%, because three assumptions were wrong from the start: cache hit rate was 35-40% not 60%, output length p50 was 340 tokens not 250, and human escalation rate was 5.5% not 3%. Nobody tracked these assumptions. Nobody owned them. Nobody checked them.

The Assumption Register

Every LCPR calculation depends on inputs. Some inputs are measured. Some are assumed. The assumption register makes the distinction explicit.

| Input | Value | Confidence | Source | Owner | Last verified | Refresh trigger |
| --- | --- | --- | --- | --- | --- | --- |
| Input token price (Route A) | $3.00/MTok | PUBLIC | Provider pricing page | Infra | 2026-05-12 | Provider price change |
| Cache hit rate | 60% | ASSUMED | Design target | ML Eng | Never measured | Monthly trace review |
| Output length p50 | 250 tok | ASSUMED | Prompt test (n=50) | ML Eng | 2026-04-01 | Monthly trace review |
| Eval pass rate | 85% | MEASURED | Eval pipeline output | ML Eng | 2026-05-10 | Weekly |
| Human escalation rate | 3% | ASSUMED | Industry estimate | Support | Never measured | Monthly |
| Ops overhead/day | $25 | ESTIMATED | Eng time allocation | Eng Mgr | 2026-03-15 | Quarterly |

The columns that matter most:

  • Confidence: MEASURED (from production telemetry), PUBLIC (from official source), CONTRACTED (from legal agreement), ESTIMATED (calculated from limited data), ASSUMED (no direct measurement), UNKNOWN (not yet investigated).

  • Last verified: The date someone checked this input against reality. “Never measured” is a valid and important entry.

  • Refresh trigger: What event should cause this input to be re-checked.
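A register like this is easy to keep as data and check mechanically. The sketch below encodes the six rows above and flags entries that are weak or overdue. The `RegisterEntry` type, the `max_age_days` field (a numeric stand-in for the refresh trigger), and the review date are hypothetical; event-based triggers such as "provider price change" would need their own hook.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class RegisterEntry:
    name: str
    value: str
    confidence: str                # MEASURED, PUBLIC, CONTRACTED, ESTIMATED, ASSUMED, UNKNOWN
    owner: str
    last_verified: Optional[date]  # None means "Never measured"
    max_age_days: int              # hypothetical stand-in for the refresh trigger

def stale(entry: RegisterEntry, today: date) -> bool:
    """An entry is stale if never verified or older than its allowed age."""
    if entry.last_verified is None:
        return True
    return (today - entry.last_verified).days > entry.max_age_days

register = [
    RegisterEntry("Input token price (Route A)", "$3.00/MTok", "PUBLIC", "Infra", date(2026, 5, 12), 90),
    RegisterEntry("Cache hit rate", "60%", "ASSUMED", "ML Eng", None, 30),
    RegisterEntry("Output length p50", "250 tok", "ASSUMED", "ML Eng", date(2026, 4, 1), 30),
    RegisterEntry("Eval pass rate", "85%", "MEASURED", "ML Eng", date(2026, 5, 10), 7),
    RegisterEntry("Human escalation rate", "3%", "ASSUMED", "Support", None, 30),
    RegisterEntry("Ops overhead/day", "$25", "ESTIMATED", "Eng Mgr", date(2026, 3, 15), 90),
]

today = date(2026, 5, 15)  # hypothetical review date
weak = [e.name for e in register if e.confidence in {"ASSUMED", "UNKNOWN"}]
overdue = [e.name for e in register if stale(e, today)]

print("weak:   ", weak)
print("overdue:", overdue)
```

With these dates the weak and overdue lists coincide: the three ASSUMED inputs are also the ones nobody has checked recently, or ever.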

Confidence Levels

Not all inputs are equally trustworthy. Confidence labels prevent a cost model from treating a pricing-page number and an industry-average guess with equal weight.

| Level | Meaning | Example | Risk |
| --- | --- | --- | --- |
| MEASURED | Observed from production telemetry, reproducible | Cache hit rate from 30 days of traces | Low, if measurement is correct |
| PUBLIC | Stated by provider in official documentation | Token price from pricing page | Low, until provider changes it |
| CONTRACTED | In a signed agreement | Committed spend, SLA penalties | Low, unless contract is renegotiated |
| ESTIMATED | Derived from partial data or short observation | Output length from a 50-request sample | Medium: sample may not represent production |
| ASSUMED | No direct measurement, based on belief or analogy | Human escalation rate from industry report | High: assumption may be wrong from day one |
| UNKNOWN | Not yet investigated | Cache-breaking event rate | Critical: the input could be anything |

A cost model with three ASSUMED inputs and two UNKNOWN inputs is not wrong. It is incomplete. The assumption register makes the incompleteness visible so the team knows where to invest in measurement.

Data-Quality Conditions

Before a cost model is used for a decision (migration, pricing change, capacity reservation, vendor negotiation), five conditions should hold.

Coverage. Trace coverage above 95%. Missing traces are not random; they correlate with timeouts, errors, and edge cases, so a 90%-coverage trace systematically under-reports the cost of the tail.

Freshness. Pricing snapshot, cache hit rate, and eval pass rate all dated within their refresh trigger. Stale inputs flow through unchanged and silently lie.

Reconciliation. Trace-derived cost within 5% of invoice for the same period. If the gap is wider, the trace is not trustworthy for cost modeling until the gap is explained.

Denominator. The accepted work unit defined and the quality gate running. Without an eval, the denominator is "requests" rather than "accepted work" and LCPR is understated.

Confidence floor. No more than two key inputs in ASSUMED or UNKNOWN status. Above that threshold, the model is a hypothesis. Label it that way in the deck and use it for scoping, not for commitments.
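The five conditions can be sketched as a single gate that runs before any decision deck is built. The function and parameter names are hypothetical. The limit of two weak inputs follows the decision rule later in this chapter, and the cost figures in the usage are the trace-aggregator and invoice numbers from this chapter's field problem; the coverage value and the weak-input count of three are illustrative.

```python
def data_quality_gate(trace_coverage, inputs_fresh, trace_cost, invoice_cost,
                      denominator_defined, weak_inputs, max_weak_inputs=2):
    """Return (checks, delta) where checks maps each condition to pass/fail."""
    delta = abs(invoice_cost - trace_cost) / invoice_cost
    checks = {
        "coverage": trace_coverage > 0.95,
        "freshness": inputs_fresh,
        "reconciliation": delta <= 0.05,
        "denominator": denominator_defined,
        "confidence_floor": weak_inputs <= max_weak_inputs,
    }
    return checks, delta

checks, delta = data_quality_gate(
    trace_coverage=0.97,   # illustrative
    inputs_fresh=True,
    trace_cost=13_900,     # trace aggregator, USD per month
    invoice_cost=14_850,   # provider invoice, USD per month
    denominator_defined=True,
    weak_inputs=3,         # illustrative count of ASSUMED or UNKNOWN inputs
)
failed = [name for name, ok in checks.items() if not ok]

print(f"trace vs invoice delta: {delta:.1%}")
print("failed conditions:", failed)
```

Here the trace undercounts the invoice by about 6.4%, so the model fails reconciliation as well as the confidence floor and should be labeled a hypothesis.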

The Unknown Bucket

Every workload has traffic that cannot be attributed, classified, or evaluated. Requests with missing workload IDs. Calls from deprecated integrations. Test traffic mixed into production. Retry chains where the original request trace was lost.

The honest response is not to ignore this traffic. It is to create an unknown bucket and measure its size.

| Metric | Value | Action threshold |
| --- | --- | --- |
| Requests with missing workload_id | 4% | Investigate if > 5% |
| Requests with missing eval result | 12% | Investigate if > 10% |
| Unattributed cost (trace vs invoice gap) | 3% | Investigate if > 5% |
| Traffic from deprecated routes | 1% | Clean up |

If the unknown bucket is small (under 5% of cost), it is noise. If it is large (over 10%), it is corrupting the cost model. Unknown traffic that correlates with high-cost or low-quality behavior is especially dangerous because it hides the workload’s real economics.
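Measuring the bucket takes only a few lines once traces carry a workload ID and an eval result. The sketch below uses synthetic records and hypothetical field names (`workload_id`, `eval_result`, `cost_usd`); it reports request shares and, separately, the share of cost that falls in the unknown bucket, since the two can diverge sharply.

```python
def unknown_bucket(records):
    """Shares of requests and cost that cannot be attributed or evaluated."""
    n = len(records)
    total_cost = sum(r["cost_usd"] for r in records)
    no_workload = [r for r in records if not r.get("workload_id")]
    no_eval = [r for r in records if r.get("eval_result") is None]
    unknown = [r for r in records
               if not r.get("workload_id") or r.get("eval_result") is None]
    return {
        "missing_workload_share": len(no_workload) / n,
        "missing_eval_share": len(no_eval) / n,
        "unknown_cost_share": sum(r["cost_usd"] for r in unknown) / total_cost,
    }

# Synthetic sample: unevaluated requests are the expensive ones.
records = (
    [{"workload_id": "support", "eval_result": "pass", "cost_usd": 0.01}] * 16
    + [{"workload_id": None, "eval_result": "pass", "cost_usd": 0.01}] * 1
    + [{"workload_id": "support", "eval_result": None, "cost_usd": 0.04}] * 3
)

metrics = unknown_bucket(records)
for name, share in metrics.items():
    print(f"{name:<24}{share:6.1%}")
```

In this sample 20% of requests are unknown by count but they carry about 45% of the cost, which is the dangerous case the paragraph above describes.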

The First Twenty Traces

When a team has no cost model, no eval, and no trace-to-invoice reconciliation, the starting point is not a dashboard or a calculator. It is twenty traces.

Pull twenty real requests from production. For each request, answer:

  1. What workload is this? Can you name it?

  2. What was the input? How many tokens? Was any of it cached?

  3. What was the output? How many tokens? Did it pass quality?

  4. What did it cost? Can you calculate the cost from the pricing snapshot?

  5. Did it meet the latency SLO?

  6. If it failed, what happened next? Retry? Repair? Escalation? Nothing?

Twenty traces will reveal:

  • Whether you can attribute requests to workloads

  • Whether your token counts match the provider’s

  • Whether your cache is hitting

  • Whether your quality gate is running

  • Whether retries and repairs are visible in the trace

If you cannot answer these questions for twenty requests, you cannot build a cost model. The twenty-trace exercise is the minimum viable data quality gate.
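The exercise can be scored with a small checklist that counts, per question, how many sampled traces can answer it. This is a sketch under assumptions: the field names and the `next_action` field are hypothetical, and three synthetic traces stand in for the twenty you would pull from production.

```python
# One check per question; each returns True if the trace can answer it.
QUESTIONS = {
    "workload named": lambda t: bool(t.get("workload_id")),
    "tokens counted": lambda t: t.get("input_tokens") is not None
                                and t.get("output_tokens") is not None,
    "quality known": lambda t: t.get("eval_result") in ("pass", "fail"),
    "cost computable": lambda t: t.get("model") is not None
                                 and t.get("input_tokens") is not None,
    "latency known": lambda t: t.get("e2e_ms") is not None,
    # Only failed requests need a recorded follow-up.
    "follow-up known": lambda t: t.get("eval_result") != "fail"
                                 or t.get("next_action") is not None,
}

def answerable(traces):
    """Count, per question, the traces for which it can be answered."""
    return {q: sum(1 for t in traces if check(t)) for q, check in QUESTIONS.items()}

sample = [
    {"workload_id": "support", "model": "m", "input_tokens": 1200,
     "output_tokens": 300, "eval_result": "pass", "e2e_ms": 900},
    {"workload_id": None, "model": "m", "input_tokens": 800,
     "output_tokens": 150, "eval_result": None, "e2e_ms": 700},
    {"workload_id": "extraction", "model": "m", "input_tokens": 5000,
     "output_tokens": 900, "eval_result": "fail", "e2e_ms": None,
     "next_action": None},
]

for question, count in answerable(sample).items():
    print(f"{question:<16}{count}/{len(sample)}")
```

Any question that scores well below the sample size points at the instrumentation gap to close first.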

What To Measure

For each workload, the minimum viable join requires:

| Field | Source | Purpose |
| --- | --- | --- |
| request_id | Trace | Link requests to outcomes |
| workload_id | Trace | Attribute cost to workload |
| model + provider | Trace + Invoice | Reconcile model aliases |
| input_tokens, output_tokens, cached_tokens | Trace | Calculate trace-derived cost |
| ttft_ms, e2e_ms | Trace | SLO compliance |
| eval_result | Eval | Accepted vs failed |
| invoice_line_item | Invoice | Actual charge |
| revenue_per_unit | Contract | Margin calculation |
| credit_balance | Contract | True cost after credits |
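A minimal version of the join over these fields fits in one function. The sketch below uses plain dictionaries and synthetic values; the function name, the `trace_cost_usd` field, and all amounts are hypothetical. It covers inference cost only, so the per-accepted figure is a floor on LCPR rather than LCPR itself, and it ignores credits.

```python
def four_source_join(traces, evals, invoice_usd, revenue_per_unit):
    """Join trace, eval, invoice, and contract inputs for one workload and period."""
    trace_cost = sum(t["trace_cost_usd"] for t in traces)
    accepted = sum(1 for t in traces if evals.get(t["request_id"]) == "pass")
    unevaluated = sum(1 for t in traces if t["request_id"] not in evals)
    result = {
        "trace_cost": trace_cost,
        "delta": abs(invoice_usd - trace_cost) / invoice_usd,
        "accepted": accepted,
        "unevaluated": unevaluated,
        "cost_per_accepted": None,
        "margin": None,
    }
    if accepted > 0:
        # Use the invoice as the charged amount; the trace only attributes it.
        result["cost_per_accepted"] = invoice_usd / accepted
        result["margin"] = 1 - result["cost_per_accepted"] / revenue_per_unit
    return result

traces = [
    {"request_id": "r1", "workload_id": "support", "trace_cost_usd": 0.012},
    {"request_id": "r2", "workload_id": "support", "trace_cost_usd": 0.015},
    {"request_id": "r3", "workload_id": "support", "trace_cost_usd": 0.011},
    {"request_id": "r4", "workload_id": "support", "trace_cost_usd": 0.012},
]
evals = {"r1": "pass", "r2": "fail", "r3": "pass"}  # r4 was never evaluated

joined = four_source_join(traces, evals, invoice_usd=0.052, revenue_per_unit=0.10)
for key, value in joined.items():
    print(f"{key:<18}{value}")
```

The unevaluated count is reported rather than silently dropped, so the size of the unknown bucket stays visible next to the margin figure.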

Decision rule

Build the four-source join for one workload first. Pick the workload with the highest spend or the most quality variance. Join one day of traces to the corresponding invoice period. Calculate LCPR. Measure the delta between trace-derived cost and invoice. If the delta is under 5%, the trace is trustworthy for daily monitoring. If it exceeds 5%, investigate before relying on trace-based cost reporting.

Before using the cost model for a decision (migration, pricing change, capacity reservation, vendor negotiation), check the assumption register. If more than two key inputs are ASSUMED or UNKNOWN, the model is directional, not reliable. Label it as such in the deck. Use it for scoping, not for commitments.

Upgrade assumptions to measurements by investing in trace coverage, eval pipeline, and invoice reconciliation, in that order. Trace coverage is first because without traces, you cannot measure anything else.

Where this breaks
  • Multi-provider architectures where traces span multiple providers and invoices come from separate billing systems.

  • Real-time streaming where token counts are approximate until the stream completes.

  • Shared tenancy where one provider account serves multiple internal workloads and the invoice does not break down by workload.

  • Rapidly changing prices where the pricing snapshot used for trace-derived cost is stale by the time the invoice arrives.

  • Startup-stage teams that do not have enough volume for statistical confidence. At 50 requests per day, weekly metrics are noisy. Monthly aggregation may be the minimum useful window.

  • Multi-model workflows where a single task uses multiple models in sequence (retrieval, generation, grading, repair). The trace needs to link these into a single task lifecycle, which requires trace correlation.

  • Rapidly changing workloads where the distribution shifts faster than the refresh cycle. A cache hit rate measured last month may be wrong this week if the retrieval context or prompt structure changed.

  • Organizational silos where traces live in engineering, invoices live in finance, evals live in ML, and contracts live in legal. The join requires cross-functional access that many organizations do not have.

Calculator Hook

The calculator view performs the four-source join. Input: trace export, invoice CSV, eval results, contract terms. Output: LCPR, margin, delta analysis, variance waterfall showing which input changed and how much it moved the result.

The calculator includes an assumption register tab. Each input has a confidence field. The sensitivity analysis highlights which ASSUMED inputs would change the LCPR by more than 10% if the real value differs from the assumption. Those are the inputs worth measuring first.
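A one-at-a-time sensitivity check of this kind can be sketched as follows. It is not the calculator's implementation: the cost model is deliberately simplified (the p50 output length stands in for the mean), and the prompt size, token prices, and cost per escalation are hypothetical. The assumed and observed values for cache hit rate, output length, and escalation rate are the ones from the margin-drift example earlier in this chapter, with 37.5% as the midpoint of the 35-40% range.

```python
def cost_per_accepted(p):
    """Simplified per-accepted-unit cost: input + output + escalation, over pass rate."""
    input_cost = p["input_tokens"] * (
        (1 - p["cache_hit_rate"]) * p["input_price"]
        + p["cache_hit_rate"] * p["cached_price"]
    ) / 1e6
    output_cost = p["output_len_p50"] * p["output_price"] / 1e6
    escalation_cost = p["escalation_rate"] * p["cost_per_escalation"]
    return (input_cost + output_cost + escalation_cost) / p["eval_pass_rate"]

base = {
    "input_tokens": 6_000,        # hypothetical prompt size
    "input_price": 3.00,          # USD per MTok
    "cached_price": 0.30,         # hypothetical
    "output_price": 15.00,        # hypothetical
    "cost_per_escalation": 2.00,  # hypothetical, USD
    "eval_pass_rate": 0.85,
    "cache_hit_rate": 0.60,       # ASSUMED
    "output_len_p50": 250,        # ASSUMED
    "escalation_rate": 0.03,      # ASSUMED
}
# Values later observed in production for each ASSUMED input.
alternatives = {"cache_hit_rate": 0.375, "output_len_p50": 340, "escalation_rate": 0.055}

base_value = cost_per_accepted(base)
impact = {}
for name, alt in alternatives.items():
    changed = dict(base, **{name: alt})  # vary one input, hold the rest
    impact[name] = cost_per_accepted(changed) / base_value - 1

flagged = [name for name, change in impact.items() if abs(change) > 0.10]
for name, change in impact.items():
    print(f"{name:<18}{change:+.1%}")
print("measure first:", flagged)
```

Under these assumptions only the escalation rate crosses the 10% line, so it is the input to measure first even though the cache hit rate was off by a wider margin in relative terms.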


Part 1 Summary

Part 1 established three ideas:

  1. Token price is not cost. The provider bill is one component. Retries, repairs, eval, human escalation, and operational overhead are the rest. On quality-sensitive workloads, non-token costs can dominate.

  2. LCPR is the metric. Loaded cost per result, normalized to accepted work units. The formula is simple. The difficulty is in the inputs: trace coverage, eval calibration, invoice reconciliation, and overhead allocation.

  3. Four sources, one join, and a confidence gate. Traces provide attribution. Invoices provide ground truth. Evals provide the denominator. Contracts provide the revenue and constraints. No single source is sufficient. Every input to the cost model has a confidence level. ASSUMED inputs are hypotheses, not measurements. The twenty-trace exercise is the minimum viable starting point.

The next part is Serving Physics: the mechanisms behind the numbers. Output tokens cost more than input tokens; batch size sets the cost-latency frontier; cache hit rate is an economic variable; context length is a memory product. The economics in Part 1 depend on the physics in Part 2.


Evidence Notes for Part 1

| Claim | Type | Source |
| --- | --- | --- |
| Provider billing grammar structure | Public | Provider pricing pages and API documentation |
| LCPR formula and components | Derived | Constructed from standard cost accounting applied to inference |
| Worked example (insurance carrier claims-adjuster guidance, LCPR roughly $0.163) | Anonymized | Anonymized scenario; numbers fabricated but representative of the pattern |
| Human review queue cost | Illustrative | Order of magnitude; actual cost depends on review staffing model |
| Cache hit rate degradation pattern | Reported | Shaped by production observations of prefix-cache eviction under multi-tenant load |
| Invoice-trace delta patterns | Reported | Common reconciliation issues from production deployments |
| Confidence level taxonomy | Derived | Standard data quality framework applied to inference inputs |
| Output token length variance across models | Public | Observable from model documentation and evaluation |

The Chapter 2 worked example uses an anonymized scenario; numbers are fabricated but representative of the pattern. Other examples use synthetic numbers shaped by real provider semantics. No numbers should be attributed to any specific employer, customer, or deployment.