How To Use This Book

Who This Is For

A field manual for people who own production inference decisions: what runs where, what it actually costs, when the architecture is wrong. The primary reader is an engineer or technical leader running real workloads. The secondary reader is a finance or operations partner who needs the cost model without the GPU detail; Parts 1, 3, and 5 are written for them.

The book assumes you can read a formula and know what p99 means. It does not assume you know KV cache, the prefill/decode asymmetry, or roofline analysis. Part 2 teaches those from first principles.

Sources and Numbers

Factual claims about pricing, throughput, capability, or provider behavior carry inline sources: a URL, a citation, or a dated snapshot in the appendix. Synthetic or illustrative material (numbers built to teach a mechanism rather than predict your bill) is called out in prose at the point of use. Prices change, models change, contracts change; if a number in this book would change your decision, check the snapshot date and re-verify.

Reading Paths

Parts 1 and 5 are useful to engineering, product, and finance partners together. Part 2 is the engineering foundation: if your team hasn't reasoned about roofline or KV capacity quantitatively, start there. Part 3 maps workloads to constraints; Part 4 maps options to operational commitment. Most readers will not read front to back. The table below is a cross-reference, not a curriculum.

Working on	Start with	Then
Cost reduction	Opener, Part 1 (loaded cost)	Part 2 (cost levers), Part 3 (your workload)
Vendor evaluation	Part 1 (trace/invoice/eval/contract)	Part 2 (productive capacity), Part 5 (benchmark hygiene)
Serverless vs dedicated	Part 2 (roofline, batch frontier)	Part 4 (utilization break-even)
Bill investigation	Part 1 (assumption register)	Part 5 (observability, incidents)
Closed-API migration	Part 4 (candidate funnel, migration gates)	Part 4 (reversion signals)
Agentic / coding-agent systems	Opener, Part 2 (prompt caching)	Part 3 (agentic workloads, fanout)
Monthly technical/financial review	Part 1 (economic unit)	Part 5 (inference review pack)

The Calculator

The calculator runs the math. The book teaches the mechanism and the decision rule; the calculator carries the repeatable computations and sensitivity analysis. Every worked example has a corresponding seed: a YAML file with inputs, assumptions, and expected outputs. Load a seed, change inputs to match your workload, see what moves.
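
The load, change, and compare loop can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the calculator's actual schema or model: the seed is written as a dict rather than YAML to avoid dependencies, and the key names, the values, and the toy cost_per_result function are all hypothetical.

```python
import copy
import math

# Hypothetical seed. Real seeds are YAML files; the key names and values here
# are illustrative assumptions, not the calculator's schema.
seed = {
    "inputs": {"tokens_per_result": 2_000, "price_per_million_tokens": 0.50},
    "assumptions": ["synthetic price", "no retries or cache effects"],
    "expected": {"cost_per_result": 0.001},
}


def cost_per_result(inputs):
    # Toy model: token cost only. A stand-in for whatever a real seed computes.
    return inputs["tokens_per_result"] / 1_000_000 * inputs["price_per_million_tokens"]


def check_seed(seed):
    """Return True if the model reproduces the seed's expected output."""
    actual = cost_per_result(seed["inputs"])
    return math.isclose(actual, seed["expected"]["cost_per_result"], rel_tol=1e-9)


def with_overrides(seed, **overrides):
    """Copy the seed and replace selected inputs, leaving the original untouched."""
    changed = copy.deepcopy(seed)
    changed["inputs"].update(overrides)
    return changed


# Confirm the seed reproduces its own expected output, then change one input.
baseline = cost_per_result(seed["inputs"])
mine = with_overrides(seed, tokens_per_result=6_000)
ratio = cost_per_result(mine["inputs"]) / baseline
print(f"seed ok: {check_seed(seed)}, cost vs baseline: {ratio:.2f}x")
```

Checking the seed against its expected output before changing anything separates a broken model from a changed input.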

Views referenced in the book:

LCPR: loaded cost per result, component breakdown, sensitivity.
goodput frontier: accepted work per second under SLO constraints.
trace-to-invoice reconciliation: four-source join from traces, invoices, evals, and contracts.
workload-to-route mapping: workload identity mapped to route candidates.
cache break-even: write/read/TTL economics by provider.
KV capacity: concurrent sequences by context length and model config.
dedicated utilization gate: break-even utilization for dedicated vs serverless.
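
Two of these views, KV capacity and the dedicated utilization gate, reduce to arithmetic short enough to sketch. The Python below is a minimal illustration, not the calculator's implementation. The names ModelConfig, kv_bytes_per_token, max_concurrent_sequences, and break_even_utilization are hypothetical, and the model shape and prices are synthetic. It assumes a 16-bit KV cache, standard multi-head or grouped-query attention, every sequence reserving its full context length, and a single blended serverless token price.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical model shape; field names are illustrative."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_element: int = 2  # assumption: 16-bit KV cache (fp16/bf16)


def kv_bytes_per_token(cfg: ModelConfig) -> int:
    # One key vector and one value vector per layer per KV head: factor of 2.
    return 2 * cfg.num_layers * cfg.num_kv_heads * cfg.head_dim * cfg.bytes_per_element


def max_concurrent_sequences(cfg: ModelConfig, kv_budget_bytes: int, context_length: int) -> int:
    # Worst case: each sequence holds KV entries for its full context length.
    # Ignores paging, prefix sharing, and allocator fragmentation.
    if context_length <= 0:
        raise ValueError("context_length must be positive")
    return kv_budget_bytes // (kv_bytes_per_token(cfg) * context_length)


def break_even_utilization(
    dedicated_cost_per_hour: float,
    peak_tokens_per_hour: float,
    serverless_price_per_million_tokens: float,
) -> float:
    # Fraction of peak throughput at which dedicated and serverless cost the same.
    # A result above 1.0 means dedicated cannot break even at this peak.
    # Ignores operational overhead and SLO headroom.
    if peak_tokens_per_hour <= 0 or serverless_price_per_million_tokens <= 0:
        raise ValueError("throughput and price must be positive")
    serverless_cost_at_peak = peak_tokens_per_hour / 1_000_000 * serverless_price_per_million_tokens
    return dedicated_cost_per_hour / serverless_cost_at_peak


# Synthetic inputs chosen to keep the arithmetic easy to follow.
cfg = ModelConfig(num_layers=32, num_kv_heads=8, head_dim=128)
kv_budget = 40 * 2**30  # assumption: 40 GiB left for KV cache after weights

for context_length in (8_192, 32_768):
    print(context_length, max_concurrent_sequences(cfg, kv_budget, context_length))

print(break_even_utilization(10.0, 20_000_000, 1.0))
```

With these synthetic inputs, each token needs 128 KiB of KV cache, so quadrupling context from 8,192 to 32,768 tokens cuts concurrent sequences from 40 to 10. The dedicated deployment breaks even at 50% of peak throughput.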

Every output shows the pricing snapshot date, the inline sources, and which inputs dominate. The calculator doesn't replace measurement; it makes the assumptions explicit. If a number would change your decision, verify it.