How To Use This Book
Who This Is For
This is a field manual for people who own production inference decisions: what runs where, what it actually costs, and when the architecture is wrong. The primary reader is an engineer or technical leader running real workloads. The secondary reader is a finance or operations partner who needs the cost model without the GPU detail; Parts 1, 3, and 5 are written for them.
The book assumes you can read a formula and know what p99 means. It does not assume you know KV cache, the prefill/decode asymmetry, or roofline analysis. Part 2 teaches those from first principles.
Sources and Numbers
Factual claims about pricing, throughput, capability, or provider behavior carry inline sources: a URL, a citation, or a dated snapshot in the appendix. Synthetic or illustrative material (numbers built to teach a mechanism rather than predict your bill) is called out in prose at the point of use. Prices change, models change, contracts change; if a number in this book would change your decision, check the snapshot date and re-verify.
Reading Paths
Parts 1 and 5 are useful to engineering, product, and finance partners together. Part 2 is the engineering foundation: if your team hasn't reasoned quantitatively about roofline or KV capacity, start there. Part 3 maps workloads to constraints; Part 4 maps options to operational commitment. Most readers will not read front to back. The table below is a cross-reference, not a curriculum.
| Working on | Start with | Then |
|---|---|---|
| Cost reduction | Opener, Part 1 (loaded cost) | |
| Vendor evaluation | Part 1 (trace/invoice/eval/contract) | |
| Serverless vs dedicated | Part 2 (roofline, batch frontier) | Part 4 (utilization break-even) |
| Bill investigation | Part 1 (assumption register) | Part 5 (observability, incidents) |
| Closed-API migration | Part 4 (candidate funnel, migration gates) | Part 4 (reversion signals) |
| Agentic / coding-agent systems | Opener, Part 2 (prompt caching) | Part 3 (agentic workloads, fanout) |
| Monthly technical/financial review | Part 1 (economic unit) | Part 5 (inference review pack) |
The Calculator
The calculator runs the math. The book teaches the mechanism and the decision rule; the calculator carries the repeatable computations and sensitivity analysis. Every worked example has a corresponding seed: a YAML file with inputs, assumptions, and expected outputs. Load a seed, change inputs to match your workload, see what moves.
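To make "load a seed, change inputs, see what moves" concrete, here is a minimal sketch in Python. It is not the calculator and not the book's seed schema: the field names, the numbers, and the toy cost formula are all hypothetical and synthetic, chosen only to show how a one-at-a-time sensitivity pass ranks inputs by their effect on the output.

```python
# Hypothetical seed. Field names and values are illustrative, not the book's schema.
seed = {
    "inputs": {
        "input_tokens": 2000,         # tokens per request (synthetic)
        "output_tokens": 500,         # tokens per response (synthetic)
        "price_in_per_mtok": 1.00,    # USD per million input tokens (synthetic)
        "price_out_per_mtok": 4.00,   # USD per million output tokens (synthetic)
        "acceptance_rate": 0.8,       # fraction of results that are accepted
    },
}


def cost_per_accepted_result(inp):
    """Toy cost model: token spend per request divided by acceptance rate."""
    raw = (
        inp["input_tokens"] * inp["price_in_per_mtok"]
        + inp["output_tokens"] * inp["price_out_per_mtok"]
    ) / 1_000_000
    return raw / inp["acceptance_rate"]


def sensitivity(inp, bump=0.10):
    """Raise each input by `bump`, one at a time, and report the relative
    change in the output, largest absolute effect first."""
    base = cost_per_accepted_result(inp)
    deltas = {}
    for key, value in inp.items():
        changed = dict(inp, **{key: value * (1 + bump)})
        deltas[key] = cost_per_accepted_result(changed) / base - 1
    return sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)


base = cost_per_accepted_result(seed["inputs"])
ranking = sensitivity(seed["inputs"])

print(f"base cost per accepted result: ${base:.4f}")
for name, delta in ranking:
    print(f"{name:>20}: {delta:+.1%}")
```

With these synthetic inputs, a 10% change in acceptance rate moves the result more than a 10% change in any single token count or price, which is the kind of "which input dominates" answer a sensitivity view is meant to surface.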
Views referenced in the book:
LCPR: loaded cost per result, component breakdown, sensitivity.
goodput frontier: accepted work per second under SLO constraints.
trace-to-invoice reconciliation: four-source join from traces, invoices, evals, and contracts.
workload-to-route mapping: workload identity mapped to route candidates.
cache break-even: write/read/TTL economics by provider.
KV capacity: concurrent sequences by context length and model config.
dedicated utilization gate: break-even utilization for dedicated vs serverless.
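As one example of the arithmetic behind these views, the KV capacity question can be sketched in a few lines. This assumes a standard transformer that stores one key and one value vector per layer per KV head, that every sequence occupies its full context length, and that there is no paging or fragmentation overhead. The model config and the 40 GiB memory budget are hypothetical, and the function names are illustrative rather than taken from the calculator.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies.

    The leading factor of 2 covers the key and the value vector stored
    per layer per KV head; bytes_per_elem=2 corresponds to fp16/bf16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_concurrent_sequences(kv_budget_bytes, context_len, **model):
    """Whole sequences that fit if each one fills `context_len` tokens."""
    per_sequence = kv_bytes_per_token(**model) * context_len
    return kv_budget_bytes // per_sequence


# Hypothetical config: 32 layers, 8 KV heads (grouped-query attention),
# head dimension 128, 16-bit cache entries.
model = dict(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2)

# Assumed memory left for KV cache after weights and activations.
budget = 40 * 1024**3  # 40 GiB

for ctx in (4096, 32768):
    print(ctx, max_concurrent_sequences(budget, ctx, **model))
```

Under these assumptions each token costs 128 KiB of cache, so the budget holds 80 concurrent sequences at 4,096 tokens and 10 at 32,768. Real serving stacks differ (paged allocation, quantized caches, shorter actual sequences), so treat this as the shape of the calculation rather than a capacity forecast.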
Every output shows the pricing snapshot date, the inline sources, and which inputs dominate. The calculator doesn't replace measurement; it makes the assumptions explicit. If a number would change your decision, verify it.