How To Use This Book


Who This Is For

A field manual for people who own production inference decisions: what runs where, what it actually costs, when the architecture is wrong. The primary reader is an engineer or technical leader running real workloads. The secondary reader is a finance or operations partner who needs the cost model without the GPU detail; Parts 1, 3, and 5 are written for them.

The book assumes you can read a formula and know what p99 means. It does not assume familiarity with KV caching, the prefill/decode asymmetry, or roofline analysis. Part 2 teaches those from first principles.


Sources and Numbers

Factual claims about pricing, throughput, capability, or provider behavior carry inline sources: a URL, a citation, or a dated snapshot in the appendix. Synthetic or illustrative material (numbers built to teach a mechanism rather than predict your bill) is called out in prose at the point of use. Prices change, models change, contracts change; if a number in this book would change your decision, check the snapshot date and re-verify.


Reading Paths

Parts 1 and 5 are useful to engineering, product, and finance partners together. Part 2 is the engineering foundation: if your team hasn't reasoned about roofline or KV capacity quantitatively, start there. Part 3 maps workloads to constraints; Part 4 maps options to operational commitment. Most readers will not read front to back, so the table below is a cross-reference, not a curriculum.

Working on                          Start with                                  Then
----------------------------------  ------------------------------------------  --------------------------------------------------------
Cost reduction                      Opener, Part 1 (loaded cost)                Part 2 (cost levers), Part 3 (your workload)
Vendor evaluation                   Part 1 (trace/invoice/eval/contract)        Part 2 (productive capacity), Part 5 (benchmark hygiene)
Serverless vs dedicated             Part 2 (roofline, batch frontier)           Part 4 (utilization break-even)
Bill investigation                  Part 1 (assumption register)                Part 5 (observability, incidents)
Closed-API migration                Part 4 (candidate funnel, migration gates)  Part 4 (reversion signals)
Agentic / coding-agent systems      Opener, Part 2 (prompt caching)             Part 3 (agentic workloads, fanout)
Monthly technical/financial review  Part 1 (economic unit)                      Part 5 (inference review pack)


The Calculator

The calculator runs the math. The book teaches the mechanism and the decision rule; the calculator carries the repeatable computations and sensitivity analysis. Every worked example has a corresponding seed: a YAML file with inputs, assumptions, and expected outputs. Load a seed, change inputs to match your workload, see what moves.
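The load, change, and compare loop can be sketched in a few lines of Python. Everything in this sketch is an assumption made for illustration: the seed is shown as a dict rather than the YAML the calculator reads, the field names are hypothetical, the prices are synthetic, and `usd_per_result` is a toy model (token spend divided by accepted results), not the book's LCPR definition.

```python
import math

# Hypothetical seed: inputs plus an expected output. The real seeds are
# YAML files and their field names may differ from the ones used here.
seed = {
    "inputs": {
        "requests_per_month": 1_000_000,
        "input_tokens_per_request": 2_000,
        "output_tokens_per_request": 500,
        "usd_per_million_input_tokens": 1.00,   # synthetic price
        "usd_per_million_output_tokens": 4.00,  # synthetic price
        "accepted_fraction": 0.8,  # share of requests that yield a usable result
    },
    "expected": {"usd_per_result": 0.005},
}


def usd_per_result(inp):
    """Toy cost model: monthly token spend divided by accepted results."""
    usd_per_request = (
        inp["input_tokens_per_request"] * inp["usd_per_million_input_tokens"]
        + inp["output_tokens_per_request"] * inp["usd_per_million_output_tokens"]
    ) / 1_000_000
    spend = usd_per_request * inp["requests_per_month"]
    results = inp["requests_per_month"] * inp["accepted_fraction"]
    return spend / results


def sensitivity(inp, bump=0.10):
    """Relative change in the output when each input is raised by `bump`."""
    base = usd_per_result(inp)
    moves = {}
    for name in inp:
        changed = dict(inp)
        changed[name] = inp[name] * (1 + bump)
        moves[name] = usd_per_result(changed) / base - 1
    # Largest absolute movers first.
    return dict(sorted(moves.items(), key=lambda kv: abs(kv[1]), reverse=True))


# The seed's expected output doubles as a regression check on the model.
assert math.isclose(usd_per_result(seed["inputs"]), seed["expected"]["usd_per_result"])

for name, move in sensitivity(seed["inputs"]).items():
    print(f"{name:32s} {move:+.1%}")
```

In this toy model the accepted fraction is the dominant input (raising it by 10% lowers cost per result by about 9%), while request volume cancels out entirely. Both results follow from how this sketch is built and say nothing about a real bill.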

Views referenced in the book:

  • LCPR: loaded cost per result, component breakdown, sensitivity.

  • goodput frontier: accepted work per second under SLO constraints.

  • trace-to-invoice reconciliation: four-source join from traces, invoices, evals, and contracts.

  • workload-to-route mapping: workload identity mapped to route candidates.

  • cache break-even: write/read/TTL economics by provider.

  • KV capacity: concurrent sequences by context length and model config.

  • dedicated utilization gate: break-even utilization for dedicated vs serverless.

Every output shows the pricing snapshot date, the inline sources, and which inputs dominate. The calculator doesn't replace measurement; it makes the assumptions explicit. If a number would change your decision, verify it.
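Two of the views above reduce to short arithmetic, sketched here in Python to show what kind of computation the calculator carries. This is a minimal sketch, not the calculator's implementation: the function names, the model shape, the memory budget, and the prices are illustrative assumptions. The KV formula is the standard one for a transformer that stores one key and one value vector per layer per KV head; it ignores paging overhead and fragmentation. The break-even uses a single blended serverless price and leaves out operational overhead.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache bytes for one token: a key and a value per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def max_concurrent_sequences(kv_budget_bytes, context_tokens, **model):
    """Full-context sequences that fit in the memory left over for KV cache."""
    return kv_budget_bytes // (context_tokens * kv_bytes_per_token(**model))


def break_even_utilization(dedicated_usd_per_hour, peak_tokens_per_hour,
                           serverless_usd_per_million_tokens):
    """Fraction of peak throughput at which dedicated cost equals serverless cost."""
    serverless_usd_at_peak = (
        peak_tokens_per_hour / 1_000_000 * serverless_usd_per_million_tokens
    )
    return dedicated_usd_per_hour / serverless_usd_at_peak


# Illustrative model shape: 32 layers, 8 KV heads, head dimension 128, 16-bit values.
model = dict(layers=32, kv_heads=8, head_dim=128, bytes_per_value=2)
print(kv_bytes_per_token(**model))                         # bytes per token
print(max_concurrent_sequences(40 * 2**30, 8192, **model))  # 40 GiB KV budget, 8K context

# Illustrative prices: $4/hour dedicated, 10M tokens/hour at peak, $1 per 1M tokens serverless.
print(break_even_utilization(4.0, 10_000_000, 1.0))
```

With these made-up inputs, each token costs 128 KiB of KV cache, an 8,192-token sequence costs 1 GiB, and 40 sequences fit in the budget; dedicated capacity breaks even at 40% utilization. As with every number in the book, replace the inputs with your own and re-verify before acting on the result.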