Inference Economics - Sohail Mohammad

Production inference costs more than your token bill suggests. This work develops the math, the diagnostics, and the tools to measure what inference actually costs: loaded cost per accepted result, not price per million tokens.

The Book

Production Inference Economics: A Field Guide is the canonical reference. It has twenty-seven chapters across five parts: the economic unit (LCPR), serving physics, workload economics, migration gates, and operating the decision. It covers how to measure, model, and operate production inference so that you pick the cheapest architecture that still meets quality, latency, and reliability requirements. Start at the Opener and read straight through, or use Part 0 to pick a reading path.

The Long Essay

The Honest Field Guide to Production Inference is the single-essay version: TCO frameworks, vendor evaluation, architecture patterns, and a staged playbook from API to dedicated GPU. Read this if you want the framework in one sitting.

Production Inference Economics Series

Three articles develop the measurement methodology in depth. Each stands alone; together they form a sequence.

The Denominator Problem --- The most common mistake in inference economics is dividing by the wrong number. Loaded cost per result (LCPR) exposes the gap between naive token cost and actual production cost on a quality-sensitive workload.
The LCPR Calculator --- Open-source calculator implementing the four-source join (trace + invoice + eval + contract) as code. Three worked examples, cache break-even analysis, and KV memory sizing.
What Your Workload Actually Costs --- Not all inference is the same. Per-workload LCPR exposes the cross-subsidy that blended averages hide, with cost models for conversational, agentic, RAG, extraction, voice, and batch workloads.
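
As a rough illustration of the denominator problem the series describes, the sketch below contrasts a naive per-request token cost with a loaded cost per accepted result. It is not the calculator's implementation: the `WorkloadPeriod` fields, the single `overhead_usd` bucket, and every figure in the example are hypothetical, chosen only to show how the choice of denominator moves the answer.

```python
from dataclasses import dataclass


@dataclass
class WorkloadPeriod:
    """Hypothetical inputs for one billing period of one workload."""
    requests: int             # total requests served
    accepted: int             # results that passed the quality bar
    tokens_per_request: int   # average billed tokens per request
    price_per_million: float  # list price in USD per million tokens
    overhead_usd: float       # assumed non-token costs lumped together


def token_bill(p: WorkloadPeriod) -> float:
    """Total token spend for the period in USD."""
    return p.requests * p.tokens_per_request * p.price_per_million / 1e6


def naive_cost_per_request(p: WorkloadPeriod) -> float:
    """Token bill divided by every request, accepted or not."""
    if p.requests == 0:
        raise ValueError("no requests in period")
    return token_bill(p) / p.requests


def loaded_cost_per_accepted_result(p: WorkloadPeriod) -> float:
    """Token bill plus assumed overhead, divided by accepted results only."""
    if p.accepted == 0:
        raise ValueError("no accepted results in period")
    return (token_bill(p) + p.overhead_usd) / p.accepted


# Illustrative numbers only, not taken from any real workload.
example = WorkloadPeriod(
    requests=10_000,
    accepted=8_000,
    tokens_per_request=2_000,
    price_per_million=5.0,
    overhead_usd=100.0,
)
naive = naive_cost_per_request(example)
loaded = loaded_cost_per_accepted_result(example)
print(f"naive: ${naive:.4f} per request, loaded: ${loaded:.4f} per accepted result")
```

With these made-up inputs the loaded figure is two and a half times the naive one, driven partly by the added overhead and partly by dividing by accepted results instead of all requests.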

Companion Pieces

Two longer essays expand on specific chapters of the book.

Trace Autopsy --- A repeatable diagnostic for going from raw trace events to loaded cost per accepted result. Companion to Chapter 25 of the book (which develops the method on a different anonymized regulated-workload scenario).
Goodput or It Didn't Happen --- GPU utilization can be 78% while 30% of requests fail SLO constraints. The goodput frontier replaces single-number benchmarks with decision-grade surfaces. Excerpted from Chapter 9 of the book.
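
To make the goodput idea concrete, here is a minimal sketch that counts only requests that succeed within a latency target. It assumes a single latency SLO per request, which is a simplification; the `Request` type, the `goodput_fraction` helper, the 500 ms threshold, and the sample latencies are all hypothetical and do not come from the book or the essay.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Request:
    """Hypothetical per-request record pulled from a trace."""
    latency_ms: float
    succeeded: bool


def goodput_fraction(requests: Sequence[Request], slo_ms: float) -> float:
    """Share of requests that succeeded and met the latency SLO."""
    if not requests:
        return 0.0
    good = sum(1 for r in requests if r.succeeded and r.latency_ms <= slo_ms)
    return good / len(requests)


# Ten illustrative requests: seven meet a 500 ms SLO, three do not.
sample = [
    Request(120.0, True), Request(340.0, True), Request(410.0, True),
    Request(95.0, True), Request(480.0, True), Request(300.0, True),
    Request(250.0, True),
    Request(900.0, True),    # completed, but too slow to count
    Request(1500.0, True),   # completed, but too slow to count
    Request(200.0, False),   # fast, but the request failed
]
print(f"goodput fraction: {goodput_fraction(sample, slo_ms=500.0):.2f}")
```

A server handling this sample could report high utilization, since every request consumed compute, while only seven in ten count toward goodput. A frontier in the sense described above would presumably repeat this measurement across load levels and SLO settings instead of reporting one number.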

Tools

LCPR Calculator --- Interactive Streamlit app for LCPR comparison, sensitivity analysis, break-even analysis, migration readiness, and goodput frontier testing. Source code on GitHub.