Inference Economics

Production inference costs more than your token bill suggests. This work develops the math, the diagnostics, and the tools to measure what inference actually costs --- loaded cost per accepted result, not price per million tokens.

The Book

Production Inference Economics: A Field Guide is the canonical reference. It has twenty-seven chapters across five parts: the economic unit (LCPR), serving physics, workload economics, migration gates, and operating the decision. It covers how to measure, model, and operate production inference so that you pick the cheapest architecture that still meets your quality, latency, and reliability requirements. Start at the Opener and read straight through, or use Part 0 to pick a reading path.

The Long Essay

The Honest Field Guide to Production Inference is the single-essay version: TCO frameworks, vendor evaluation, architecture patterns, and a staged playbook from API to dedicated GPU. Read this if you want the framework in one sitting.

Production Inference Economics Series

Three articles develop the measurement methodology in depth. Each stands alone; together they form a sequence.

  1. The Denominator Problem --- The most common mistake in inference economics is dividing by the wrong number. Loaded cost per result (LCPR) exposes the gap between naive token cost and actual production cost on a quality-sensitive workload.

  2. The LCPR Calculator --- Open-source calculator implementing the four-source join (trace + invoice + eval + contract) as code. Three worked examples, cache break-even analysis, and KV memory sizing.

  3. What Your Workload Actually Costs --- Not all inference is the same. Per-workload LCPR exposes the cross-subsidy that blended averages hide, with cost models for conversational, agentic, RAG, extraction, voice, and batch workloads.
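
The distinction these articles draw can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's implementation: the `Workload` class, its fields, and every number below are hypothetical, and LCPR is assumed here to mean loaded cost divided by the count of results that pass the quality bar.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical per-workload totals for one billing period."""
    name: str
    loaded_cost_usd: float  # assumed: every cost attributed to this workload
    requests: int           # all requests served
    accepted: int           # results that passed the quality bar

    def cost_per_request(self) -> float:
        # Naive view: divide by everything that was served.
        return self.loaded_cost_usd / self.requests

    def lcpr(self) -> float:
        # Assumed definition: divide by accepted results only.
        return self.loaded_cost_usd / self.accepted


def blended_lcpr(workloads: list[Workload]) -> float:
    """Single average across workloads; hides per-workload differences."""
    total_cost = sum(w.loaded_cost_usd for w in workloads)
    total_accepted = sum(w.accepted for w in workloads)
    return total_cost / total_accepted


# Illustrative numbers only.
workloads = [
    Workload("chat", loaded_cost_usd=1000.0, requests=100_000, accepted=80_000),
    Workload("extraction", loaded_cost_usd=3000.0, requests=50_000, accepted=20_000),
]

for w in workloads:
    print(f"{w.name}: per request ${w.cost_per_request():.4f}, LCPR ${w.lcpr():.4f}")
print(f"blended LCPR: ${blended_lcpr(workloads):.4f}")
```

With these made-up inputs, the chat workload comes out at $0.0125 per accepted result and extraction at $0.15, while the blended figure is $0.04. The blended average overstates the cost of chat and understates the cost of extraction, which is the cross-subsidy the third article describes.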

Companion Pieces

Two longer essays expand on specific chapters of the book.

Tools

LCPR Calculator --- Interactive Streamlit app for LCPR comparison, sensitivity analysis, break-even analysis, migration readiness, and goodput frontier testing. Source code on GitHub.