Serving Physics

Chapters 4–9

Part 1 defined the economic unit: loaded cost per accepted result, not token price. Part 2 explains why the numbers come out the way they do. The mechanisms here (roofline reasoning, batch amortization, KV cache pressure, prefill/decode asymmetry, prompt caching, and productive capacity) are the vocabulary that Parts 3 through 5 depend on.

A reader who skips Part 2 can still use LCPR and the calculator. But they will not know why cache hit rate changes cost, why output tokens cost more than input tokens, why batch size sets the latency-cost frontier, or why “GPU utilization” can be high while productive work is low. The physics explains the economics.


Chapter 4: The Hardware Cost Floor

Field Problem

The benchmark says 4,000 tokens per second. Production delivers 800. The 7B model is deployed on an A100 80GB, the same hardware the benchmark used. The first hypothesis is misconfiguration. The second is that the benchmark was fraudulent. Both are wrong. The benchmark ran at batch 256 with synthetic short prompts and uniform 50-token outputs. Production runs at batch 4-8 with mixed prompt lengths, p99 latency constraints, and quality gates that reject 10% of outputs.

The benchmark measured peak throughput at a point on the roofline that production never reaches. The gap is not a bug. It is the shape of inference physics.

Mechanism

The roofline model comes from high-performance computing. For any computation, performance is bounded by two ceilings: compute throughput and memory bandwidth. The computation cannot exceed either ceiling. Where the workload sits relative to these ceilings determines what limits it.

For a single decode step in a dense transformer:

T_step ≥ max(W_active / BW_eff + N_batch × M_kv / BW_eff, N_batch × O_tok / F_eff)

Where:

| Symbol | Unit | Meaning |
|---|---|---|
| N_batch | sequences | Active decode batch size |
| W_active | bytes | Active weight bytes read per step |
| BW_eff | bytes/s | Effective sustained memory bandwidth |
| F_eff | ops/s | Effective sustained compute throughput |
| O_tok | ops/token | Operations per generated token |
| M_kv | bytes/token | KV bytes read per token at current context |

Two things to notice. First, this is a max(), not a sum. The bottleneck is whichever ceiling is lower for the current workload. Second, W_active appears once regardless of batch size: weight bytes are read once per step and shared across the batch. N_batch × M_kv grows with batch because each live sequence has its own KV state.
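Setting the two arguments of the max() equal gives the batch size at which a workload crosses from memory-bound to compute-bound. The derivation below is a sketch that follows from the bound above. N* is a name introduced here for the crossover batch, and the result only holds when O_tok / F_eff exceeds M_kv / BW_eff. Otherwise KV reads grow as fast as compute and the step never becomes compute-bound.

```latex
\begin{aligned}
\frac{W_{\text{active}} + N^{*} M_{kv}}{BW_{\text{eff}}}
  &= \frac{N^{*} O_{\text{tok}}}{F_{\text{eff}}} \\
N^{*} \left( \frac{O_{\text{tok}}}{F_{\text{eff}}} - \frac{M_{kv}}{BW_{\text{eff}}} \right)
  &= \frac{W_{\text{active}}}{BW_{\text{eff}}} \\
N^{*}
  &= \frac{W_{\text{active}} / BW_{\text{eff}}}{O_{\text{tok}} / F_{\text{eff}} - M_{kv} / BW_{\text{eff}}}
\end{aligned}
```

Below N* the memory term sets the bound. Above it the compute term does.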

Naive Answer

“My GPU has 312 TFLOPS. My model does X FLOPs per token. Therefore throughput is 312T / X.”

This ignores memory bandwidth entirely. Most production decode serving is not compute-bound. At the batch sizes where latency SLOs are met, the GPU can compute faster than it can read weights and KV state from HBM, so the compute units often sit idle waiting for data.

Better Model

Inference serving lives in three regimes:

Memory-bandwidth-bound (small batch). Weight fetch dominates. Each decode step reads the full active weights from HBM. Cost per token is high because few sequences share the weight read. This is where most latency-sensitive serving operates.

Transitional (medium batch). Weight fetch is partially amortized. KV fetch is growing. Neither compute nor bandwidth clearly dominates. This is often the operating sweet spot under SLO constraints.

Compute-bound (large batch). Weight fetch is fully amortized over the batch. Compute per token becomes the floor. Cost per token stops falling. But latency rises because more tokens compete for compute and KV memory.

The roofline does not tell you which regime is “best.” The SLO, workload shape, and quality gate determine where on the roofline you can operate.

Worked Example

Setup: Dense 7B model, FP16 weights (14 GB), A100 80GB SXM.

| Parameter | Value | Source |
|---|---|---|
| W_active | 14 GB | Model config × 2 bytes/param |
| BW_eff | 1,800 GB/s | Sustained, ~88% of peak 2,039 GB/s |
| F_eff | 250 TFLOPS | Sustained FP16, ~80% of peak 312 |
| O_tok | ~14 GFLOPs/token | 2 × params for decode (approximate) |
| M_kv | ~0.5 MB/token at 2K context | Model-specific |
| c_accel | $0.00044/s | ~$1.59/hr on-demand |
| p_pass | 0.90 | 90% accepted |

| Batch | Bottleneck | Raw cost/MTok | Accepted cost/MTok |
|---|---|---|---|
| 1 | Memory (weight fetch) | ~$3.44 | ~$3.82 |
| 8 | Memory (amortizing) | ~$0.49 | ~$0.54 |
| 32 | Transitional | ~$0.19 | ~$0.21 |
| 128 | Compute-bound | ~$0.13 | ~$0.14 |
| 256 | KV pressure rising | ~$0.13 | ~$0.14 (if SLO holds) |

All numbers are illustrative. Actual throughput depends on scheduler, quantization, tensor parallelism, prompt/output mix, and prefix caching.

Cost falls about 27x from batch 1 to batch 128 in this worked example, then stops falling. The benchmark's 4,000 tok/s came from batch 256+. Production at batch 4-8 lives in the expensive part of the curve, by design: that is where latency SLOs are met.
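The roofline bound can be evaluated directly from the worked-example inputs. The sketch below is a minimal illustration, not a throughput predictor: the function names are introduced here, and it treats M_kv as the KV bytes read per sequence per step, which is an assumption about how the parameter table is meant to be used.

```python
def decode_step_bound(n_batch, w_active, m_kv, o_tok, bw_eff, f_eff):
    """Lower bound on one decode step (seconds) and which ceiling sets it."""
    t_memory = (w_active + n_batch * m_kv) / bw_eff  # weights read once, KV per sequence
    t_compute = n_batch * o_tok / f_eff              # one token per sequence per step
    bottleneck = "memory" if t_memory >= t_compute else "compute"
    return max(t_memory, t_compute), bottleneck


def cost_per_mtok(n_batch, t_step, c_accel, p_pass=1.0):
    """Dollars per million tokens; p_pass < 1 converts raw cost to accepted cost."""
    tokens_per_second = n_batch / t_step
    return c_accel / tokens_per_second * 1e6 / p_pass


# Worked-example inputs from the parameter table.
W_ACTIVE = 14e9        # bytes
M_KV = 0.5e6           # bytes read per sequence per step (assumed interpretation)
O_TOK = 14e9           # FLOPs per generated token
BW_EFF = 1.8e12        # bytes/s
F_EFF = 250e12         # FLOP/s
C_ACCEL = 1.59 / 3600  # $/s
P_PASS = 0.90

for batch in (1, 8, 32, 128, 256):
    t_step, bottleneck = decode_step_bound(batch, W_ACTIVE, M_KV, O_TOK, BW_EFF, F_EFF)
    raw = cost_per_mtok(batch, t_step, C_ACCEL)
    accepted = cost_per_mtok(batch, t_step, C_ACCEL, P_PASS)
    print(f"batch={batch:>3}  {bottleneck:<7}  floor raw=${raw:.3f}  accepted=${accepted:.3f}")
```

Batch 1 reproduces the table's ~$3.44 raw and ~$3.82 accepted. At larger batches the computed floor sits below the table's illustrative figures, which is what a bound should do. With these inputs the memory and compute terms cross near batch 140, so the 128 row is right at the boundary.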

A real-time conversational system with a 1,500ms end-to-end budget allocates roughly half to model inference (300-800ms depending on output length). When p99 inference latency drifts 5-10% per hour, the drift is invisible against a stable mean for a long time: a 600ms p99 climbing to 720ms still passes most aggregated dashboards. On one workload I helped diagnose, the drift went unnoticed for about 36 hours because the alarms were thresholded on p99 absolute, not p99 trend. The fix took ten minutes (pin the KV pool refresh interval), but the detection took a day and a half. Average and even p99-absolute can mask compounding behavior in a stateful serving system. Add trend-based alarms on p99 and KV utilization, not just absolute thresholds.

The Batch Size Frontier

The roofline sets the physics. The batch policy determines the economics, and it is often invisible in the token price.

A platform engineer pitches a switch: "this serverless endpoint is 3x cheaper per token." The team migrates and the bill drops. p99 latency doubles. Support tickets about "slow AI" rise. The mistake was comparing token prices without understanding that the cheap endpoint achieves low cost by running large batches, trading latency for throughput.

This is not fraud. This is the batch size frontier: every serving system makes a choice about where on the cost-latency curve to operate, and that choice is invisible in the token price.

From the roofline, weight fetch (W_active) amortizes across the batch: each additional sequence shares the same weight read. KV fetch (N_batch × M_kv) does not amortize: it grows with batch because each sequence brings its own context. After weight fetch is amortized, compute becomes the floor.

The consequence: a provider running at large batch gets cheaper cost per token. A provider running at small batch gets lower latency. The token price reflects the batch policy, not just the hardware or model.

The bus-versus-taxi analogy fits: shared vehicles are cheaper per passenger-mile, but you wait for departure and stops; the taxi is expensive but leaves on your timing. The question is not which is cheaper. It is which serves the passenger's time constraint.

“The cheap provider is just more efficient.”

Maybe. But if the cheap provider achieves 3x better cost by running at 4x larger batch, the latency tradeoff is baked in. The provider’s efficiency is real, but it is not free. The cost was paid in latency and potentially in tail-latency variance.

The batch size frontier has three zones:

Zone 1: Weight-bound. Small batch. Weight fetch dominates. Cost per token is high. Latency per token is low (few tokens compete for resources). This is where real-time voice, interactive chat with strict TTFT requirements, and low-concurrency dedicated endpoints operate.

Zone 2: Amortization sweet spot. Medium batch. Weight fetch is amortized. KV fetch is growing but manageable. Compute has not yet become the floor. This is where most production serverless APIs operate. The token price reflects partial amortization.

Zone 3: Compute floor. Large batch. Weight fetch is negligible. Compute per token is the binding constraint. Cost per token stops falling. But TPOT rises because decode steps take longer when the batch is large. TTFT can rise sharply if prefill requests queue behind a large decode batch. This is where offline batch processing and some throughput-optimized serverless tiers operate.

The provider’s published token price reflects which zone they target. A low token price usually means Zone 2 or 3. A high token price with tight latency guarantees usually means Zone 1. Neither is wrong. The question is which zone fits the workload’s latency and quality requirements.

Using the same 7B model:

| Zone | Batch | Cost/MTok (accepted) | p99 TPOT | Use case fit |
|---|---|---|---|---|
| Weight-bound | 1-4 | $1.90-$3.82 | 8-32ms | Voice, real-time |
| Amortization | 8-64 | $0.14-$0.54 | 32-80ms | Interactive chat |
| Compute floor | 128-512 | $0.13-$0.14 | 80-200ms+ | Batch, offline |

Illustrative numbers; exact figures depend on hardware, model, scheduler, and SLO.

The cost difference between Zone 1 and Zone 3 is ~27x. That gap explains why “same model, different provider, different price” is not necessarily unfair. It may reflect different batch policy choices.

Decision rule

The roofline determines where your workload sits on the cost-latency frontier. Do not plan capacity from peak throughput. Plan from the batch size where your p99 latency starts violating your SLO. Calculate cost at that operating point.

When comparing providers or deployment modes, ask: at what batch size does this system operate under my traffic load, and does the resulting latency meet my SLO? The cheapest provider that violates your latency SLO is not cheaper. It is more expensive, because the latency failures create retries, timeouts, or customer dissatisfaction that increase LCPR.
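The decision rule reduces to a filter and a minimum over a measured batch sweep. The sketch below is one way to express it; the function name and the sweep rows are hypothetical. The costs echo the worked example, while the p99 TPOT values are invented for illustration and would come from your own sweep.

```python
def pick_operating_point(sweep, p99_slo_ms):
    """Return the cheapest sweep row whose measured p99 TPOT meets the SLO, or None."""
    feasible = [row for row in sweep if row["p99_tpot_ms"] <= p99_slo_ms]
    if not feasible:
        return None
    return min(feasible, key=lambda row: row["cost_per_mtok"])


# Hypothetical batch sweep: cost is accepted $/MTok, latency is measured p99 TPOT.
sweep = [
    {"batch": 1, "p99_tpot_ms": 8, "cost_per_mtok": 3.82},
    {"batch": 8, "p99_tpot_ms": 32, "cost_per_mtok": 0.54},
    {"batch": 32, "p99_tpot_ms": 55, "cost_per_mtok": 0.21},
    {"batch": 128, "p99_tpot_ms": 80, "cost_per_mtok": 0.14},
]

for slo_ms in (40, 100, 5):
    print(slo_ms, pick_operating_point(sweep, slo_ms))
```

A 40 ms SLO selects batch 8 at $0.54, not the $0.14 the peak-throughput row advertises. A 5 ms SLO returns None: no point on this sweep is feasible.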

What To Measure

  • Sustained memory bandwidth for your GPU/engine (not peak spec)

  • Active decode batch size under production traffic

  • TTFT and TPOT distribution at each batch size in a sweep

  • The batch size where p99 latency crosses your SLO threshold

  • Cost per accepted output token at the operating batch size

  • Latency variance (p50 vs p99) as traffic changes batch size

Where this breaks
  • The roofline’s max() is a teaching model. Real serving kernels overlap compute and memory to varying degrees. The roofline is a bound, not a prediction.

  • Quantization, tensor parallelism, speculative decoding, and MoE change the effective arithmetic intensity and shift the transition point.

  • Prefill and decode have different roofline shapes. This derivation covers decode only. Prefill can be compute-bound at much smaller batch sizes.

  • KV memory capacity can force smaller batches even when the roofline says larger batches would be cheaper.

  • Continuous batching complicates the picture. Modern serving engines do not run fixed batches. They add and remove sequences dynamically at each decode step. The “batch size” is a running average, not a constant.

  • Multi-model routing can mean different workloads see different effective batch sizes on the same hardware.

  • Queue time can dominate latency at high utilization, independent of batch size.

Calculator Hook

The serving physics analysis takes hardware cost, measured bandwidth/compute, weight bytes, SLO constraints, and traffic rate. It outputs the cost curve by batch size, the memory vs compute regime boundary, the operating zone for your traffic, and infeasible batch points.


Chapter 5: Prefill, Decode, and Why Output Tokens Cost More

Field Problem

The prompt is long: 8,000 tokens of retrieved context plus a 500-token system prompt. The output is short, typically 150-300 tokens. The team running this RAG pipeline sees that output tokens cost 3-5x more than input tokens on most provider pricing pages and assumes this is provider markup. It is not markup. It is physics.

Mechanism

Inference has two phases:

Prefill processes the entire input prompt in one forward pass (or chunked passes). All input tokens are processed in parallel. Prefill can saturate the GPU’s compute resources at relatively small batch sizes because the parallelism comes from the prompt length, not the number of concurrent sequences. Prefill is often compute-bound for long prompts.

Decode generates output tokens one at a time, autoregressively. Each decode step produces one token per sequence. The next token depends on all previous tokens through the KV cache. Decode is sequential. At moderate batch sizes, decode is often memory-bandwidth-bound because each step reads the model weights and KV state but produces relatively little compute work per step.

The economic consequence is that output tokens are structurally more expensive than input tokens. A rough cost ratio for a dense transformer is:

cost_out / cost_in ≈ (FLOPs_per_decode_token / FLOPs_per_prefill_token) × (1 / batch_amortization)

Provider pricing reflects this asymmetry. When a provider charges 3x for output tokens, they are not adding 3x margin. They are passing through the structural cost difference between prefill and decode.

Naive Answer

“Output tokens cost more because providers want higher margins on generation.”

Providers do have pricing strategy. But the output/input price ratio (typically 2-6x on major API providers as of the May 2026 pricing snapshot) is broadly consistent across providers with very different business models. The ratio varies: some providers price output at 3x input, others at 4x or 6x, and some serverless open-model endpoints use symmetric pricing. The ratio persists across most providers because it approximates the structural cost difference between prefill and decode, though product strategy and market positioning also influence it.

Better Model

Decompose the user-visible latency:

T_user = TTFT + N_out × TPOT

TTFT depends mostly on prefill. Long prompts mean longer TTFT. TPOT depends on decode. Each output token adds one decode step. Output-heavy workloads (chatbots generating long answers, code agents producing files) spend most of their wall time and GPU time in decode.

For economics: input tokens have low marginal cost when prefix-cached (the KV is served from memory, prefill is skipped). Output tokens always pay the decode cost. Cached input tokens can be 90% cheaper than uncached. Output tokens cannot be cached in the same way—they do not exist until generated.

The phase asymmetry is not academic. Operators serving large MoE models increasingly deploy separate hardware for each phase: one search infrastructure provider uses tensor-parallel prefillers (TP=4) because prefill is compute-bound, and data-parallel decoders spread across up to 16 GPUs (EP=16) because decode is memory-bandwidth-bound and benefits from distributing requests across more devices. The sharding strategy differs because the bottleneck differs (Perplexity, "Hosting Qwen on Blackwell," 2026).

With 50 concurrent streaming requests and slow-consuming clients, the inference serving actor’s memory grew linearly until OOM. Bounded output buffers—a simple asyncio queue with a maximum size—prevented the cascade. Any streaming inference deployment without backpressure is a memory leak waiting for slow clients.
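A bounded output buffer of the kind described above can be sketched with asyncio.Queue(maxsize=...). This is a toy model under stated assumptions, not the deployment's actual code: produce_tokens stands in for the decode loop, slow_client for a slow consumer, and the buffer size and delay are arbitrary.

```python
import asyncio


async def produce_tokens(queue, n_tokens):
    """Stand-in for the decode loop: put() suspends whenever the buffer is full."""
    for i in range(n_tokens):
        await queue.put(f"tok{i}")  # waits here instead of growing memory
    await queue.put(None)           # sentinel: stream finished


async def slow_client(queue, delay_s, max_seen):
    """Stand-in for a slow consumer; records the largest buffer depth observed."""
    received = 0
    while True:
        max_seen[0] = max(max_seen[0], queue.qsize())
        item = await queue.get()
        if item is None:
            return received
        received += 1
        await asyncio.sleep(delay_s)


async def stream(n_tokens=50, buffer_size=8, delay_s=0.001):
    queue = asyncio.Queue(maxsize=buffer_size)  # the bound is the backpressure
    max_seen = [0]
    producer = asyncio.create_task(produce_tokens(queue, n_tokens))
    received = await slow_client(queue, delay_s, max_seen)
    await producer
    return received, max_seen[0]


received, peak_depth = asyncio.run(stream())
print(f"received={received} peak_buffer_depth={peak_depth}")
```

All 50 tokens arrive and the buffer never holds more than 8. The cost is that the producer waits on the slow client, so a real server would likely pair the bound with a timeout or disconnect policy.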

Worked Example

RAG pipeline: 8,500 input tokens, 250 output tokens, Anthropic Claude Sonnet.

Using illustrative rates (exact rates in pricing snapshot):

| Component | Tokens | Rate | Cost |
|---|---|---|---|
| Input (uncached) | 3,000 | p_in | 3,000 × p_in |
| Input (cached) | 5,500 | 0.1 × p_in | 550 × p_in equivalent |
| Output | 250 | 5 × p_in | 1,250 × p_in equivalent |

Even with 65% of input tokens cached, the 250 output tokens (about 3% of all tokens) account for roughly 26% of the total request cost. Per token, output is by far the most expensive component even though it is a small minority of the tokens.

Exact multipliers vary by provider and model; the ratio shape is what matters here.
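The shares in the worked example follow from a few lines of arithmetic. The sketch below expresses every cost in units of p_in. The function name is introduced here, and the 0.1 and 5.0 multipliers are the illustrative rates from the table, not a provider quote.

```python
def request_cost_units(uncached_in, cached_in, out, read_mult=0.1, out_mult=5.0):
    """Cost of one request in units of p_in (the uncached input token price)."""
    parts = {
        "input_uncached": uncached_in * 1.0,
        "input_cached": cached_in * read_mult,
        "output": out * out_mult,
    }
    total = sum(parts.values())
    shares = {name: value / total for name, value in parts.items()}
    return total, shares


total, shares = request_cost_units(uncached_in=3000, cached_in=5500, out=250)
token_share_out = 250 / (3000 + 5500 + 250)
print(f"total={total:.0f} x p_in  output cost share={shares['output']:.1%}  "
      f"output token share={token_share_out:.1%}")
```

Changing out or read_mult shows which lever moves the total more for a given workload shape.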

Decision rule

If output tokens dominate your workload, token price is not the primary lever. Reducing output length—through better prompts, structured output, or stopping criteria—can improve both latency and cost more than switching providers. If input tokens dominate and they are cacheable, prompt caching is the primary lever.

What To Measure

  • Input/output token ratio per workload

  • p50 and p95 output token length distribution

  • TTFT breakdown (is prefill or queue time dominant?)

  • TPOT under production concurrency

  • Cache hit rate on input tokens

Where this breaks
  • Chunked prefill can interleave prefill and decode work, blurring the phase boundary. TTFT improves but decode slots may compete with prefill chunks.

  • Speculative decoding generates multiple draft tokens per step, partially amortizing decode overhead, but only when the draft tokens are accepted.

  • Batch APIs process work asynchronously with different pricing and no TTFT constraint. The prefill/decode cost asymmetry still exists but is absorbed into a flat batch discount.

  • Multimodal inputs (images, audio) have different tokenization and prefill characteristics than text.

Calculator Hook

The LCPR calculator separates input and output token costs and shows the prefill/decode cost split. The sensitivity analysis reveals whether reducing output length or increasing cache hit rate has more impact on unit cost.


Chapter 6: Memory Economics

Field Problem

Same model, same hardware, same provider tier. The only change: conversations scale from 2K context to 32K-context document analysis. Throughput drops 75%. Cost per request quadruples. The provider has not changed anything. The KV cache ate the memory.

Mechanism

Every active sequence stores key and value tensors for every token it has seen—prompt tokens plus generated tokens—across every layer of the model. This is the KV cache. It is the largest dynamic memory allocation in inference serving.

| Symbol | Meaning |
|---|---|
| 2 | Key + Value |
| N_layers | Transformer layers |
| H_kv | Key/value heads (fewer than query heads with GQA/MQA) |
| D_head | Head dimension |
| E_kv | Bytes per element (FP16=2, FP8=1) |

The total KV memory for all live sequences:

M_total = 2 × N_layers × H_kv × D_head × E_kv × N_seq × T

Where N_seq is live sequences and T is prompt + generated tokens per sequence.

Available KV memory:

M_avail = M_hbm − M_weights − M_overhead

Maximum concurrent sequences:

N_seq_max = M_avail / (2 × N_layers × H_kv × D_head × E_kv × T)

Naive Answer

“Context length is a model setting. Longer context is better.”

Context length is a memory allocation. Every token of context consumes KV memory on every layer. Longer context means fewer concurrent sequences. Fewer concurrent sequences means lower throughput. Lower throughput at the same hardware cost means higher cost per token.

Better Model

KV cache creates two pressures:

Capacity pressure: Can the KV for all concurrent sequences fit in HBM? If not, sequences get preempted (evicted and later recomputed) or requests queue.

Bandwidth pressure: During each decode step, the attention mechanism reads KV state for every resident token. More tokens, more bytes read, more bandwidth consumed.

Both pressures increase with context length. But capacity pressure hits first and harder because it determines whether the sequence can be served at all, not just how fast.

Worked Example

Llama-3 8B (32 layers, 8 GQA KV heads, 128 dim) on A100 80GB:

Model weights (FP16): ~16 GB. Runtime overhead: ~4 GB. KV pool: ~60 GB.

| Context length | Max live sequences | Concurrent capacity |
|---|---|---|
| 2K tokens | ~234 | Plenty of headroom |
| 8K tokens | ~58 | Comfortable for most workloads |
| 32K tokens | ~14 | Tight; long requests evict others |
| 128K tokens | ~3 | Only 3 concurrent conversations |

With FP8 KV quantization (E_kv = 1 byte): all numbers double. FP8 KV is an economic lever, not just a precision choice. Operators report that microscaling FP8 (MXFP8) outperforms block-scaled FP8 on Blackwell, and that 4-bit MXFP4 without quantization-aware training degrades accuracy enough to disqualify it for most production workloads. Each step down in precision doubles capacity but narrows the workloads that tolerate the quality loss (Perplexity, "Hosting Qwen on Blackwell," 2026).

Illustrative—actual capacity depends on block allocator, runtime, prefix cache, and scheduler.
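The capacity formula is easy to check in code. The sketch below uses the Llama-3 8B shape and the ~60 GB pool from the worked example. The function names are introduced here, and context lengths are taken as round thousands of tokens. Results land within a few percent of the table, which appears to round 128 KiB per token to 128,000 bytes.

```python
def kv_bytes_per_token(n_layers, h_kv, d_head, e_kv):
    """Bytes of KV cache one token occupies across all layers (keys + values)."""
    return 2 * n_layers * h_kv * d_head * e_kv


def max_live_sequences(kv_pool_bytes, per_token_bytes, context_tokens):
    """Theoretical upper bound on concurrent sequences at a given context length."""
    return int(kv_pool_bytes // (per_token_bytes * context_tokens))


KV_POOL = 60e9  # bytes; the ~60 GB pool from the worked example

for label, e_kv in (("FP16", 2), ("FP8", 1)):
    per_token = kv_bytes_per_token(n_layers=32, h_kv=8, d_head=128, e_kv=e_kv)
    caps = {ctx: max_live_sequences(KV_POOL, per_token, ctx)
            for ctx in (2_000, 8_000, 32_000, 128_000)}
    print(label, f"{per_token / 1024:.0f} KiB/token", caps)
```

These are ceilings from the formula alone. Allocator overhead, prefix cache residency, and fragmentation all reduce what a running system sustains.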

A B2C real-time voice deployment (conversational AI customer support, 70B model on a dedicated pool serving roughly 240 concurrent agent sessions at peak) watched p99 inference latency drift from 920ms to 1,480ms over a 36-hour window. Mean and median were flat. The drift was confined to the tail.

The hypothesis chain ran for two days. We suspected a noisy-neighbor effect on a shared host (disproven: dedicated pool, single tenant). We suspected weight-paging pressure from a recent context-window bump (disproven: HBM headroom was 14GB at the worst point). We suspected scheduler starvation from long agent sessions (closer, but not the root).

The actual cause surfaced when we instrumented block-allocator residency: long-running agent threads were holding partial-prefix cache lines that newer sessions could not evict cleanly. The block allocator was leaving small unusable holes between live KV regions. Each new session arrived to a more fragmented pool than the last, and prefill had to chase progressively further to find contiguous blocks. Nothing was leaking; the pool was just deteriorating in shape.

The fix was unglamorous: a server-side restart every 18 hours dropped p99 back to baseline. Each restart cost about 7 minutes of unavailability (drained, restarted, warmed). We ran the numbers against SLO-credit exposure from a sustained 1,480ms tail and the math went the right way: scheduled restarts were cheaper than letting drift accumulate. Restart-as-policy is not elegant, but it is sometimes the right call when fragmentation is sub-quadratic in time and the workload is bursty enough that a 7-minute drain hits during a natural lull.

Side-finding that did not fit the runbook: a few weeks later we moved long-running agent traffic to a separate pool with shorter session TTLs, and the drift on the main pool disappeared entirely. The drift was a workload-mix problem more than a fragmentation problem. The two failure modes (long-session KV fragmentation, mixed-workload eviction pressure) look similar from p99 telemetry; only the pool-segmentation experiment disambiguated them. The restart cadence on the segmented pools dropped to once a week, and we eventually retired it.

KV fragmentation is real. Paged attention managers (PagedAttention, block allocators in TensorRT-LLM) reduce waste, but allocation overhead and fragmentation still accumulate during long runtime. The theoretical max from the formula above is an upper bound, not an operating target: expect 20-35% effective capacity degradation under sustained mixed traffic before fragmentation forces a restart or a workload-segmentation decision, not a 16x cliff.

In an enterprise RAG deployment, mixing short-context and long-context requests on the same GPU pool caused KV cache eviction cascades. The monitoring signal was cache usage near 100%, preemptions spiking, and running requests dropping simultaneously. The fix was workload-aware routing: short context to one pool, long context to another. Same hardware, different routing, different economics.

Long Context and the Memory Wall

The KV cache formula above changes when context length grows from conversation-scale to document-scale.

Each document is 50-100 pages, tokenized to 30,000-80,000 tokens. The model supports 128K context, so the document analysis product sends the full text. Longer context does mean better analysis. But it also means: each request consumes 5-10x more KV memory, concurrent requests drop from 40 to 4 on the same hardware, TTFT increases because prefill scales with prompt length, and cost per request is 5-10x higher even if the token price is the same.

Long context is a memory product. The team bought it without understanding the bill.

KV memory scales linearly with context length:

M_total ∝ T

Doubling context length T doubles KV memory and halves the number of concurrent sequences at the same HBM capacity. But the impact is worse than 2x because:

  1. Prefill time scales with prompt length. Longer prompts mean longer prefill, which means higher TTFT. For compute-bound prefill on long prompts, TTFT can scale superlinearly with prompt length because attention compute is quadratic in sequence length (FlashAttention and variants cut attention's memory traffic and footprint, but not its quadratic FLOP count).

  2. KV bandwidth scales with context. Each decode step reads KV for all resident tokens. Longer context means more bytes read per step, which means each decode step takes longer, which means TPOT can increase.

  3. Cache effectiveness drops. If each request has a unique 60K-token document as its prefix, prefix caching provides no benefit—every request has a different prefix. Long-context workloads are often low-reuse workloads.

  4. Context-tier pricing. Some providers charge more for long-context requests. This pricing reflects the serving cost: more memory consumed for longer duration, less concurrency available for other requests.

“Long context is just a better model. It understands more.”

Long context does enable better analysis of long documents. But it changes the economics. A workload that sends 80K tokens per request at batch 4 can cost 10x more per request than the same workload chunked into 8 requests of 10K tokens—and chunking may produce comparable quality for retrieval and summarization tasks, depending on the task.

Long context changes three economic variables simultaneously:

Capacity cost: More memory per request → fewer concurrent requests → higher cost per request at the same hardware budget.

Latency cost: Longer prefill → higher TTFT → risk of violating TTFT SLO. More KV to read per decode step → higher TPOT.

Quality return: Some tasks genuinely benefit from the model seeing the full document. Others work equally well with retrieval + shorter context. The right approach depends on the task, not the context window.

The decision is not “long or short context.” It is: does the quality improvement from longer context justify the capacity, latency, and cost increase?

Llama-3 8B on A100 80GB (continuing from above):

KV pool: ~60 GB. kv_bytes_per_token: 128 KB.

| Workload shape | Context | Concurrent | TTFT impact | Cost per request |
|---|---|---|---|---|
| Support chat | 2K | ~234 | Low | Low |
| RAG answer | 8K | ~58 | Moderate | Moderate |
| Document QA | 32K | ~14 | High | ~4x support chat |
| Full doc analysis | 128K | ~3 | Very high | ~16x support chat |

Illustrative throughput; queueing behavior dominates at the upper bound.

At 128K context, a single A100 can serve 3 concurrent requests. Assuming each request takes 30 seconds and the GPU runs at 65-80% effective utilization (typical for bursty long-context traffic, not the theoretical 100%), the GPU handles roughly 230-290 completed requests per hour. At $1.59/hr, that puts the GPU cost per request at roughly $0.0055-$0.0068, before quality gate, retries, and overhead. The naive figure of about $0.0044 (360 requests per hour at 100% utilization) assumes perfect packing and zero queueing slack, which long-context workloads almost never sustain. Any traffic spike beyond 3 concurrent requests causes queuing, preemption, or failure.
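The per-request arithmetic is worth keeping as a function so the utilization assumption stays explicit. The sketch below uses the figures from the paragraph above (3 concurrent requests, 30 seconds each, $1.59/hr). The function name is introduced here.

```python
def gpu_cost_per_request(concurrent, seconds_per_request, utilization, dollars_per_hour):
    """Completed requests per hour and GPU cost per request for one accelerator."""
    requests_per_hour = concurrent * (3600 / seconds_per_request) * utilization
    return requests_per_hour, dollars_per_hour / requests_per_hour


for utilization in (1.00, 0.80, 0.65):
    rph, cost = gpu_cost_per_request(3, 30, utilization, 1.59)
    print(f"utilization={utilization:.0%}  requests/hr={rph:.0f}  cost/request=${cost:.4f}")
```

The gap between the 100% row and the 65% row is the price of queueing slack, before any quality gate or retry multiplier is applied.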

Before choosing long context: measure whether the quality improvement justifies the cost. Run the task at multiple context lengths. If 8K context with good retrieval produces 90% of the quality of 64K full-context, the 4-8x cost difference should inform the architecture choice.

For workloads that genuinely need long context, budget for the memory. Price the capacity from KV requirements, not from model weight sizing.

Decision rule

Calculate KV bytes per token for your model. Multiply by your p90 context length and your target concurrent sequences. If the result exceeds your KV pool, you need one or more of: shorter context, FP8 KV, more GPUs, tensor parallelism, or workload-aware routing that separates short and long context.

What To Measure

  • gpu_cache_usage_perc (vLLM) or equivalent KV utilization metric

  • num_preemptions or KV eviction count

  • Context length distribution: p50, p90, p99

  • Output length distribution (output tokens also consume KV while resident)

  • Concurrent active sequences over time

  • Quality delta between chunked short-context and full long-context approaches

  • TTFT and TPOT by context length bucket

  • KV utilization under long-context traffic

  • Cost per accepted result by context length tier

Where this breaks
  • MQA, GQA, MLA, sliding-window attention, and sparse attention all change effective H_kv or T_resident per layer.

  • Block/paging managers reduce waste but have allocation overhead and metadata memory.

  • Beam search, n-samples, tool branches, and parallel calls multiply live sequences.

  • Cross-request prefix reuse saves prefill compute but still consumes KV memory for the shared prefix.

  • Sliding-window attention caps the effective KV per layer, reducing memory scaling at the cost of losing distant attention.

  • MLA and other KV compression techniques reduce per-token KV size, changing the capacity curve.

  • KV offload to host memory or NVMe can extend capacity but adds retrieval latency.

  • FlashAttention and variants reduce attention's memory footprint and memory traffic during prefill, but attention compute still grows quadratically with prompt length.

  • Some workloads (legal document analysis, genomics, codebase understanding) genuinely need long context and cannot chunk without quality loss. For these, long context is not optional; it is a capacity requirement.

Calculator Hook

The KV capacity analysis takes model config, KV dtype, available HBM, and your context-length distribution. It outputs max concurrent sequences by context length, memory pressure alerts when your traffic would exceed capacity, and long-context cost multipliers.


Chapter 7: Prompt Caching and Rematerialization

Field Problem

The customer support agent sends a 4,000-token system prompt, 2,000 tokens of tool definitions, and the growing conversation history on each turn. The system prompt and tools are identical across requests, which makes them perfect for prompt caching. Cache hit rate starts at 85%. Over two weeks, it drops to 40%. Nobody notices because the dashboard shows “caching enabled.” The bill increases 35%. Three engineers spend a week debugging what they think is a traffic spike. The regression was caused by a one-line change that added a timestamp to the system prompt.

Mechanism

Prompt caching avoids recomputing the KV state for a prefix that has already been processed. When a request arrives and shares an exact prefix with a previously cached request, the serving system loads the cached KV directly from memory instead of running the prefill computation. This saves both compute time (faster TTFT) and compute cost (less GPU work).

The key insight: caching is a memory hierarchy, not a billing discount. The billing discount is the visible effect. The mechanism is KV reuse across requests.

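The break-even formula, where N_breakeven is the number of cache reads needed after the initial write to recover the write premium:

N_breakeven = (p_write − p_in) / (p_in − p_read)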

| Symbol | Meaning |
|---|---|
| p_in | Uncached input token price |
| p_write | Cache write price (can be > p_in) |
| p_read | Cache read price (< p_in) |

If p_write > p_in, the first request costs more with caching enabled. The savings come from subsequent reads at p_read. If reuse does not arrive within TTL, caching is a net cost.
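Break-even can also be found from first principles by comparing cumulative prefix cost with and without caching. The sketch below assumes the first request pays p_write for the prefix and every later request arrives within the TTL and pays p_read. It counts total requests, including the one that writes the cache. Function names are introduced here, and prices are expressed as multiples of p_in using the illustrative multipliers quoted in this chapter.

```python
def cached_vs_uncached(n_requests, p_in, p_write, p_read):
    """Prefix cost for n_requests: the first writes the cache, the rest read it."""
    cached = p_write + (n_requests - 1) * p_read
    uncached = n_requests * p_in
    return cached, uncached


def requests_to_break_even(p_in, p_write, p_read, limit=1000):
    """Smallest request count at which caching is no more expensive than not caching."""
    for n in range(1, limit + 1):
        cached, uncached = cached_vs_uncached(n, p_in, p_write, p_read)
        if cached <= uncached:
            return n
    return None


scenarios = {
    "write 1.25x, read 0.10x": (1.0, 1.25, 0.10),
    "write 2.00x, read 0.10x": (1.0, 2.00, 0.10),
    "write 1.00x, read 0.50x": (1.0, 1.00, 0.50),
}
for name, (p_in, p_write, p_read) in scenarios.items():
    print(name, "-> break-even at request", requests_to_break_even(p_in, p_write, p_read))
```

Under these assumptions a write premium of 1.25x is recovered by the second request and a 2x premium by the third. With no write premium there is nothing to recover. Every miss caused by TTL expiry restarts the count.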

Naive Answer

“Turn on caching. It’s a discount.”

Caching is only a discount when:

  1. The prefix is stable (no timestamps, no dynamic state before the cache breakpoint).

  2. Requests reuse the prefix within the TTL.

  3. The cache hit rate is above the break-even point.

When any of these conditions fails, caching can cost more than uncached serving.

Better Model

Cache economics depend on three variables: stability, reuse rate, and TTL fit.

Stability: The cache key is the exact prefix content. Any change—a different timestamp, a reordered tool list, a modified retrieval chunk, a changed model—invalidates the cache. The cache-eligible prefix must be fixed across requests.

Reuse rate: How many requests share the same prefix within the TTL window? A customer support agent handling 100 concurrent conversations with the same system prompt has high reuse. A personalized recommendation system with per-user context has low reuse.

TTL fit: Does the inter-request gap fit within the cache retention window? If requests arrive every 10 minutes but the TTL is 5 minutes, most requests miss.
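The TTL-fit condition can be put in rough numbers. The sketch below assumes that requests for a given prefix arrive as a Poisson process (exponentially distributed gaps) and that every request, hit or miss, leaves the prefix cached for a fresh TTL window. Both are modeling assumptions rather than provider guarantees, and `expected_hit_rate` is a hypothetical helper name.

```python
import math


def expected_hit_rate(mean_gap_s: float, ttl_s: float) -> float:
    """P(gap < TTL) for exponential gaps: chance the next request finds the prefix warm."""
    if mean_gap_s <= 0 or ttl_s < 0:
        raise ValueError("mean_gap_s must be positive and ttl_s non-negative")
    return 1.0 - math.exp(-ttl_s / mean_gap_s)


# Requests every 10 minutes on average, 5-minute TTL (the case in the text).
short_ttl = expected_hit_rate(mean_gap_s=600, ttl_s=300)
# Same traffic with a 1-hour TTL.
long_ttl = expected_hit_rate(mean_gap_s=600, ttl_s=3600)

print(f"5-minute TTL: {short_ttl:.1%} expected hit rate")
print(f"1-hour TTL:   {long_ttl:.1%} expected hit rate")
```

Under these assumptions a 10-minute mean gap against a 5-minute TTL gives a hit rate of roughly 39%, so most requests miss. Real traffic is burstier than Poisson, so treat the result as a first estimate to compare against the measured inter-request gap distribution.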

Worked Example

Anthropic cache economics (illustrative multipliers from public docs):

5-minute TTL: p_write = 1.25 × p_in, p_read = 0.10 × p_in

1-hour TTL: p_write = 2.00 × p_in, p_read = 0.10 × p_in

OpenAI cache economics:

p_write ≈ p_in, p_read = 0.50 × p_in

OpenAI’s cache is cheaper to enter but saves less per hit (50% vs 90% reduction). Anthropic’s cache costs more to write but saves more per hit. The right choice depends on reuse rate and TTL fit.
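To compare the pricing schemes on one axis, the break-even formula from the Mechanism section can be evaluated directly. This is a minimal sketch: `breakeven_reads` is a hypothetical helper, prices are expressed as multiples of p_in, and the multipliers are the illustrative ones quoted above, not a current price list.

```python
def breakeven_reads(p_in: float, p_write: float, p_read: float) -> float:
    """N_breakeven = p_write / (p_in - p_read), the formula as given in the text."""
    if p_read >= p_in:
        raise ValueError("cache reads must be cheaper than uncached input")
    return p_write / (p_in - p_read)


# (p_write, p_read) as multiples of p_in, copied from the worked example.
SCHEMES = {
    "anthropic_5min": (1.25, 0.10),
    "anthropic_1h": (2.00, 0.10),
    "openai": (1.00, 0.50),
}

for name, (p_write, p_read) in SCHEMES.items():
    n = breakeven_reads(1.0, p_write, p_read)
    print(f"{name}: break-even at {n:.2f} cached reads")
```

The formula treats the write price as an up-front cost recovered by the per-read saving (p_in − p_read). A higher write price or a smaller read discount both push the break-even point up, which is why reuse rate and TTL fit decide the comparison.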

Google adds a third dimension: explicit context caches charge storage rent per token-hour. This adds a time cost that grows linearly with TTL. For rarely reused prefixes with long TTL, storage rent can exceed the savings.

Exact multipliers are in the pricing snapshot appendix; the direction and rough magnitude are stable across providers.

When prefix caching is active, any telemetry that depends on recomputation—activation capture, token-level latency profiling, coverage metrics—systematically under-samples repeated prefixes. The prefix is served from cache, not recomputed, so the measurement never fires. A 99.5% capture rate can still be biased if the missing samples correlate with workload intensity or prompt structure.

Decision rule

Treat cache hit rate as an operational SLO, not a billing optimization. Monitor it; alert on regressions; when it drops, investigate what changed in the prefix: a timestamp, a tool schema update, a retrieval ordering change, a model version bump. Most cache killers are not malicious. They are engineering changes made by people who do not know the prompt structure matters for caching.
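When hit rate regresses, a quick first check is to find where two consecutive prompts stop matching. The sketch below compares prompts character by character, which is only a proxy for the token-level prefix match a provider performs. The prompt text and helper names are hypothetical.

```python
def first_divergence(old: str, new: str) -> int:
    """Index of the first differing character, or -1 if the strings are identical."""
    for i, (a, b) in enumerate(zip(old, new)):
        if a != b:
            return i
    # One string is a prefix of the other, or they are equal.
    return -1 if len(old) == len(new) else min(len(old), len(new))


# Hypothetical prompt pieces, for illustration only.
STATIC_PREFIX = "You are a support agent. Follow the refund policy.\n"


def timestamp_first(ts: str) -> str:
    # Dynamic content ahead of the static prefix: breaks the shared prefix early.
    return f"Current time: {ts}\n" + STATIC_PREFIX


def timestamp_last(ts: str) -> str:
    # Dynamic content after the static prefix: the prefix stays byte-identical.
    return STATIC_PREFIX + f"Current time: {ts}\n"


t1, t2 = "2025-01-01T10:00:00", "2025-01-01T10:00:05"
early = first_divergence(timestamp_first(t1), timestamp_first(t2))
late = first_divergence(timestamp_last(t1), timestamp_last(t2))

print(f"timestamp first: prompts diverge at char {early}")
print(f"timestamp last:  prompts diverge at char {late}")
```

With the timestamp first, the prompts diverge before any of the static prefix has been matched. With the timestamp last, the whole static prefix remains shareable. Running a check like this in CI against the previous release's prompt would catch the one-line timestamp change described in the field problem.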

What To Measure

  • Cache hit rate per workload (cache_read_input_tokens / total input tokens)

  • Cache write rate (how often new prefixes are written)

  • Inter-request gap distribution for the same prefix

  • Cost difference between cached and uncached periods

  • Prefix stability (did the system prompt or tools change?)
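The first metric in this list can be computed from per-request usage records. This is a sketch under assumptions: the field names follow the ones Anthropic reports in its usage object, the three counts are treated as disjoint, and `cache_hit_rate` is a hypothetical helper. Other providers name and structure these fields differently, so check the provider documentation before relying on it.

```python
from typing import Iterable, Mapping


def cache_hit_rate(usages: Iterable[Mapping[str, int]]) -> float:
    """Token-weighted hit rate: cached-read input tokens / all input tokens."""
    read = written = uncached = 0
    for usage in usages:
        read += usage.get("cache_read_input_tokens", 0)
        written += usage.get("cache_creation_input_tokens", 0)
        uncached += usage.get("input_tokens", 0)
    total = read + written + uncached
    return read / total if total else 0.0


# Two hypothetical requests: the first writes a 6,000-token prefix, the second reads it.
sample = [
    {"input_tokens": 500, "cache_creation_input_tokens": 6000, "cache_read_input_tokens": 0},
    {"input_tokens": 700, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 6000},
]
print(f"hit rate: {cache_hit_rate(sample):.1%}")
```

Computed per workload and per day, this is the number to alert on. A token-weighted rate is used here because cost follows tokens, not request counts.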

Where this breaks
  • Timestamps, request IDs, user-specific data, or retrieval results before the static prefix invalidate the cache on every request.

  • TTL shorter than the median inter-request gap means most requests miss.

  • Provider routing can scatter requests across replicas, reducing cache locality.

  • Batch mode and enterprise tier may have different cache semantics than real-time mode.

  • Cache hit rate can look good while quality degrades if the cached prefix includes stale context.

  • Different providers have different cache-breaking events. Changing tool definitions invalidates the entire cache on Anthropic. Changing the effort/speed setting invalidates it. Model changes invalidate it everywhere.

Calculator Hook

The cache break-even analysis takes the pricing snapshot, prefix tokens, expected reuse rate, and TTL. It outputs break-even calls, savings at given reuse, and waste when reuse fails. The sensitivity analysis shows how cache hit rate changes total cost.


Chapter 8: Model Architecture and Cost

Field Problem

Inference latency is 3x higher than the benchmark predicted. The Mixture-of-Experts model—671B total parameters, 37B active per token—fits in memory, barely. The throughput-per-dollar numbers looked strong. But expert routing creates all-to-all communication traffic between GPUs, and the deployment’s interconnect is too slow to handle it. The benchmarks ran on an NVLink-connected 8-GPU server. The deployment uses PCIe GPUs with Ethernet scale-out.

Mechanism

Mixture-of-Experts models route each token to a subset of “expert” subnetworks. This means three things at once:

  • Active parameters per token (the experts actually invoked) are a small fraction of total parameters.

  • Expert dispatch must complete within each decode step, which makes interconnect latency a first-order constraint.

  • All experts must stay memory-resident across the GPU pool, so total parameters still set the memory bill.

The economic promise of MoE: active compute is much smaller than total parameters, so inference can be faster and cheaper, as long as the experts are in memory and reachable at low latency.

The economic catch: experts may be placed on different GPUs. Routing tokens to their experts and combining results creates all-to-all communication. This communication must complete within each decode step. If the interconnect is slow (Ethernet, PCIe), the communication becomes the bottleneck—not the compute, not the memory bandwidth.

Naive Answer

“MoE has 671B parameters but only activates 37B. It’s basically the cost of a 37B model.”

Not if the 671B parameters need memory residency and expert dispatch needs fast interconnect. The cost is:

  • Compute: proportional to active parameters (good)

  • Memory capacity: proportional to total parameters (unchanged)

  • Memory bandwidth: depends on weight fetch (total resident, but only active read) plus KV

  • Communication: proportional to expert dispatch traffic and interconnect speed (new cost)

MoE shifts cost from compute to communication and memory residency. Whether that is a net win depends on the topology.

Better Model

The critical question is whether expert dispatch stays inside a low-latency scale-up domain or crosses slower scale-out links.

| Topology | Expert dispatch latency | MoE economics |
| --- | --- | --- |
| 8-GPU NVLink (900 GB/s) | Low | MoE sparsity advantage realized |
| 72-GPU NVLink domain (1.8 TB/s/GPU) | Very low | Best case for large MoE |
| PCIe within server | Medium | Partial advantage |
| Ethernet across servers | High | Sparsity advantage can be lost to communication |

At an enterprise deployment, Deep Packet Inspection on the network path added 5-15ms to every inter-node RPC. For tensor parallelism with thousands of allreduce operations per second, a 70B model ran 10x slower than benchmarks. The fix—a dedicated ML subnet—took six weeks of security reviews. Network topology is a serving physics variable that no benchmark captures.

Expert dispatch on MoE is even more sensitive than tensor-parallel allreduce: traffic is more frequent and less predictable. If the network path is slow, MoE serving pays the penalty on every layer of every decode step.

Worked Example

DeepSeek-V3 style MoE: 671B total, 37B active, 8 experts activated per token.

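Weight residency (FP8): 671B × 1 byte = 671 GB. Does not fit on a single GPU, and does not fit on 4× H200 (141 GB each, 564 GB total) or 8× H100 (80 GB each, 640 GB total) either, even before KV cache and activations. Weights alone need at least 5× H200 (705 GB) or 9× H100 (720 GB); in practice that means an 8× H200 server or H100s spread across two nodes. Expert placement across GPUs means routing traffic at every MoE layer.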
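The residency arithmetic is simple enough to check directly. The sketch below counts weights only: it ignores KV cache, activations, and framework overhead, so it gives a lower bound on GPU count. `min_gpus_for_weights` and `usable_fraction` are hypothetical names introduced for illustration.

```python
import math


def min_gpus_for_weights(
    total_params_billions: float,
    bytes_per_param: float,
    gpu_mem_gb: float,
    usable_fraction: float = 1.0,
) -> int:
    """Lower bound on GPUs needed just to hold the weights."""
    # 1e9 params * bytes per param / 1e9 bytes per GB = GB of weights.
    weight_gb = total_params_billions * bytes_per_param
    return math.ceil(weight_gb / (gpu_mem_gb * usable_fraction))


# 671B parameters at FP8 (1 byte per parameter).
print("H200 (141 GB):", min_gpus_for_weights(671, 1.0, 141))
print("H100 (80 GB): ", min_gpus_for_weights(671, 1.0, 80))
# Reserving 20% of each GPU for KV cache and activations (an assumed figure).
print("H200, 80% usable:", min_gpus_for_weights(671, 1.0, 141, usable_fraction=0.8))
```

Because servers come in fixed GPU counts, the practical answer rounds up to the next server size, and any count above one server's worth puts expert dispatch on the scale-out network.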

On an 8-GPU NVLink server (900 GB/s bidirectional): expert dispatch for 8 active experts per token across 8 GPUs creates all-to-all traffic. NVLink bandwidth is sufficient for moderate batch sizes. Latency penalty per layer is small.

On Ethernet scale-out (25-100 Gbps): the same expert dispatch pays a latency penalty that is orders of magnitude larger. Expert-to-expert communication that takes microseconds on NVLink can take milliseconds on Ethernet, and the penalty accumulates across layers.

Quantitative latency depends on implementation, batch size, and actual routing patterns; the relative shape is robust.

Disaggregated Prefill/Decode and Maturity Gates

MoE is one architectural choice that changes serving economics. Disaggregated prefill/decode is another.

The idea is compelling: separate prefill workers (compute-heavy) from decode workers (memory-bandwidth-heavy) so each can be optimized independently. After three weeks of implementation, latency improves for long-prompt workloads. But overall system reliability drops: KV transfer adds a new failure mode, autoscaling becomes harder because prefill and decode pools scale independently, and debugging multi-hop request paths is more difficult.

The optimization is real. The maturity gate was not met.

In colocated serving, prefill and decode run on the same GPU. A long prefill burst can preempt decode work and spike TPOT for in-flight decode requests. A large decode batch can delay new prefill work and spike TTFT for incoming requests.

Disaggregated serving splits these phases:

  1. Prefill workers process the input prompt and produce KV state.

  2. KV transfer moves the computed KV from prefill worker to decode worker.

  3. Decode workers receive the KV and generate output tokens.

Benefits:

  • Prefill and decode can use different hardware configurations (more compute for prefill, more bandwidth for decode).

  • Prefill bursts do not interfere with decode latency.

  • TTFT and TPOT can be scaled independently.

  • Each pool can be right-sized for its workload shape.

Costs:

  • KV transfer adds latency and network bandwidth requirements.

  • Prefill-to-decode routing becomes a scheduling problem.

  • Two pools to autoscale instead of one.

  • New failure modes: KV transfer failure, prefill/decode imbalance, transfer timeout.

  • Debugging is harder because a single request traverses multiple workers.

In an edge deployment across 100+ locations, T4 GPUs hit thermal limits during peak traffic. Ambient site temperatures of 30-35C pushed GPU temperatures to the 83C throttle point. p99 latency doubled at 40+ sites simultaneously. Power capping from 70W to 60W stabilized tail latency at the cost of 15% peak throughput. The hardware datasheet does not specify operating ambient at the upper end of edge-site reality.

This is not about disaggregation. It is about the gap between benchmark conditions and production conditions. The same gap applies when evaluating disaggregation: benchmark demonstrations run in controlled environments with stable traffic, fast interconnect, and simple failure modes. Production adds thermal throttling, network jitter, traffic bursts, cold starts, and operational complexity.

Disaggregation readiness is a gate, not a toggle. Six conditions should be met before disaggregation is the right optimization:

1. Is TTFT dominated by long-prompt prefill? If most prompts are short (under 2K tokens) and TTFT is already within SLO, prefill is not the bottleneck and disaggregation does not help.

2. Is decode latency being interrupted by prefill bursts? If TPOT is stable under traffic, colocated serving is handling the phase interference well. Chunked prefill (interleaving prefill chunks with decode steps) can reduce this interference without full disaggregation.

3. Is there enough traffic to keep separate pools utilized? Disaggregation creates two pools that each need minimum utilization. If traffic is too low, one pool sits idle while the other is overloaded. This wastes capacity.

4. Can KV transfer fit inside the SLO? If the KV state for a request is 500 MB and the interconnect between prefill and decode workers is 25 Gbps Ethernet, transfer alone takes ~160ms. For a 300ms TTFT target, that is more than half the budget consumed by transfer.

5. Can the platform autoscale prefill and decode independently? If the autoscaler cannot distinguish prefill pressure from decode pressure, it will scale the wrong pool or both pools together—losing the independent scaling advantage.

6. Can the team debug multi-step routing failures? When a request fails, is it because prefill failed, KV transfer timed out, the decode worker rejected the KV, or the routing layer misrouted? If the team cannot diagnose these failure modes quickly, disaggregation adds incident duration.
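Condition 4 is the easiest of the six to check numerically. The sketch below computes serialization time only (bytes over line rate) and ignores protocol overhead, congestion, and per-hop latency, so real transfers take longer. The function names are hypothetical.

```python
def kv_transfer_ms(kv_bytes: float, link_gbps: float) -> float:
    """Time to push kv_bytes over a link of link_gbps gigabits per second, in ms."""
    bits = kv_bytes * 8
    return bits / (link_gbps * 1e9) * 1e3


def transfer_share_of_slo(kv_bytes: float, link_gbps: float, ttft_slo_ms: float) -> float:
    """Fraction of the TTFT budget consumed by KV transfer alone."""
    return kv_transfer_ms(kv_bytes, link_gbps) / ttft_slo_ms


KV_BYTES = 500e6  # 500 MB of KV state, as in condition 4

for gbps in (25, 100, 400):
    ms = kv_transfer_ms(KV_BYTES, gbps)
    share = transfer_share_of_slo(KV_BYTES, gbps, ttft_slo_ms=300)
    print(f"{gbps:>3} Gbps: {ms:6.1f} ms, {share:.0%} of a 300 ms TTFT budget")
```

At 25 Gbps the 500 MB transfer takes about 160 ms, matching the figure in condition 4. The same payload on a 400 Gbps link takes about 10 ms, which is why interconnect speed decides whether this gate passes.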

Disaggregation economics for a long-prompt document analysis workload:

| Scenario | Colocated | Disaggregated |
| --- | --- | --- |
| Mean prompt length | 40K tokens | 40K tokens |
| Mean output length | 500 tokens | 500 tokens |
| p99 TTFT | 2,800ms (prefill burst interference) | 1,200ms (dedicated prefill pool, inclusive of ~80ms KV transfer) |
| p99 TPOT | 45ms/tok | 38ms/tok (no prefill interference) |
| KV transfer overhead | 0 | ~80ms per request |
| Operational complexity | Baseline | 2x autoscaling, new failure modes |
| GPU utilization | 72% average (but bursty) | 65% prefill, 80% decode |
| Net TTFT improvement | n/a | 57% |
| Infrastructure cost | Baseline | +15% (two pools, transfer networking) |

Illustrative; real numbers depend on hardware, interconnect, traffic shape, and framework maturity.

For this workload, disaggregation significantly improves TTFT at a 15% infrastructure cost increase. Whether that is worthwhile depends on whether TTFT is the binding constraint. If the customer cares about total processing time for a batch of documents, the 57% TTFT improvement may not matter.

Do not disaggregate until you have: (1) measured that prefill/decode interference is the binding latency constraint, (2) enough traffic to keep both pools utilized, (3) interconnect fast enough for KV transfer within SLO, (4) independent autoscaling for both pools, and (5) observability for multi-hop request debugging.

Chunked prefill is the first optimization to try. It reduces phase interference without the operational cost of disaggregation.

Decision rule

Before deploying an MoE model, verify that your deployment topology keeps expert dispatch inside a low-latency interconnect domain. If your GPUs are connected by Ethernet or slow PCIe, the communication cost may erase the sparsity advantage. Treat “active parameters” as a compute metric, not an economic summary. Memory residency, communication, and topology are the other three cost dimensions.

Before disaggregating prefill/decode, verify the six readiness conditions above. If the workload is short-prompt or low-traffic, the operational cost exceeds the benefit.

What To Measure

  • Expert routing skew (are some experts overloaded?)

  • All-to-all communication time per layer

  • Total communication overhead as fraction of decode step

  • NVLink vs Ethernet dispatch latency in your deployment

  • Whether batch size is limited by KV capacity or expert routing

  • TTFT breakdown: is prefill time, queue wait, or routing delay dominant?

  • TPOT stability: does TPOT spike during prefill bursts?

  • KV transfer time: network bandwidth between prefill and decode workers

  • Prefill and decode pool utilization separately

  • Request failure rate by stage (prefill, transfer, decode)

Where this breaks
  • Expert routing policies (top-k, load balancing, auxiliary losses) change communication patterns.

  • Shared experts (used by every token) create a different memory and compute profile than routed experts.

  • MLA (Multi-head Latent Attention) and similar techniques can reduce KV requirements per token for MoE models.

  • Pipeline parallelism can reduce per-stage memory pressure but adds inter-stage latency.

  • Software stack maturity varies. Expert parallelism kernel support in vLLM, SGLang, and TensorRT-LLM is not identical.

  • Small deployments (1-4 GPUs) do not have enough hardware to split meaningfully for disaggregation.

  • Short-prompt workloads (support chat, classification) gain little from disaggregation because prefill is already fast.

  • Mixed workload clusters may benefit more from workload-aware routing than from full disaggregation.

  • KV compression and quantization can reduce transfer costs but add compute overhead on the sending or receiving side.

Calculator Hook

The serving physics analysis for MoE takes total parameters, active parameters, expert count, activated expert count, GPU count, and interconnect bandwidth. It outputs whether the deployment is compute-limited, memory-limited, or communication-limited. For disaggregation, the inputs are prompt length distribution, output length distribution, current TTFT/TPOT, interconnect bandwidth, and traffic volume. The outputs are expected TTFT improvement, KV transfer overhead, pool sizing, and pass/fail on each readiness condition.


Chapter 9: Productive Capacity and Routing

Field Problem

A capacity planning team reports 78% GPU utilization and calls the cluster “well utilized.” The product team reports that 30% of requests exceed the p99 latency SLO. The support team reports rising escalation rates. Finance asks why the bill is growing while utilization is high.

78% GPU utilization measures how often the GPU is doing something. It does not measure how much of that work produced accepted output at target latency. The GPU can be busy recomputing evicted KV cache, running speculative decode drafts that get rejected, processing retries, or serving requests that will fail quality gates.

Mechanism

Throughput is work attempted: total tokens generated per second.

Goodput is work completed: accepted tokens generated per second, meeting latency, quality, and reliability constraints.

Cost per accepted result divides total cost by accepted requests. The denominator counts accepted requests only. The numerator includes all costs: failed attempts, retries, quality failures, and wasted speculative tokens. This is the connection between serving physics and LCPR: goodput is the bridge between GPU work and accepted economics.

The gap between allocated and productive GPU capacity is large. A 2024 industry survey found the majority of organizations achieve less than 70 percent GPU Allocation Utilization at peak demand, with common figures closer to 10-20 percent. One serverless provider has engineered cold start latency from tens of minutes down to approximately 50 seconds through GPU memory checkpointing and pre-warmed machine buffers, enabling tighter supply-demand matching for inference workloads (State of AI Infrastructure at Scale, 2024; Modal, "Truly Serverless GPUs," 2026).

Naive Answer

“High utilization means the GPU is efficient.”

Utilization can be high while goodput is low. Reasons:

  • Retries and repairs consume GPU time on work that will not be accepted.

  • KV cache preemptions and recomputation waste bandwidth on repeated prefill.

  • Speculative decode drafts consume compute even when the acceptance rate is low.

  • Quality failures at the model level waste every token from the failed request.

  • Queue buildup causes SLO violations that are not visible in utilization metrics.

In a high-concurrency interactive workload, speculative decoding with a domain fine-tuned 1B draft model produced roughly 0.90x throughput: slower, not faster. The acceptance rate improved from about 48% to 58%, but under production batch sizes of 10-14 the extra verification compute exceeded the drafting savings. Speculative decoding works best at low concurrency with high acceptance rates; once verification has to run across a wide batch every step, the economics can flip.

The interactive-workload result is workload-dependent, not a general indictment of speculative decoding. A search infrastructure provider reports positive production results with in-house multi-token-prediction draft layers, likely because the search workload's constrained output distribution yields higher draft acceptance rates than open-ended generation. The decision variable is not "does speculative decoding work" but "does your workload's output distribution produce an acceptance rate high enough to offset the draft model's compute under your operating batch size?" (Perplexity, "Hosting Qwen on Blackwell," 2026).
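The decision variable in the last sentence can be explored with a simple model. The sketch below uses the standard expected-tokens-per-step expression for speculative decoding with draft length k and an independent per-token acceptance probability alpha, namely (1 − alpha^(k+1)) / (1 − alpha). The cost parameters are hypothetical: `draft_cost` and `verify_cost` are expressed relative to one ordinary decode step, and the values used are chosen only to show how the result flips. They are not measurements from the deployments described above.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verify step, assuming i.i.d. acceptance."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def speculative_speedup(alpha: float, k: int, draft_cost: float, verify_cost: float) -> float:
    """Tokens per unit time relative to plain decoding (1.0 = break-even)."""
    step_cost = k * draft_cost + verify_cost  # k draft passes plus one verify pass
    return expected_tokens_per_step(alpha, k) / step_cost


ALPHA, K = 0.58, 4  # acceptance rate from the text; draft length is assumed

# Low concurrency: verifying k+1 tokens costs about one memory-bound decode step.
low_batch = speculative_speedup(ALPHA, K, draft_cost=0.05, verify_cost=1.0)
# Wide batch: verification is compute-bound and costs more (hypothetical 2.3x).
wide_batch = speculative_speedup(ALPHA, K, draft_cost=0.05, verify_cost=2.3)

print(f"low concurrency: {low_batch:.2f}x")
print(f"wide batch:      {wide_batch:.2f}x")
```

The same acceptance rate yields a speedup in one regime and a slowdown in the other. The only thing that changed is the relative cost of verification, which is what rises with operating batch size.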

After quantizing a production model, quality failures appeared in the long tail: rare entities, numeric precision errors, and multi-turn conversation degradation after 5-6 turns. Standard A/B tests did not catch these. Continuous log analysis—reviewing real production outputs—was the only reliable detection method. Quantization improves throughput by 1.6x and cuts memory by 3-4x, but the quality risk is in the tail, not the average.

Quantization increases raw throughput and utilization. But if it introduces quality failures that increase retries, human escalation, or silent degradation, the productive capacity—measured in LCPR—may not improve.

Better Model

Replace utilization with productive capacity:

| Metric | Measures | Useful for |
| --- | --- | --- |
| GPU utilization % | Time GPU is active | Detecting idle hardware |
| Raw throughput (tok/s) | Tokens generated | Peak capacity sizing |
| Goodput (accepted req/s) | SLO-compliant work per second | Production capacity planning |
| Cost per accepted unit | Economics per accepted work | LCPR-driven decisions |

Worked Example

Two configurations for the same RAG workload, 100 requests each:

| Metric | Route A | Route B |
| --- | --- | --- |
| Raw throughput (tok/s) | 1,200 | 800 |
| p99 TTFT | 1,400ms (fails 800ms SLO) | 650ms (passes) |
| p99 TPOT | 55ms/tok | 42ms/tok |
| Quality pass rate | 72% | 91% |
| Requests meeting ALL gates | 58 | 85 |
| Goodput (accepted req/s) | 5.8 | 8.5 |
| Total cost (100 req) | $1.10 | $1.45 |
| Cost per accepted result | $0.019 | $0.017 |

Route A wins on raw throughput and total cost. Route B wins on goodput and cost per accepted result. A benchmark picks Route A. A goodput frontier test picks Route B.

Illustrative; the goodput-frontier shape holds across most production routing decisions.
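The two derived rows of the comparison follow from three inputs per route. This is a minimal sketch using the worked example's figures. The 10-second measurement window is an assumption inferred from the goodput values (58 accepted requests at 5.8 req/s), and `RouteResult` is a hypothetical container.

```python
from dataclasses import dataclass


@dataclass
class RouteResult:
    name: str
    accepted: int          # requests that met every latency and quality gate
    window_s: float        # measurement window in seconds (assumed)
    total_cost_usd: float  # cost of ALL requests, accepted or not

    @property
    def goodput(self) -> float:
        """Accepted requests per second."""
        return self.accepted / self.window_s

    @property
    def cost_per_accepted(self) -> float:
        """Total spend divided by accepted results only."""
        if self.accepted == 0:
            return float("inf")
        return self.total_cost_usd / self.accepted


routes = [
    RouteResult("Route A", accepted=58, window_s=10.0, total_cost_usd=1.10),
    RouteResult("Route B", accepted=85, window_s=10.0, total_cost_usd=1.45),
]

for r in routes:
    print(f"{r.name}: goodput {r.goodput:.1f} req/s, ${r.cost_per_accepted:.3f} per accepted result")

best = min(routes, key=lambda r: r.cost_per_accepted)
print("lowest cost per accepted result:", best.name)
```

Ranking by `cost_per_accepted` instead of total cost is what reverses the choice: Route B spends more in total but spreads it over more accepted results.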

Stateful Routing and Cache Locality

One of the most impactful levers for productive capacity is where the work goes.

Eight replicas behind a round-robin load balancer. Cache hit rate drops from 85% to 12%. TTFT increases 4x. The support chatbot’s cache configuration has not changed. The problem is the load balancer: it sends each request to the next replica in rotation regardless of cache state, so the same conversation’s system prompt is cached on replica 3 but the next turn goes to replica 7, where the cache is cold.

Traditional load balancing treats replicas as interchangeable. Inference replicas are not interchangeable. Each replica holds different KV state.

Inference serving systems maintain per-replica state:

  • KV cache: Each replica holds KV blocks for recently served prefixes. A cache hit requires routing the request to the replica that holds the matching prefix.

  • Prefix cache: Some systems hash prefixes and store KV blocks by hash. The same prefix hashed on different replicas creates duplicate storage; routing the request to the right replica avoids recomputation.

  • Session state: Multi-turn conversations benefit from sticky routing to the replica that holds the conversation’s growing context.

Routing policies for inference:

| Policy | Mechanism | When it works | When it fails |
| --- | --- | --- | --- |
| Round-robin | Even distribution | Stateless or batch workloads | Destroys cache locality |
| Least-request | Route to least-loaded | When all replicas are cold | Ignores cache state |
| Prefix-aware | Route by prefix hash | High-reuse prefixes | Low-reuse or unique-prefix workloads |
| KV-cache-aware | Route to replica with matching KV | Multi-turn, high cache benefit | Requires cache state visibility |
| Sticky session | Route same session to same replica | Conversational workloads | Creates hotspots if sessions are uneven |
| Fallback | Route to any when target is saturated | Burst handling | Cold start penalty on fallback |

The naive answer: “Use standard HTTP load balancing. It works for web traffic.”

Web server replicas are stateless. Inference replicas are stateful when KV cache, prefix reuse, or session affinity matter. Routing a request to the wrong replica means cache miss → full prefill recomputation → higher TTFT, higher cost, wasted memory on the original replica’s cached state.

Cache-local routing creates two tradeoffs. The first is locality versus load balance: routing to the cache-hot replica improves TTFT and cost but can overload that replica if many requests share the same prefix. The routing layer needs a saturation threshold that routes to the cache-hot replica when it has capacity and spills elsewhere when it does not.

The second is cache diversity versus cache efficiency. If all requests go to the same replica, one replica carries the load and the rest sit idle. If requests are spread evenly, cache hit rates drop. The optimal distribution depends on prefix diversity, traffic shape, and the relative cost of cache misses versus load imbalance.
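The saturation-threshold idea can be sketched in a few lines. This is an illustrative policy, not any particular router's implementation: the prefix is hashed to a "home" replica, the request goes there while the replica is under a configurable in-flight limit, and otherwise it spills to the least-loaded replica. `route` and `max_inflight` are hypothetical names.

```python
import hashlib


def route(prefix: str, inflight: list[int], max_inflight: int) -> int:
    """Return a replica index: the prefix's home if it has capacity, else least-loaded."""
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    home = int.from_bytes(digest[:8], "big") % len(inflight)
    if inflight[home] < max_inflight:
        return home  # cache-hot replica has room: keep locality
    # Home is saturated: accept a cold cache to protect latency.
    return min(range(len(inflight)), key=inflight.__getitem__)


SYSTEM_PROMPT = "support-agent system prompt + tool definitions"  # placeholder text

idle = [0] * 8
home = route(SYSTEM_PROMPT, idle, max_inflight=4)
print("home replica:", home)

busy = [1] * 8
busy[home] = 4  # home replica is at its in-flight limit
print("spill target:", route(SYSTEM_PROMPT, busy, max_inflight=4))
```

A modulo hash like this reshuffles every prefix when the replica count changes. A consistent-hashing scheme would limit that churn, at the cost of more code than fits in a sketch.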

Support chatbot: 8 replicas, 1 system prompt + tool prefix (6,000 tokens), 10 concurrent conversations.

| Routing policy | Cache hit rate | Mean TTFT | Effective cost |
| --- | --- | --- | --- |
| Round-robin | ~12% (1/8 chance) | 1,200ms | Baseline |
| Prefix-aware | ~85% (routed to prefix holder) | 300ms | ~60% savings on input cost |
| Sticky session | ~90% (conversation state preserved) | 250ms | ~65% savings |

Illustrative; exact hit rates depend on traffic shape, TTL, and replica capacity.

The difference between round-robin and prefix-aware routing is 4x TTFT and 60% cost reduction—not from a better model or cheaper hardware, but from routing the request to the right place.

Decision rule

Measure and report goodput under SLO, not raw throughput or utilization. GPU-hour math without goodput under SLO is spreadsheet fiction.

When running benchmarks, define your SLO (TTFT, TPOT, E2E, quality pass rate) before measuring. Run with your actual prompt/output distribution, not synthetic uniform inputs. Use Poisson arrivals for online serving benchmarks, not closed-loop. Report cost per accepted result, not cost per total request.
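The open-loop arrival pattern recommended here is straightforward to generate. The sketch below draws exponential inter-arrival gaps, which is what makes the arrival process Poisson, and returns a send schedule. A load generator would then fire each request at its scheduled time whether or not earlier requests have finished. `poisson_arrival_times` is a hypothetical helper.

```python
import random


def poisson_arrival_times(rate_per_s: float, duration_s: float, seed: int = 0) -> list[float]:
    """Send times (seconds from start) for an open-loop Poisson arrival process."""
    if rate_per_s <= 0:
        raise ValueError("rate_per_s must be positive")
    rng = random.Random(seed)  # seeded so a benchmark run is repeatable
    t = 0.0
    times: list[float] = []
    while True:
        t += rng.expovariate(rate_per_s)  # exponential gap with mean 1 / rate
        if t >= duration_s:
            return times
        times.append(t)


schedule = poisson_arrival_times(rate_per_s=10.0, duration_s=100.0)
print(f"{len(schedule)} requests scheduled over 100 s")
print("first five send times:", [round(t, 3) for t in schedule[:5]])
```

A closed-loop benchmark waits for each response before sending the next request, so a slow server quietly reduces its own load. An open-loop schedule keeps offering load at the target rate, which is what exposes queue buildup and tail latency.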

If your workload has reusable prefixes and you use multi-replica serving, routing policy is an economic lever. Implement prefix-aware or KV-cache-aware routing before spending on more hardware. Measure cache hit rate by replica. If hit rates are low but prefix diversity is low (most requests share a few prefixes), routing is the problem.

What To Measure

  • Goodput: accepted results per second under TTFT, TPOT, and quality gates

  • Cost per accepted result (all costs in numerator, accepted only in denominator)

  • Retry rate and retry cost as fraction of total cost

  • Quality failure rate by failure mode

  • Speculative decode acceptance rate and net throughput impact

  • KV eviction rate and recomputation overhead

  • Cache hit rate per replica

  • Prefix overlap percentage (how many requests share the top-K prefixes?)

  • TTFT on cache hit vs cache miss

  • Routing spillover rate (requests sent to non-ideal replica due to saturation)

Where this breaks
  • Without quality labels, goodput degrades to “latency-constrained throughput”—still useful, but incomplete.

  • Streaming workloads care about jitter and inter-token latency, not just mean TPOT.

  • Batch workloads use completion-window SLOs, not interactive TTFT/TPOT.

  • Multi-turn conversations have different goodput semantics: is the unit a turn or a session?

  • Low-reuse workloads (unique documents, personalized prompts) do not benefit from prefix-aware routing because every request has a different prefix.

  • Scaling up replicas dilutes cache locality. Adding replicas to handle traffic can reduce cache hit rate, counterintuitively increasing per-request cost.

  • Disaggregated prefill/decode complicates routing because the prefill worker and decode worker may be different—and the KV must transfer between them.

  • Multi-region deployments may not share cache state, so cross-region routing is always a cold start.

  • Prefix hash collisions, KV eviction, and cache capacity limits mean cache-aware routing is probabilistic, not guaranteed.

Calculator Hook

The goodput calculator takes per-request latency, quality labels, and cost data. It outputs goodput by request rate, cost per accepted unit, and the recommended operating point. The serving physics analysis includes a cache-locality mode: it takes prefix diversity, replica count, and traffic shape, and outputs expected cache hit rate and cost difference by routing policy.


Part 2 Summary

Six mechanisms set inference serving cost and capacity: the hardware roofline and the batch-size frontier, the prefill/decode asymmetry, KV memory economics, prompt caching as a memory hierarchy, MoE topology and disaggregation readiness, and productive capacity under SLO. Each is independently load-bearing; together they are the vocabulary the rest of the book uses. When Part 3 calls a support workload "output-bound," it means the prefill/decode economics apply. When Part 4 says "check KV capacity before choosing dedicated," it means the memory-economics formula. When Part 5 says "measure goodput, not throughput," it means the productive-capacity definition.

The physics explains the economics. The economics determine the decisions.


Evidence Notes for Part 2

| Claim | Type | Source |
| --- | --- | --- |
| Roofline model applies as a bound for inference steps | Public, Derived | Williams et al. 2009 (CACM), adapted for transformer decode |
| Batch amortizes weight fetch but not KV fetch | Derived, Reported | First-principles from serving physics; consistent with Reiner Pope transcript framing |
| Decode is often memory-bandwidth-bound at moderate batch | Derived | Follows from weight-fetch and KV-fetch analysis at typical production batch sizes |
| Prefill is parallel and can be compute-bound | Public | SARATHI (Agrawal et al. 2023), Orca (Yu et al. 2022) |
| Output/input price ratio reflects prefill/decode cost asymmetry | Inferred | Observed across providers; consistent with serving physics but includes margin and product strategy |
| KV cache sizing formula | Derived | Standard transformer architecture formula |
| KV fragmentation causes p99 drift | Reported | Operator proof box 4 (public blog post) |
| KV eviction cascades in mixed short/long workloads | Reported | Operator proof box 6 (public blog post) |
| Prompt cache break-even formula | Derived | From provider-documented write/read pricing |
| Provider cache semantics (TTL, write/read pricing) | Public | Provider docs; exact values in pricing snapshot appendix |
| Prefix caching creates measurement bias | Reported | Operator proof box 10 (public research post) |
| Long context roughly halves concurrent capacity per doubling | Derived | From KV cache sizing formula |
| MoE reduces active compute but adds all-to-all communication | Public, Derived | DeepSeek-V3 report, NVIDIA NVL72 docs, adapted for serving |
| Network DPI can 10x inference latency | Reported | Operator proof box 9 (public blog post) |
| Speculative decoding can be net negative under production batch | Reported | Operator proof box 3 (public blog post) |
| Quantization quality regression in long tail | Reported | Operator proof box 5 (public blog post) |
| Streaming without backpressure is a memory leak | Reported | Operator proof box 8 (public blog post) |
| Hard latency budgets expose what averages hide | Reported | Operator proof box 1 (public blog post) |
| T4 thermal throttling at edge | Reported | Operator proof box 2 (public blog post) |
| Goodput is accepted work per second under SLO | Derived | From SLO definition and standard production metrics |
| GPU utilization can be high while goodput is low | Derived, Reported | First-principles; consistent with multiple operator observations |
| Round-robin routing destroys cache locality | Derived | From prefix cache mechanism; consistent with provider engineering posts |
| Disaggregated P/D requires maturity gates | Public, Opinion | vLLM experimental docs, NVIDIA Dynamo docs; readiness conditions are author’s synthesis |
| Worked examples (7B on A100, KV sizing, cache break-even, goodput comparison) | Synthetic | Illustrative numbers derived from public hardware specs and provider pricing structures |
| Disaggregated P/D sharding: TP=4 prefill, EP=16 decode | Public | Perplexity, “Hosting Qwen on Blackwell,” 2026 |
| MXFP8 works in production; MXFP4 without QAT does not | Public | Perplexity, “Hosting Qwen on Blackwell,” 2026 |
| Speculative decoding positive result with MTP layers | Public | Perplexity, “Hosting Qwen on Blackwell,” 2026 |
| GPU Allocation Utilization commonly 10-20%, majority <70% at peak | Reported | Modal citing State of AI Infrastructure at Scale, 2024 |
| Cold start reduced from tens of minutes to ~50 seconds | Public | Modal, “Truly Serverless GPUs,” 2026 |

All worked examples use synthetic numbers shaped by real hardware specs and provider billing grammar. No numbers are attributed to any specific employer, customer, or deployment. Where operator proof boxes are used, they reference published public blog posts with approved anonymized wording.