Part 1 defined the economic unit: loaded cost per accepted result, not token price. Part 2 explains why the numbers come out the way they do. The mechanisms here (roofline reasoning, batch amortization, KV cache pressure, prefill/decode asymmetry, prompt caching, and productive capacity) are the vocabulary that Parts 3 through 5 depend on.
A reader who skips Part 2 can still use LCPR and the calculator. But they will not know why cache hit rate changes cost, why output tokens cost more than input tokens, why batch size sets the latency-cost frontier, or why “GPU utilization” can be high while productive work is low. The physics explains the economics.
Chapter 4: The Hardware Cost Floor
Field Problem
The benchmark says 4,000 tokens per second. Production delivers 800. The 7B model is deployed on an A100 80GB, same hardware the benchmark used. The first hypothesis is misconfiguration. The second is that the benchmark was fraudulent. Both are wrong. The benchmark ran at batch 256 with synthetic short prompts and uniform 50-token outputs. Production runs at batch 4-8 with mixed prompt lengths, p99 latency constraints, and quality gates that reject 10% of outputs.
The benchmark measured peak throughput at a point on the roofline that production never reaches. The gap is not a bug. It is the shape of inference physics.
Mechanism
The roofline model comes from high-performance computing. For any computation, performance is bounded by two ceilings: compute throughput and memory bandwidth. The computation cannot exceed either ceiling. Where the workload sits relative to these ceilings determines what limits it.
For a single decode step in a dense transformer:
T_step ≥ max(W_active / BW_eff + N_batch × M_kv / BW_eff, N_batch × O_tok / F_eff)
Where:
| Symbol | Unit | Meaning |
|---|---|---|
| N_batch | sequences | Active decode batch size |
| W_active | bytes | Active weight bytes read per step |
| BW_eff | bytes/s | Effective sustained memory bandwidth |
| F_eff | ops/s | Effective sustained compute throughput |
| O_tok | ops/token | Operations per generated token |
| M_kv | bytes/token | KV bytes read per token at current context |
Two things to notice. First, this is a max(), not a sum. The bottleneck is
whichever ceiling is lower for the current workload. Second, W_active appears
once regardless of batch size: weight bytes are read once per step and
shared across the batch. N_batch × M_kv grows with batch because each live sequence has
its own KV state.
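The bound above can be sketched in a few lines of Python. This is a minimal illustration, not a performance model: the function name and argument order are made up for this example, and `m_kv` is treated as the KV bytes read per sequence per step, which is one reading of the M_kv definition. The inputs are hypothetical round numbers for a 7B-class model, with KV traffic set to zero to isolate the weight-fetch term.

```python
def decode_step_time(n_batch, w_active, m_kv, o_tok, bw_eff, f_eff):
    """Lower bound on one decode step (seconds) from the max() bound.

    m_kv is assumed to be KV bytes read per sequence per step.
    Returns (seconds, which ceiling binds).
    """
    memory_time = w_active / bw_eff + n_batch * m_kv / bw_eff
    compute_time = n_batch * o_tok / f_eff
    regime = "memory" if memory_time >= compute_time else "compute"
    return max(memory_time, compute_time), regime


# Hypothetical inputs: 14 GB weights, 1.8 TB/s, 14 GFLOPs/token, 250 TFLOPS.
for n in (1, 8, 256):
    t, regime = decode_step_time(n, 14e9, 0.0, 14e9, 1.8e12, 250e12)
    print(f"batch={n:4d} step={t * 1e3:6.2f} ms bound={regime}")
```

With these inputs the weight read is paid once per step, so the step time stays flat at small batch and only starts growing once the compute term overtakes it.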
Naive Answer
“My GPU has 312 TFLOPS. My model does X FLOPs per token. Therefore throughput is 312T / X.”
This ignores memory bandwidth entirely. Most production decode serving is not compute-bound. At the batch sizes where latency SLOs are met, each step spends more time reading weights and KV state from HBM than computing on them. The GPU’s compute units often wait for data.
Better Model
Inference serving lives in three regimes:
Memory-bandwidth-bound (small batch). Weight fetch dominates. Each decode step reads the full active weights from HBM. Cost per token is high because few sequences share the weight read. This is where most latency-sensitive serving operates.
Transitional (medium batch). Weight fetch is partially amortized. KV fetch is growing. Neither compute nor bandwidth clearly dominates. This is often the operating sweet spot under SLO constraints.
Compute-bound (large batch). Weight fetch is fully amortized over the batch. Compute per token becomes the floor. Cost per token stops falling. But latency rises because more tokens compete for compute and KV memory.
The roofline does not tell you which regime is “best.” The SLO, workload shape, and quality gate determine where on the roofline you can operate.
Worked Example
Setup: Dense 7B model, FP16 weights (14 GB), A100 80GB SXM.
| Parameter | Value | Source |
|---|---|---|
| W_active | 14 GB | Model config × 2 bytes/param |
| BW_eff | 1,800 GB/s | Sustained, ~88% of peak 2,039 GB/s |
| F_eff | 250 TFLOPS | Sustained FP16, ~80% of peak 312 |
| O_tok | ~14 GFLOPs/token | 2 × params for decode (approximate) |
| M_kv | ~0.5 MB/token at 2K context | Model-specific |
| c_accel | $0.00044/s | ~$1.59/hr on-demand |
| p_pass | 0.90 | 90% accepted |
| Batch | Bottleneck | Raw cost/MTok | Accepted cost/MTok |
|---|---|---|---|
| 1 | Memory (weight fetch) | ~$3.44 | ~$3.82 |
| 8 | Memory (amortizing) | ~$0.49 | ~$0.54 |
| 32 | Transitional | ~$0.19 | ~$0.21 |
| 128 | Compute-bound | ~$0.13 | ~$0.14 |
| 256 | KV pressure rising | ~$0.13 | ~$0.14 (if SLO holds) |
All numbers are illustrative. Actual throughput depends on scheduler, quantization, tensor parallelism, prompt/output mix, and prefix caching.
Cost falls about 27x from batch 1 to batch 128 in this worked example, then stops falling. The benchmark's 4,000 tok/s came from batch 256+. Production at batch 4-8 lives in the expensive part of the curve, by design: that is where latency SLOs are met.
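The conversion from step time to dollars can be sketched as follows. The helper name is hypothetical, and the example reproduces only the batch-1 row, where weight fetch is the whole story; the larger-batch rows also depend on KV traffic and scheduler behavior that the table does not itemize, so this sketch does not attempt them.

```python
def cost_per_mtok(step_time_s, n_batch, c_accel_per_s, p_pass=1.0):
    """Dollars per million accepted output tokens at one decode operating point."""
    tokens_per_s = n_batch / step_time_s       # one token per sequence per step
    raw = c_accel_per_s / tokens_per_s * 1e6   # $/MTok before the quality gate
    return raw / p_pass                        # rejected outputs still used GPU time


c_accel = 1.59 / 3600        # $/s, from the ~$1.59/hr figure
step_b1 = 14e9 / 1.8e12      # batch 1: weight fetch only, KV traffic ignored
print(round(cost_per_mtok(step_b1, 1, c_accel), 2))        # raw
print(round(cost_per_mtok(step_b1, 1, c_accel, 0.90), 2))  # accepted
```

Dividing by p_pass is the step that turns a raw token cost into an accepted-result cost: the GPU time spent on rejected outputs is still on the bill.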
A real-time conversational system with a 1,500ms end-to-end budget allocates roughly half to model inference (300-800ms depending on output length). When p99 inference latency drifts 5-10% per hour, the drift is invisible against a stable mean for a long time: a 600ms p99 climbing to 720ms still passes most aggregated dashboards. On one workload I helped diagnose, the drift went unnoticed for about 36 hours because the alarms were thresholded on p99 absolute, not p99 trend. The fix took ten minutes (pin the KV pool refresh interval), but the detection took a day and a half. Average and even p99-absolute can mask compounding behavior in a stateful serving system. Add trend-based alarms on p99 and KV utilization, not just absolute thresholds.
The Batch Size Frontier
The roofline sets the physics. The batch policy determines the economics, and it is often invisible in the token price.
A platform engineer pitches a switch: "this serverless endpoint is 3x cheaper per token." The team migrates and the bill drops. p99 latency doubles. Support tickets about "slow AI" rise. The mistake was comparing token prices without understanding that the cheap endpoint achieves low cost by running large batches, trading latency for throughput.
This is not fraud. This is the batch size frontier: every serving system makes a choice about where on the cost-latency curve to operate, and that choice is invisible in the token price.
From the roofline, weight fetch (W_active) amortizes across the batch: each additional sequence
shares the same weight read. KV fetch (N_batch × M_kv) does not amortize: it grows with
batch because each sequence brings its own context. After weight fetch
is amortized, compute becomes the floor.
The consequence: a provider running at large batch gets cheaper cost per token. A provider running at small batch gets lower latency. The token price reflects the batch policy, not just the hardware or model.
The bus-versus-taxi analogy fits: shared vehicles are cheaper per passenger-mile, but you wait for departure and stops; the taxi is expensive but leaves on your timing. The question is not which is cheaper. It is which serves the passenger's time constraint.
“The cheap provider is just more efficient.”
Maybe. But if the cheap provider achieves 3x better cost by running at 4x larger batch, the latency tradeoff is baked in. The provider’s efficiency is real, but it is not free. The cost was paid in latency and potentially in tail-latency variance.
The batch size frontier has three zones:
Zone 1: Weight-bound. Small batch. Weight fetch dominates. Cost per token is high. Latency per token is low (few tokens compete for resources). This is where real-time voice, interactive chat with strict TTFT requirements, and low-concurrency dedicated endpoints operate.
Zone 2: Amortization sweet spot. Medium batch. Weight fetch is amortized. KV fetch is growing but manageable. Compute has not yet become the floor. This is where most production serverless APIs operate. The token price reflects partial amortization.
Zone 3: Compute floor. Large batch. Weight fetch is negligible. Compute per token is the binding constraint. Cost per token stops falling. But TPOT rises because decode steps take longer when the batch is large. TTFT can rise sharply if prefill requests queue behind a large decode batch. This is where offline batch processing and some throughput-optimized serverless tiers operate.
The provider’s published token price reflects which zone they target. A low token price usually means Zone 2 or 3. A high token price with tight latency guarantees usually means Zone 1. Neither is wrong. The question is which zone fits the workload’s latency and quality requirements.
Using the same 7B model:
| Zone | Batch | Cost/MTok (accepted) | p99 TPOT | Use case fit |
|---|---|---|---|---|
| Weight-bound | 1-4 | $1.90-$3.82 | 8-32ms | Voice, real-time |
| Amortization | 8-64 | $0.14-$0.54 | 32-80ms | Interactive chat |
| Compute floor | 128-512 | $0.13-$0.14 | 80-200ms+ | Batch, offline |
Illustrative numbers; exact figures depend on hardware, model, scheduler, and SLO.
The cost difference between Zone 1 and Zone 3 is ~27x. That gap explains why “same model, different provider, different price” is not necessarily unfair. It may reflect different batch policy choices.
The roofline determines where your workload sits on the cost-latency frontier. Do not plan capacity from peak throughput. Plan from the batch size where your p99 latency starts violating your SLO. Calculate cost at that operating point.
When comparing providers or deployment modes, ask: at what batch size does this system operate under my traffic load, and does the resulting latency meet my SLO? The cheapest provider that violates your latency SLO is not cheaper. It is more expensive, because the latency failures create retries, timeouts, or customer dissatisfaction that increase LCPR.
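Picking the operating point from a batch sweep can be sketched like this. The `SweepPoint` type and the sweep values are hypothetical, loosely shaped like the zone table; real inputs would come from your own measured sweep.

```python
from dataclasses import dataclass


@dataclass
class SweepPoint:
    batch: int
    p99_tpot_ms: float
    cost_per_mtok: float   # accepted cost measured at this batch size


def operating_point(sweep, slo_p99_tpot_ms):
    """Cheapest sweep point whose p99 TPOT still meets the SLO, or None."""
    feasible = [p for p in sweep if p.p99_tpot_ms <= slo_p99_tpot_ms]
    return min(feasible, key=lambda p: p.cost_per_mtok) if feasible else None


# Hypothetical sweep results.
sweep = [
    SweepPoint(1, 8, 3.82),
    SweepPoint(8, 32, 0.54),
    SweepPoint(64, 80, 0.14),
    SweepPoint(256, 200, 0.14),
]
print(operating_point(sweep, slo_p99_tpot_ms=40))
```

The filter comes before the minimum on purpose: a point that violates the SLO is excluded no matter how cheap it looks.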
What To Measure
Sustained memory bandwidth for your GPU/engine (not peak spec)
Active decode batch size under production traffic
TTFT and TPOT distribution at each batch size in a sweep
The batch size where p99 latency crosses your SLO threshold
Cost per accepted output token at the operating batch size
Latency variance (p50 vs p99) as traffic changes batch size
The roofline’s max() is a teaching model. Real serving kernels overlap compute and memory to varying degrees. The roofline is a bound, not a prediction.
Quantization, tensor parallelism, speculative decoding, and MoE change the effective arithmetic intensity and shift the transition point.
Prefill and decode have different roofline shapes. This derivation covers decode only. Prefill can be compute-bound at much smaller batch sizes.
KV memory capacity can force smaller batches even when the roofline says larger batches would be cheaper.
Continuous batching complicates the picture. Modern serving engines do not run fixed batches. They add and remove sequences dynamically at each decode step. The “batch size” is a running average, not a constant.
Multi-model routing can mean different workloads see different effective batch sizes on the same hardware.
Queue time can dominate latency at high utilization, independent of batch size.
Calculator Hook
The serving physics analysis takes hardware cost, measured bandwidth/compute, weight bytes, SLO constraints, and traffic rate. It outputs the cost curve by batch size, the memory vs compute regime boundary, the operating zone for your traffic, and infeasible batch points.
Chapter 5: Prefill, Decode, and Why Output Tokens Cost More
Field Problem
The prompt is long: 8,000 tokens of retrieved context plus a 500-token system prompt. The output is short, typically 150-300 tokens. The team running this RAG pipeline sees that output tokens cost 3-5x more than input tokens on most provider pricing pages. They assume this is provider markup. It is not markup. It is physics.
Mechanism
Inference has two phases:
Prefill processes the entire input prompt in one forward pass (or chunked passes). All input tokens are processed in parallel. Prefill can saturate the GPU’s compute resources at relatively small batch sizes because the parallelism comes from the prompt length, not the number of concurrent sequences. Prefill is often compute-bound for long prompts.
Decode generates output tokens one at a time, autoregressively. Each decode step produces one token per sequence. The next token depends on all previous tokens through the KV cache. Decode is sequential. At moderate batch sizes, decode is often memory-bandwidth-bound because each step reads the model weights and KV state but produces relatively little compute work per step.
The economic consequence is that output tokens are structurally more expensive than input tokens. A rough cost ratio for a dense transformer is:
cost_out / cost_in ≈ (FLOPs_per_decode_token / FLOPs_per_prefill_token) × (1 / batch_amortization)
Provider pricing reflects this asymmetry. When a provider charges 3x for output tokens, they are not adding 3x margin. They are passing through the structural cost difference between prefill and decode.
Naive Answer
“Output tokens cost more because providers want higher margins on generation.”
Providers do have pricing strategy. But the output/input price ratio (typically 2-6x on major API providers as of the May 2026 pricing snapshot) is broadly consistent across providers with very different business models. The ratio varies: some providers price output at 3x input, others at 4x or 6x, and some serverless open-model endpoints use symmetric pricing. The ratio persists across most providers because it approximates the structural cost difference between prefill and decode, though product strategy and market positioning also influence it.
Better Model
Decompose the user-visible latency:
T_user = TTFT + N_out × TPOT
TTFT depends mostly on prefill. Long prompts mean longer TTFT. TPOT depends on decode. Each output token adds one decode step. Output-heavy workloads (chatbots generating long answers, code agents producing files) spend most of their wall time and GPU time in decode.
For economics: input tokens have low marginal cost when prefix-cached (the KV is served from memory, prefill is skipped). Output tokens always pay the decode cost. Cached input tokens can be 90% cheaper than uncached. Output tokens cannot be cached in the same way—they do not exist until generated.
The phase asymmetry is not academic. Operators serving large MoE models increasingly deploy separate hardware for each phase: one search infrastructure provider uses tensor-parallel prefillers (TP=4) because prefill is compute-bound, and data-parallel decoders spread across up to 16 GPUs (EP=16) because decode is memory-bandwidth-bound and benefits from distributing requests across more devices. The sharding strategy differs because the bottleneck differs (Perplexity, "Hosting Qwen on Blackwell," 2026).
With 50 concurrent streaming requests and slow-consuming clients, the inference serving actor’s memory grew linearly until OOM. Bounded output buffers—a simple asyncio queue with a maximum size—prevented the cascade. Any streaming inference deployment without backpressure is a memory leak waiting for slow clients.
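The bounded-buffer idea can be sketched with `asyncio.Queue(maxsize=...)`. Everything here is a stand-in: `generate` plays the role of the decode loop, `slow_client` plays the role of a slow network consumer, and the buffer size of 16 is arbitrary. The point is only that `put()` suspends the producer once the queue is full, so buffered output cannot grow without limit.

```python
import asyncio


async def generate(queue, n_tokens):
    """Stand-in for the decode loop: put() suspends when the buffer is full."""
    for i in range(n_tokens):
        await queue.put(f"tok{i}")
    await queue.put(None)            # sentinel: stream finished


async def slow_client(queue, delay_s):
    """Consumer that reads more slowly than the producer can generate."""
    received, peak = 0, 0
    while True:
        peak = max(peak, queue.qsize())
        token = await queue.get()
        if token is None:
            return received, peak
        received += 1
        await asyncio.sleep(delay_s)


async def stream(n_tokens=100, max_buffered=16):
    queue = asyncio.Queue(maxsize=max_buffered)   # the bound is the backpressure
    producer = asyncio.create_task(generate(queue, n_tokens))
    received, peak = await slow_client(queue, delay_s=0.001)
    await producer
    return received, peak


received, peak = asyncio.run(stream())
print(received, peak)
```

Blocking the producer is the simplest policy. A real serving engine might prefer to pause or cancel the request instead, since a stalled sequence still holds its KV memory.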
Worked Example
RAG pipeline: 8,500 input tokens, 250 output tokens, Anthropic Claude Sonnet.
Using illustrative rates (exact rates in pricing snapshot):
| Component | Tokens | Rate | Cost |
|---|---|---|---|
| Input (uncached) | 3,000 | p_in | 3,000 × p_in |
| Input (cached) | 5,500 | 0.1 × p_in | 550 × p_in equivalent |
| Output | 250 | 5 × p_in | 1,250 × p_in equivalent |
Even with 65% of input tokens cached, the 250 output tokens (about 3% of all tokens) account for roughly 26% of the total request cost (1,250 of 4,800 p_in-equivalent units). Output tokens are a small minority of the token count but a large share of the bill.
Exact multipliers vary by provider and model; the ratio shape is what matters here.
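The arithmetic behind the table can be checked with a short sketch. The function name is hypothetical, and the multipliers (0.1× for cached reads, 5× for output) are the illustrative rates from the table, not quoted prices.

```python
def request_cost_units(uncached_in, cached_in, out, read_mult=0.1, out_mult=5.0):
    """Cost of one request in units of p_in, plus each component's share."""
    parts = {
        "input_uncached": uncached_in * 1.0,
        "input_cached": cached_in * read_mult,
        "output": out * out_mult,
    }
    total = sum(parts.values())
    return total, {name: cost / total for name, cost in parts.items()}


total, share = request_cost_units(3_000, 5_500, 250)
print(round(total), {name: round(s, 3) for name, s in share.items()})
```

Changing `out` or `read_mult` in this sketch is a quick way to see which lever, output length or cache hit rate, moves the total more for a given workload shape.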
If output tokens dominate your workload, token price is not the primary lever. Reducing output length—through better prompts, structured output, or stopping criteria—can improve both latency and cost more than switching providers. If input tokens dominate and they are cacheable, prompt caching is the primary lever.
What To Measure
Input/output token ratio per workload
p50 and p95 output token length distribution
TTFT breakdown (is prefill or queue time dominant?)
TPOT under production concurrency
Cache hit rate on input tokens
Chunked prefill can interleave prefill and decode work, blurring the phase boundary. TTFT improves but decode slots may compete with prefill chunks.
Speculative decoding generates multiple draft tokens per step, partially amortizing decode overhead. When it works.
Batch APIs process work asynchronously with different pricing and no TTFT constraint. The prefill/decode cost asymmetry still exists but is absorbed into a flat batch discount.
Multimodal inputs (images, audio) have different tokenization and prefill characteristics than text.
Calculator Hook
The LCPR calculator separates input and output token costs and shows the prefill/decode cost split. The sensitivity analysis reveals whether reducing output length or increasing cache hit rate has more impact on unit cost.
Chapter 6: Memory Economics
Field Problem
Same model, same hardware, same provider tier. The only change: conversations scale from 2K context to 32K-context document analysis. Throughput drops 75%. Cost per request quadruples. The provider has not changed anything. The KV cache ate the memory.
Mechanism
Every active sequence stores key and value tensors for every token it has seen—prompt tokens plus generated tokens—across every layer of the model. This is the KV cache. It is the largest dynamic memory allocation in inference serving.
| Symbol | Meaning |
|---|---|
| 2 | Key + Value |
| N_layers | Transformer layers |
| H_kv | Key/value heads (fewer than query heads with GQA/MQA) |
| D_head | Head dimension |
| E_kv | Bytes per element (FP16=2, FP8=1) |
The total KV memory for all live sequences:
M_total = 2 × N_layers × H_kv × D_head × E_kv × N_seq × T
Where N_seq is live sequences and T is prompt + generated tokens per
sequence.
Available KV memory:
M_avail = M_hbm − M_weights − M_overhead
Maximum concurrent sequences:
N_seq_max = M_avail / (2 × N_layers × H_kv × D_head × E_kv × T)
Naive Answer
“Context length is a model setting. Longer context is better.”
Context length is a memory allocation. Every token of context consumes KV memory on every layer. Longer context means fewer concurrent sequences. Fewer concurrent sequences means lower throughput. Lower throughput at the same hardware cost means higher cost per token.
Better Model
KV cache creates two pressures:
Capacity pressure: Can the KV for all concurrent sequences fit in HBM? If not, sequences get preempted (evicted and later recomputed) or requests queue.
Bandwidth pressure: During each decode step, the attention mechanism reads KV state for every resident token. More tokens, more bytes read, more bandwidth consumed.
Both pressures increase with context length. But capacity pressure hits first and harder because it determines whether the sequence can be served at all, not just how fast.
Worked Example
Llama-3 8B (32 layers, 8 GQA KV heads, 128 dim) on A100 80GB:
Model weights (FP16): ~16 GB. Runtime overhead: ~4 GB. KV pool: ~60 GB.
| Context length | Max live sequences | Concurrent capacity |
|---|---|---|
| 2K tokens | ~234 | Plenty of headroom |
| 8K tokens | ~58 | Comfortable for most workloads |
| 32K tokens | ~14 | Tight; long requests evict others |
| 128K tokens | ~3 | Only 3 concurrent conversations |
With FP8 KV quantization (E_kv = 1 byte): all numbers double. FP8 KV is an economic lever, not just a precision choice. Operators report that microscaling FP8 (MXFP8) outperforms block-scaled FP8 on Blackwell, and that 4-bit MXFP4 without quantization-aware training degrades accuracy enough to disqualify it for most production workloads. Each step down in precision doubles capacity but narrows the workloads that tolerate the quality loss (Perplexity, "Hosting Qwen on Blackwell," 2026).
Illustrative—actual capacity depends on block allocator, runtime, prefix cache, and scheduler.
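The capacity formula can be sketched directly from the model config. The function names are made up for this example, and the memory figures (80 GB HBM, 16 GB weights, 4 GB overhead) are the rounded values from the worked example. The results land within a few percent of the table, which appears to round 128 KB and the context lengths in decimal units.

```python
def kv_bytes_per_token(n_layers, h_kv, d_head, e_kv):
    """KV bytes stored per token: key + value, across all layers."""
    return 2 * n_layers * h_kv * d_head * e_kv


def max_live_sequences(m_hbm, m_weights, m_overhead, per_token_bytes, context_tokens):
    """N_seq_max = M_avail / (KV bytes per token * T), rounded down."""
    m_avail = m_hbm - m_weights - m_overhead
    return int(m_avail // (per_token_bytes * context_tokens))


per_tok = kv_bytes_per_token(n_layers=32, h_kv=8, d_head=128, e_kv=2)  # FP16 KV
for ctx in (2_000, 8_000, 32_000, 128_000):
    fp16 = max_live_sequences(80e9, 16e9, 4e9, per_tok, ctx)
    fp8 = max_live_sequences(80e9, 16e9, 4e9, per_tok // 2, ctx)        # FP8 KV
    print(f"{ctx:>7} tokens: FP16 KV ~{fp16} sequences, FP8 KV ~{fp8}")
```

This is the formula's upper bound only. It ignores block granularity, fragmentation, and prefix sharing, all of which move the real number.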
A B2C real-time voice deployment (conversational AI customer support, 70B model on a dedicated pool serving roughly 240 concurrent agent sessions at peak) watched p99 inference latency drift from 920ms to 1,480ms over a 36-hour window. Mean and median were flat. The drift was confined to the tail.
The hypothesis chain ran for two days. We suspected a noisy-neighbor effect on a shared host (disproven: dedicated pool, single tenant). We suspected weight-paging pressure from a recent context-window bump (disproven: HBM headroom was 14GB at the worst point). We suspected scheduler starvation from long agent sessions (closer, but not the root).
The actual cause surfaced when we instrumented block-allocator residency: long-running agent threads were holding partial-prefix cache lines that newer sessions could not evict cleanly. The block allocator was leaving small unusable holes between live KV regions. Each new session arrived to a more fragmented pool than the last, and prefill had to chase progressively further to find contiguous blocks. Nothing was leaking; the pool was just deteriorating in shape.
The fix was unglamorous: a server-side restart every 18 hours dropped p99 back to baseline. Each restart cost about 7 minutes of unavailability (drained, restarted, warmed). We ran the numbers against SLO-credit exposure from a sustained 1,480ms tail and the math went the right way: scheduled restarts were cheaper than letting drift accumulate. Restart-as-policy is not elegant, but it is sometimes the right call when fragmentation is sub-quadratic in time and the workload is bursty enough that a 7-minute drain hits during a natural lull.
Side-finding that did not fit the runbook: a few weeks later we moved long-running agent traffic to a separate pool with shorter session TTLs, and the drift on the main pool disappeared entirely. The drift was a workload-mix problem more than a fragmentation problem. The two failure modes (long-session KV fragmentation, mixed-workload eviction pressure) look similar from p99 telemetry; only the pool-segmentation experiment disambiguated them. The restart cadence on the segmented pools dropped to once a week, and we eventually retired it.
KV fragmentation is real. Paged attention managers (PagedAttention, block allocators in TensorRT-LLM) reduce waste, but allocation overhead and fragmentation still accumulate during long runtime. The theoretical max from the formula above is an upper bound, not an operating target: expect 20-35% effective capacity degradation under sustained mixed traffic before fragmentation forces a restart or a workload-segmentation decision, not a 16x cliff.
In an enterprise RAG deployment, mixing short-context and long-context requests on the same GPU pool caused KV cache eviction cascades. The monitoring signal was cache usage near 100%, preemptions spiking, and running requests dropping simultaneously. The fix was workload-aware routing: short context to one pool, long context to another. Same hardware, different routing, different economics.
Long Context and the Memory Wall
The KV cache formula above changes when context length grows from conversation-scale to document-scale.
Each document is 50-100 pages, tokenized to 30,000-80,000 tokens. The model supports 128K context, so the document analysis product sends the full text. Longer context does mean better analysis. But it also means: each request consumes 5-10x more KV memory, concurrent requests drop from 40 to 4 on the same hardware, TTFT increases because prefill scales with prompt length, and cost per request is 5-10x higher even if the token price is the same.
Long context is a memory product. The team bought it without understanding the bill.
KV memory scales linearly with context length:
M_total ∝ T
Doubling context length T doubles KV memory and halves the number of concurrent sequences at the same HBM capacity. But the impact is worse than 2x because:
Prefill time scales with prompt length. Longer prompts mean longer prefill, which means higher TTFT. For compute-bound prefill on long prompts, TTFT can scale superlinearly with prompt length because attention compute is quadratic in sequence length (FlashAttention and variants cut memory traffic and constant factors, but the attention FLOPs remain quadratic).
KV bandwidth scales with context. Each decode step reads KV for all resident tokens. Longer context means more bytes read per step, which means each decode step takes longer, which means TPOT can increase.
Cache effectiveness drops. If each request has a unique 60K-token document as its prefix, prefix caching provides no benefit—every request has a different prefix. Long-context workloads are often low-reuse workloads.
Context-tier pricing. Some providers charge more for long-context requests. This pricing reflects the serving cost: more memory consumed for longer duration, less concurrency available for other requests.
“Long context is just a better model. It understands more.”
Long context does enable better analysis of long documents. But it changes the economics. A workload that sends 80K tokens per request at batch 4 can cost 10x more per request than the same workload chunked into 8 requests of 10K tokens—and chunking may produce comparable quality for retrieval and summarization tasks, depending on the task.
Long context changes three economic variables simultaneously:
Capacity cost: More memory per request → fewer concurrent requests → higher cost per request at the same hardware budget.
Latency cost: Longer prefill → higher TTFT → risk of violating TTFT SLO. More KV to read per decode step → higher TPOT.
Quality return: Some tasks genuinely benefit from the model seeing the full document. Others work equally well with retrieval + shorter context. The right approach depends on the task, not the context window.
The decision is not “long or short context.” It is: does the quality improvement from longer context justify the capacity, latency, and cost increase?
Llama-3 8B on A100 80GB (continuing from above):
KV pool: ~60 GB. kv_bytes_per_token: 128 KB.
| Workload shape | Context | Concurrent | TTFT impact | Cost per request |
|---|---|---|---|---|
| Support chat | 2K | ~234 | Low | Low |
| RAG answer | 8K | ~58 | Moderate | Moderate |
| Document QA | 32K | ~14 | High | ~4x support chat |
| Full doc analysis | 128K | ~3 | Very high | ~16x support chat |
Illustrative throughput; queueing behavior dominates at the upper bound.
At 128K context, a single A100 can serve 3 concurrent requests. Assuming each request takes 30 seconds and the GPU runs at 65-80% effective utilization (typical for bursty long-context traffic, not the theoretical 100%), the GPU handles roughly 230-290 completed requests per hour. At $1.59/hr, that puts the GPU cost per request at roughly $0.0055-$0.0068, before quality gate, retries, and overhead. The naive $0.004 number assumes perfect packing and zero queueing slack, which long-context workloads almost never sustain. Any traffic spike beyond 3 concurrent requests causes queuing, preemption, or failure.
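The per-request arithmetic can be sketched as below. The function name is hypothetical; the inputs (3 concurrent slots, 30-second requests, $1.59/hr) are the ones used in the paragraph above.

```python
def gpu_cost_per_request(concurrent, request_s, utilization, gpu_cost_per_hr):
    """Completed requests per hour and GPU dollars per completed request."""
    requests_per_hr = concurrent * (3600 / request_s) * utilization
    return requests_per_hr, gpu_cost_per_hr / requests_per_hr


for util in (1.00, 0.80, 0.65):
    rph, cost = gpu_cost_per_request(3, 30, util, 1.59)
    print(f"utilization={util:.0%} requests/hr={rph:.0f} cost/request=${cost:.4f}")
```

The 100% row is the naive figure. The two lower rows show how much of the per-request cost comes from slack that long-context traffic rarely avoids.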
Before choosing long context: measure whether the quality improvement justifies the cost. Run the task at multiple context lengths. If 8K context with good retrieval produces 90% of the quality of 64K full-context, the 4-8x cost difference should inform the architecture choice.
For workloads that genuinely need long context, budget for the memory. Price the capacity from KV requirements, not from model weight sizing.
Calculate KV bytes per token for your model. Multiply by your p90 context length and your target concurrent sequences. If the result exceeds your KV pool, you need one or more of: shorter context, FP8 KV, more GPUs, tensor parallelism, or workload-aware routing that separates short and long context.
What To Measure
gpu_cache_usage_perc (vLLM) or equivalent KV utilization metric
num_preemptions or KV eviction count
Context length distribution: p50, p90, p99
Output length distribution (output tokens also consume KV while resident)
Concurrent active sequences over time
Quality delta between chunked short-context and full long-context approaches
TTFT and TPOT by context length bucket
KV utilization under long-context traffic
Cost per accepted result by context length tier
MQA, GQA, MLA, sliding-window attention, and sparse attention all change effective H_kv or T_resident per layer.
Block/paging managers reduce waste but have allocation overhead and metadata memory.
Beam search, n-samples, tool branches, and parallel calls multiply live sequences.
CPU/host KV offload changes capacity but adds retrieval latency.
Cross-request prefix reuse saves prefill compute but still consumes KV memory for the shared prefix.
Sliding-window attention caps the effective KV per layer, reducing memory scaling at the cost of losing distant attention.
MLA and other KV compression techniques reduce per-token KV size, changing the capacity curve.
KV offload to host memory or NVMe can extend capacity but adds retrieval latency.
FlashAttention and variants reduce prefill memory footprint and memory traffic, but attention compute still grows quadratically with prompt length.
Some workloads (legal document analysis, genomics, codebase understanding) genuinely need long context and cannot chunk without quality loss. For these, long context is not optional; it is a capacity requirement.
Calculator Hook
The KV capacity analysis takes model config, KV dtype, available HBM, and your context-length distribution. It outputs max concurrent sequences by context length, memory pressure alerts when your traffic would exceed capacity, and long-context cost multipliers.
Chapter 7: Prompt Caching and Rematerialization
Field Problem
Cache hit rate starts at 85%. Over two weeks, it drops to 40%. Nobody notices because the dashboard shows “caching enabled.” The bill increases 35%. Three engineers spend a week debugging what they think is a traffic spike. The customer support agent sends a 4,000-token system prompt, 2,000 tokens of tool definitions, and the growing conversation history on each turn. The system prompt and tools are identical across requests—perfect for prompt caching. The regression was caused by a one-line change that added a timestamp to the system prompt.
Mechanism
Prompt caching avoids recomputing the KV state for a prefix that has already been processed. When a request arrives and shares an exact prefix with a previously cached request, the serving system loads the cached KV directly from memory instead of running the prefill computation. This saves both compute time (faster TTFT) and compute cost (less GPU work).
The key insight: caching is a memory hierarchy, not a billing discount. The billing discount is the visible effect. The mechanism is KV reuse across requests.
The break-even formula:
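N_breakeven = (p_write − p_in) / (p_in − p_read)
Here N_breakeven is the number of cache reads needed after the initial write: one write plus N reads costs p_write + N × p_read, against (N + 1) × p_in uncached.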
| Symbol | Meaning |
|---|---|
| p_in | Uncached input token price |
| p_write | Cache write price (can be > p_in) |
| p_read | Cache read price (< p_in) |
If p_write > p_in, the first request costs more with caching enabled. The savings
come from subsequent reads at p_read. If reuse does not arrive within TTL,
caching is a net cost.
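One way to sanity-check a break-even claim is to compare total cost directly instead of trusting a closed form. The sketch below assumes a specific accounting: the first request pays p_write in place of p_in, and each later request within the TTL pays p_read. The function name and the multipliers in the example are illustrative.

```python
def reads_to_break_even(p_in, p_write, p_read, max_reads=1_000):
    """Smallest number of cache reads after which one write plus those reads
    costs no more than serving every request uncached. None if never."""
    for n_reads in range(max_reads + 1):
        cached = p_write + n_reads * p_read    # first request writes the cache
        uncached = (n_reads + 1) * p_in        # every request pays full price
        if cached <= uncached:
            return n_reads
    return None


# Illustrative multipliers relative to p_in = 1.0.
print(reads_to_break_even(1.0, 1.25, 0.10))   # modest write premium
print(reads_to_break_even(1.0, 2.00, 0.10))   # larger write premium
print(reads_to_break_even(1.0, 1.00, 0.50))   # no write premium
```

Under this accounting a cache with no write premium breaks even immediately, and a larger premium needs more reads inside the TTL before it pays for itself.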
Naive Answer
“Turn on caching. It’s a discount.”
Caching is only a discount when:
1. The prefix is stable (no timestamps, no dynamic state before the cache breakpoint).
2. Requests reuse the prefix within the TTL.
3. The cache hit rate is above the break-even point.
When any of these conditions fails, caching can cost more than uncached serving.
Better Model
Cache economics depend on three variables: stability, reuse rate, and TTL fit.
Stability: The cache key is the exact prefix content. Any change—a different timestamp, a reordered tool list, a modified retrieval chunk, a changed model—invalidates the cache. The cache-eligible prefix must be fixed across requests.
Reuse rate: How many requests share the same prefix within the TTL window? A customer support agent handling 100 concurrent conversations with the same system prompt has high reuse. A personalized recommendation system with per-user context has low reuse.
TTL fit: Does the inter-request gap fit within the cache retention window? If requests arrive every 10 minutes but the TTL is 5 minutes, most requests miss.
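TTL fit can be estimated before any traffic is sent. The sketch below rests on two assumptions that are not from the text: gaps between requests for the same prefix are exponentially distributed (Poisson arrivals), and every request either refreshes the cache entry on a hit or rewrites it on a miss. Under those assumptions a request hits exactly when the gap since the previous request is within the TTL.

```python
import math


def expected_hit_rate(ttl_s: float, mean_gap_s: float) -> float:
    """P(gap <= TTL) for exponentially distributed inter-request gaps."""
    if ttl_s < 0 or mean_gap_s <= 0:
        raise ValueError("ttl_s must be >= 0 and mean_gap_s must be > 0")
    return 1.0 - math.exp(-ttl_s / mean_gap_s)


# Requests every 10 minutes on average, against a 5-minute and a 1-hour TTL.
short_ttl = expected_hit_rate(ttl_s=300, mean_gap_s=600)
long_ttl = expected_hit_rate(ttl_s=3600, mean_gap_s=600)
print(f"5-minute TTL: {short_ttl:.0%} hits, 1-hour TTL: {long_ttl:.0%} hits")
```

With a 10-minute mean gap, a 5-minute TTL yields roughly 39% hits under this model, which matches the "most requests miss" case above. Real traffic is burstier than Poisson, so the measured gap distribution should replace the exponential assumption once it is available.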
Worked Example
Anthropic cache economics (illustrative multipliers from public docs):
5-minute TTL: p_write = 1.25 × p_in, p_read = 0.10 × p_in
1-hour TTL: p_write = 2.00 × p_in, p_read = 0.10 × p_in
OpenAI cache economics:
p_write ≈ p_in, p_read = 0.50 × p_in
OpenAI’s cache is cheaper to enter but saves less per hit (50% vs 90% reduction). Anthropic’s cache costs more to write but saves more per hit. The right choice depends on reuse rate and TTL fit.
Google adds a third dimension: explicit context caches charge storage rent per token-hour. This adds a time cost that grows linearly with TTL. For rarely reused prefixes with long TTL, storage rent can exceed the savings.
Exact multipliers are in the pricing snapshot appendix; the direction and rough magnitude are stable across providers.
When prefix caching is active, any telemetry that depends on recomputation—activation capture, token-level latency profiling, coverage metrics—systematically under-samples repeated prefixes. The prefix is served from cache, not recomputed, so the measurement never fires. A 99.5% capture rate can still be biased if the missing samples correlate with workload intensity or prompt structure.
Treat cache hit rate as an operational SLO, not a billing optimization. Monitor it; alert on regressions; when it drops, investigate what changed in the prefix: a timestamp, a tool schema update, a retrieval ordering change, a model version bump. Most cache killers are not malicious. They are engineering changes made by people who do not know the prompt structure matters for caching.
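One way to make prefix stability observable is to fingerprint the cache-eligible prefix and alert when the fingerprint changes. The sketch below is a monitoring aid under stated assumptions, not any provider's actual cache key: `build_prompt` and `prefix_fingerprint` are hypothetical helpers, and sorting the tool definitions is one possible way to keep their order deterministic.

```python
import hashlib


def build_prompt(system_prompt: str, tool_defs: list[str], dynamic_context: str) -> tuple[str, str]:
    """Return (static_prefix, full_prompt), keeping all dynamic content after the prefix."""
    # Sort tool definitions so an incidental reordering does not change the prefix.
    static_prefix = system_prompt + "\n" + "\n".join(sorted(tool_defs))
    return static_prefix, static_prefix + "\n" + dynamic_context


def prefix_fingerprint(static_prefix: str) -> str:
    """Short, stable fingerprint of the cache-eligible prefix for dashboards and alerts."""
    return hashlib.sha256(static_prefix.encode("utf-8")).hexdigest()[:16]


tools = ["lookup_order", "issue_refund"]

# Timestamp placed after the static prefix: the fingerprint stays the same.
good_a, _ = build_prompt("You are a support agent.", tools, "now=2025-01-01T10:00:00Z")
good_b, _ = build_prompt("You are a support agent.", tools, "now=2025-01-01T10:05:00Z")

# Timestamp placed inside the system prompt: the fingerprint changes on every request.
bad_a, _ = build_prompt("You are a support agent. now=10:00", tools, "")
bad_b, _ = build_prompt("You are a support agent. now=10:05", tools, "")

print(prefix_fingerprint(good_a) == prefix_fingerprint(good_b))
print(prefix_fingerprint(bad_a) == prefix_fingerprint(bad_b))
```

Logging the fingerprint next to the cache hit rate turns "what changed in the prefix" into a lookup: a new fingerprint value appearing in the logs marks the deploy that broke caching.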
What To Measure
Cache hit rate per workload (cache_read_input_tokens / total input tokens)
Cache write rate (how often new prefixes are written)
Inter-request gap distribution for the same prefix
Cost difference between cached and uncached periods
Prefix stability (did the system prompt or tools change?)
Timestamps, request IDs, user-specific data, or retrieval results before the static prefix invalidate the cache on every request.
TTL shorter than the median inter-request gap means most requests miss.
Provider routing can scatter requests across replicas, reducing cache locality.
Batch mode and enterprise tier may have different cache semantics than real-time mode.
Cache hit rate can look good while quality degrades if the cached prefix includes stale context.
Different providers have different cache-breaking events. Changing tool definitions invalidates the entire cache on Anthropic. Changing the effort/speed setting invalidates it. Model changes invalidate it everywhere.
Calculator Hook
The cache break-even analysis takes the pricing snapshot, prefix tokens, expected reuse rate, and TTL. It outputs break-even calls, savings at given reuse, and waste when reuse fails. The sensitivity analysis shows how cache hit rate changes total cost.
Chapter 8: Model Architecture and Cost
Field Problem
Inference latency is 3x higher than the benchmark predicted. The Mixture-of-Experts model—671B total parameters, 37B active per token—fits in memory, barely. The throughput-per-dollar numbers looked strong. But expert routing creates all-to-all communication traffic between GPUs, and the deployment’s interconnect is too slow to handle it. The benchmarks ran on an NVLink-connected 8-GPU server. The deployment uses PCIe GPUs with Ethernet scale-out.
Mechanism
Mixture-of-Experts models route each token to a subset of “expert” subnetworks. This means three things at once:
Active parameters per token (the experts actually invoked) are a small fraction of total parameters.
Expert dispatch must complete within each decode step, which makes interconnect latency a first-order constraint.
All experts must stay memory-resident across the GPU pool, so total parameters still set the memory bill.
The economic promise of MoE: active compute is much smaller than total parameters, so inference can be faster and cheaper, as long as the experts are in memory and reachable at low latency.
The economic catch: experts may be placed on different GPUs. Routing tokens to their experts and combining results creates all-to-all communication. This communication must complete within each decode step. If the interconnect is slow (Ethernet, PCIe), the communication becomes the bottleneck—not the compute, not the memory bandwidth.
Naive Answer
“MoE has 671B parameters but only activates 37B. It’s basically the cost of a 37B model.”
Not if the 671B parameters need memory residency and expert dispatch needs fast interconnect. The cost is:
Compute: proportional to active parameters (good)
Memory capacity: proportional to total parameters (unchanged)
Memory bandwidth: depends on weight fetch (total resident, but only active read) plus KV
Communication: proportional to expert dispatch traffic and interconnect speed (new cost)
MoE shifts cost from compute to communication and memory residency. Whether that is a net win depends on the topology.
Better Model
The critical question is whether expert dispatch stays inside a low-latency scale-up domain or crosses slower scale-out links.
| Topology | Expert dispatch latency | MoE economics |
|---|---|---|
| 8-GPU NVLink (900 GB/s) | Low | MoE sparsity advantage realized |
| 72-GPU NVLink domain (1.8 TB/s/GPU) | Very low | Best case for large MoE |
| PCIe within server | Medium | Partial advantage |
| Ethernet across servers | High | Sparsity advantage can be lost to communication |
At an enterprise deployment, Deep Packet Inspection on the network path added 5-15ms to every inter-node RPC. For tensor parallelism with thousands of allreduce operations per second, a 70B model ran 10x slower than benchmarks. The fix—a dedicated ML subnet—took six weeks of security reviews. Network topology is a serving physics variable that no benchmark captures.
Expert dispatch on MoE is even more sensitive than tensor-parallel allreduce: traffic is more frequent and less predictable. If the network path is slow, MoE serving pays the penalty on every layer of every decode step.
Worked Example
DeepSeek-V3 style MoE: 671B total, 37B active, 8 experts activated per token.
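Weight residency (FP8): 671B × 1 byte = 671 GB. Does not fit on a single GPU. Weights alone need at least 5× H200 (141 GB each, 705 GB total); 4× H200 (564 GB) and 8× H100 (80 GB each, 640 GB total) both fall short of 671 GB before any KV cache or activation memory is counted. In practice that means 8× H200 in one server, or H100s spread across at least two 8-GPU servers. Expert placement across GPUs means routing traffic at every MoE layer.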
On an 8-GPU NVLink server (900 GB/s bidirectional): expert dispatch for 8 active experts per token across 8 GPUs creates all-to-all traffic. NVLink bandwidth is sufficient for moderate batch sizes. Latency penalty per layer is small.
On Ethernet scale-out (25-100 Gbps): the same expert dispatch is orders of magnitude slower relative to the decode step. Expert-to-expert communication that takes microseconds on NVLink can take milliseconds on Ethernet. This accumulates across layers.
Quantitative latency depends on implementation, batch size, and actual routing patterns; the relative shape is robust.
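The relative shape can be sketched with a bandwidth-only lower bound on dispatch plus combine time per MoE layer. Everything here is a simplifying assumption rather than a measured model: experts are placed uniformly so a fraction (n_gpus - 1) / n_gpus of traffic is remote, each phase moves one hidden vector per token per activated expert, and the hidden size of 7,168 and 16 tokens per GPU are illustrative values. Per-message latency, which hurts Ethernet further, is left as an optional parameter defaulting to zero.

```python
def dispatch_time_s(
    tokens_per_gpu: int,
    top_k: int,
    hidden_size: int,
    bytes_per_elem: int,
    n_gpus: int,
    link_gbytes_per_s: float,
    per_phase_latency_s: float = 0.0,
) -> float:
    """Lower bound on dispatch + combine time for one MoE layer, per GPU."""
    # Share of routed traffic that leaves the local GPU under uniform placement.
    remote_fraction = (n_gpus - 1) / n_gpus
    bytes_per_phase = tokens_per_gpu * top_k * hidden_size * bytes_per_elem * remote_fraction
    transfer_s = bytes_per_phase / (link_gbytes_per_s * 1e9)
    # Two phases per layer: dispatch to the experts, then combine the results.
    return 2 * (transfer_s + per_phase_latency_s)


# 450 GB/s is one direction of a 900 GB/s bidirectional NVLink; 12.5 GB/s is 100 Gbps Ethernet.
nvlink_s = dispatch_time_s(16, 8, 7168, 1, 8, link_gbytes_per_s=450.0)
ethernet_s = dispatch_time_s(16, 8, 7168, 1, 8, link_gbytes_per_s=12.5)
print(f"Ethernet / NVLink per-layer ratio: {ethernet_s / nvlink_s:.0f}x")
```

On bandwidth alone the gap is the ratio of the link speeds. Adding realistic per-message latency widens it, and the per-layer figure is paid again at every MoE layer of every decode step.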
Disaggregated Prefill/Decode and Maturity Gates
MoE is one architectural choice that changes serving economics. Disaggregated prefill/decode is another.
The idea is compelling: separate prefill workers (compute-heavy) from decode workers (memory-bandwidth-heavy) so each can be optimized independently. After three weeks of implementation, latency improves for long-prompt workloads. But overall system reliability drops: KV transfer adds a new failure mode, autoscaling becomes harder because prefill and decode pools scale independently, and debugging multi-hop request paths is more difficult.
The optimization is real. The maturity gate was not met.
In colocated serving, prefill and decode run on the same GPU. A long prefill burst can preempt decode work and spike TPOT for in-flight decode requests. A large decode batch can delay new prefill work and spike TTFT for incoming requests.
Disaggregated serving splits these phases:
Prefill workers process the input prompt and produce KV state.
KV transfer moves the computed KV from prefill worker to decode worker.
Decode workers receive the KV and generate output tokens.
Benefits:
- Prefill and decode can use different hardware configurations (more compute for prefill, more bandwidth for decode).
- Prefill bursts do not interfere with decode latency.
- TTFT and TPOT can be scaled independently.
- Each pool can be right-sized for its workload shape.

Costs:
- KV transfer adds latency and network bandwidth requirements.
- Prefill-to-decode routing becomes a scheduling problem.
- Two pools to autoscale instead of one.
- New failure modes: KV transfer failure, prefill/decode imbalance, transfer timeout.
- Debugging is harder because a single request traverses multiple workers.
In an edge deployment across 100+ locations, T4 GPUs hit thermal limits during peak traffic. Ambient site temperatures of 30-35°C pushed GPU temperatures to the 83°C throttle point. p99 latency doubled at 40+ sites simultaneously. Power capping from 70W to 60W stabilized tail latency at the cost of 15% peak throughput. The hardware datasheet does not specify operating ambient at the upper end of edge-site reality.
This is not about disaggregation. It is about the gap between benchmark conditions and production conditions. The same gap applies when evaluating disaggregation: benchmark demonstrations run in controlled environments with stable traffic, fast interconnect, and simple failure modes. Production adds thermal throttling, network jitter, traffic bursts, cold starts, and operational complexity.
Disaggregation readiness is a gate, not a toggle. Six conditions should be met before disaggregation is the right optimization:
1. Is TTFT dominated by long-prompt prefill? If most prompts are short (under 2K tokens) and TTFT is already within SLO, prefill is not the bottleneck and disaggregation does not help.
2. Is decode latency being interrupted by prefill bursts? If TPOT is stable under traffic, colocated serving is handling the phase interference well. Chunked prefill (interleaving prefill chunks with decode steps) can reduce this interference without full disaggregation.
3. Is there enough traffic to keep separate pools utilized? Disaggregation creates two pools that each need minimum utilization. If traffic is too low, one pool sits idle while the other is overloaded. This wastes capacity.
4. Can KV transfer fit inside the SLO? If the KV state for a request is 500 MB and the interconnect between prefill and decode workers is 25 Gbps Ethernet, transfer alone takes ~160ms. For a 300ms TTFT target, that is more than half the budget consumed by transfer.
5. Can the platform autoscale prefill and decode independently? If the autoscaler cannot distinguish prefill pressure from decode pressure, it will scale the wrong pool or both pools together—losing the independent scaling advantage.
6. Can the team debug multi-step routing failures? When a request fails, is it because prefill failed, KV transfer timed out, the decode worker rejected the KV, or the routing layer misrouted? If the team cannot diagnose these failure modes quickly, disaggregation adds incident duration.
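The KV transfer check in condition 4 is plain arithmetic and easy to automate. A minimal sketch, assuming the link runs at its nominal rate with no protocol overhead or contention (so the result is a lower bound); the function names are illustrative:

```python
def kv_transfer_ms(kv_bytes: float, link_gbps: float) -> float:
    """Time to move a request's KV state over a link, in milliseconds."""
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    return kv_bytes * 8 / (link_gbps * 1e9) * 1e3


def budget_fraction(kv_bytes: float, link_gbps: float, ttft_budget_ms: float) -> float:
    """Share of the TTFT budget consumed by KV transfer alone."""
    return kv_transfer_ms(kv_bytes, link_gbps) / ttft_budget_ms


# 500 MB of KV state over 25 Gbps Ethernet against a 300 ms TTFT target.
transfer = kv_transfer_ms(500e6, 25)
share = budget_fraction(500e6, 25, 300)
print(f"{transfer:.0f} ms transfer, {share:.0%} of the TTFT budget")
```

Running the same check at p99 KV size rather than the mean is the more useful gate, since the longest prompts produce both the largest KV state and the tightest TTFT margin.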
Disaggregation economics for a long-prompt document analysis workload:
| Scenario | Colocated | Disaggregated |
|---|---|---|
| Mean prompt length | 40K tokens | 40K tokens |
| Mean output length | 500 tokens | 500 tokens |
| p99 TTFT | 2,800ms (prefill burst interference) | 1,200ms (dedicated prefill pool, inclusive of ~80ms KV transfer) |
| p99 TPOT | 45ms/tok | 38ms/tok (no prefill interference) |
| KV transfer overhead | 0 | ~80ms per request |
| Operational complexity | Baseline | 2x autoscaling, new failure modes |
| GPU utilization | 72% average (but bursty) | 65% prefill, 80% decode |
| Net TTFT improvement | n/a | 57% |
| Infrastructure cost | Baseline | +15% (two pools, transfer networking) |
Illustrative; real numbers depend on hardware, interconnect, traffic shape, and framework maturity.
For this workload, disaggregation significantly improves TTFT at a 15% infrastructure cost increase. Whether that is worthwhile depends on whether TTFT is the binding constraint. If the customer cares about total processing time for a batch of documents, the 57% TTFT improvement may not matter.
Do not disaggregate until you have: (1) measured that prefill/decode interference is the binding latency constraint, (2) enough traffic to keep both pools utilized, (3) interconnect fast enough for KV transfer within SLO, (4) independent autoscaling for both pools, and (5) observability for multi-hop request debugging.
Chunked prefill is the first optimization to try. It reduces phase interference without the operational cost of disaggregation.
Before deploying an MoE model, verify that your deployment topology keeps expert dispatch inside a low-latency interconnect domain. If your GPUs are connected by Ethernet or slow PCIe, the communication cost may erase the sparsity advantage. Treat “active parameters” as a compute metric, not an economic summary. Memory residency, communication, and topology are the other three cost dimensions.
Before disaggregating prefill/decode, verify the six readiness conditions above. If the workload is short-prompt or low-traffic, the operational cost exceeds the benefit.
What To Measure
Expert routing skew (are some experts overloaded?)
All-to-all communication time per layer
Total communication overhead as fraction of decode step
NVLink vs Ethernet dispatch latency in your deployment
Whether batch size is limited by KV capacity or expert routing
TTFT breakdown: is prefill time, queue wait, or routing delay dominant?
TPOT stability: does TPOT spike during prefill bursts?
KV transfer time: network bandwidth between prefill and decode workers
Prefill and decode pool utilization separately
Request failure rate by stage (prefill, transfer, decode)
Expert routing policies (top-k, load balancing, auxiliary losses) change communication patterns.
Shared experts (used by every token) create a different memory and compute profile than routed experts.
MLA (Multi-head Latent Attention) and similar techniques can reduce KV requirements per token for MoE models.
Pipeline parallelism can reduce per-stage memory pressure but adds inter-stage latency.
Software stack maturity varies. Expert parallelism kernel support in vLLM, SGLang, and TensorRT-LLM is not identical.
Small deployments (1-4 GPUs) do not have enough hardware to split meaningfully for disaggregation.
Short-prompt workloads (support chat, classification) gain little from disaggregation because prefill is already fast.
Mixed workload clusters may benefit more from workload-aware routing than from full disaggregation.
KV compression and quantization can reduce transfer costs but add compute overhead on the sending or receiving side.
Calculator Hook
The serving physics analysis for MoE takes total parameters, active parameters, expert count, activated expert count, GPU count, and interconnect bandwidth. It outputs whether the deployment is compute-limited, memory-limited, or communication-limited. For disaggregation, it takes prompt length distribution, output length distribution, current TTFT/TPOT, interconnect bandwidth, and traffic volume, and outputs expected TTFT improvement, KV transfer overhead, pool sizing, and pass/fail on each readiness condition.
Chapter 9: Productive Capacity and Routing
Field Problem
A capacity planning team reports 78% GPU utilization and calls the cluster “well utilized.” The product team reports that 30% of requests exceed the p99 latency SLO. The support team reports rising escalation rates. Finance asks why the bill is growing while utilization is high.
78% GPU utilization measures how often the GPU is doing something. It does not measure how much of that work produced accepted output at target latency. The GPU can be busy recomputing evicted KV cache, running speculative decode drafts that get rejected, processing retries, or serving requests that will fail quality gates.
Mechanism
Throughput is work attempted: total tokens generated per second.
Goodput is work completed: accepted tokens generated per second, meeting latency, quality, and reliability constraints.
Cost per accepted result ties the two together. Its denominator is accepted requests. Its numerator includes ALL costs, including failed attempts, retries, quality failures, and wasted speculative tokens. This is the connection between serving physics and LCPR: goodput is the bridge between GPU work and accepted economics.
The gap between allocated and productive GPU capacity is large. A 2024 industry survey found the majority of organizations achieve less than 70 percent GPU Allocation Utilization at peak demand, with common figures closer to 10-20 percent. One serverless provider has engineered cold start latency from tens of minutes down to approximately 50 seconds through GPU memory checkpointing and pre-warmed machine buffers, enabling tighter supply-demand matching for inference workloads (State of AI Infrastructure at Scale, 2024; Modal, "Truly Serverless GPUs," 2026).
Naive Answer
“High utilization means the GPU is efficient.”
Utilization can be high while goodput is low. Reasons:
Retries and repairs consume GPU time on work that will not be accepted.
KV cache preemptions and recomputation waste bandwidth on repeated prefill.
Speculative decode drafts consume compute even when the acceptance rate is low.
Quality failures at the model level waste every token from the failed request.
Queue buildup causes SLO violations that are not visible in utilization metrics.
In a high-concurrency interactive workload, speculative decoding with a domain fine-tuned 1B draft model produced roughly 0.90x throughput: slower, not faster. The acceptance rate improved from about 48% to 58%, but under production batch sizes of 10-14 the extra verification compute exceeded the drafting savings. Speculative decoding works best at low concurrency with high acceptance rates; once verification has to run across a wide batch every step, the economics can flip.
The interactive-workload result is workload-dependent, not a general indictment of speculative decoding. A search infrastructure provider reports positive production results with in-house multi-token-prediction draft layers, likely because the search workload's constrained output distribution yields higher draft acceptance rates than open-ended generation. The decision variable is not "does speculative decoding work" but "does your workload's output distribution produce an acceptance rate high enough to offset the draft model's compute under your operating batch size?" (Perplexity, "Hosting Qwen on Blackwell," 2026).
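The decision variable can be made concrete with a simple cost model. The sketch below assumes each draft token is accepted independently with the same probability, which gives the standard expected-tokens-per-step expression for speculative decoding. The cost side is an assumption added for illustration: `draft_cost` is one draft-model step and `verify_cost` is one verification pass, both relative to one ordinary decode step of the target model. The specific values used are hypothetical, not measurements.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification step with i.i.d. acceptance."""
    if not 0.0 <= accept_rate < 1.0:
        raise ValueError("accept_rate must be in [0, 1)")
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def speculative_speedup(accept_rate: float, draft_len: int, draft_cost: float, verify_cost: float) -> float:
    """Tokens per unit cost relative to plain decoding (1.0 means break-even)."""
    step_cost = draft_len * draft_cost + verify_cost
    return expected_tokens_per_step(accept_rate, draft_len) / step_cost


# Low concurrency: verification costs about the same as one decode step.
low_batch = speculative_speedup(accept_rate=0.58, draft_len=4, draft_cost=0.05, verify_cost=1.0)
# Wide batch: verifying the extra positions is no longer close to free.
wide_batch = speculative_speedup(accept_rate=0.58, draft_len=4, draft_cost=0.05, verify_cost=2.3)
print(f"low batch: {low_batch:.2f}x, wide batch: {wide_batch:.2f}x")
```

The acceptance rate is identical in both calls; only the relative verification cost changes, and that alone moves the result from a gain to a loss. Measuring `verify_cost` at the production batch size is therefore as important as measuring the acceptance rate.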
After quantizing a production model, quality failures appeared in the long tail: rare entities, numeric precision errors, and multi-turn conversation degradation after 5-6 turns. Standard A/B tests did not catch these. Continuous log analysis (reviewing real production outputs) was the only reliable detection method. In this deployment, quantization improved throughput by 1.6x and cut memory by 3-4x, but the quality risk is in the tail, not the average.
Quantization increases raw throughput and utilization. But if it introduces quality failures that increase retries, human escalation, or silent degradation, the productive capacity—measured in LCPR—may not improve.
Better Model
Replace utilization with productive capacity:
| Metric | Measures | Useful for |
|---|---|---|
| GPU utilization % | Time GPU is active | Detecting idle hardware |
| Raw throughput (tok/s) | Tokens generated | Peak capacity sizing |
| Goodput (accepted req/s) | SLO-compliant work per second | Production capacity planning |
| Cost per accepted unit | Economics per accepted work | LCPR-driven decisions |
Worked Example
Two configurations for the same RAG workload, 100 requests each:
| Metric | Route A | Route B |
|---|---|---|
| Raw throughput (tok/s) | 1,200 | 800 |
| p99 TTFT | 1,400ms (fails 800ms SLO) | 650ms (passes) |
| p99 TPOT | 55ms/tok | 42ms/tok |
| Quality pass rate | 72% | 91% |
| Requests meeting ALL gates | 58 | 85 |
| Goodput (accepted req/s) | 5.8 | 8.5 |
| Total cost (100 req) | $1.10 | $1.45 |
| Cost per accepted result | $0.019 | $0.017 |
Route A wins on raw throughput and total cost. Route B wins on goodput and cost per accepted result. A benchmark picks Route A. A goodput frontier test picks Route B.
Illustrative; the goodput-frontier shape holds across most production routing decisions.
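The comparison reduces to two divisions, sketched here with the worked-example numbers. The 10-second measurement window is inferred from the example (58 accepted requests at 5.8 req/s) rather than stated in it, and the `RouteResult` class is illustrative.

```python
from dataclasses import dataclass


@dataclass
class RouteResult:
    name: str
    total_requests: int
    accepted_requests: int  # requests that met every latency and quality gate
    total_cost_usd: float   # cost of ALL requests, including rejected ones
    window_s: float

    @property
    def goodput(self) -> float:
        """Accepted requests per second."""
        return self.accepted_requests / self.window_s

    @property
    def cost_per_accepted(self) -> float:
        """All costs in the numerator, accepted requests only in the denominator."""
        if self.accepted_requests == 0:
            return float("inf")
        return self.total_cost_usd / self.accepted_requests


route_a = RouteResult("A", 100, 58, 1.10, window_s=10.0)
route_b = RouteResult("B", 100, 85, 1.45, window_s=10.0)

for route in (route_a, route_b):
    print(f"Route {route.name}: {route.goodput:.1f} req/s, ${route.cost_per_accepted:.3f} per accepted result")

best = min((route_a, route_b), key=lambda r: r.cost_per_accepted)
print(f"Lowest cost per accepted result: Route {best.name}")
```

Returning infinity when nothing is accepted keeps a fully failing route from looking free in a ranking.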
Stateful Routing and Cache Locality
One of the most impactful levers for productive capacity is where the work goes.
Eight replicas behind a round-robin load balancer. Cache hit rate drops from 85% to 12%. TTFT increases 4x. The support chatbot’s cache configuration has not changed. The problem is the load balancer: it sends each request to the next replica in rotation regardless of where the prefix is cached, so the same conversation’s system prompt is cached on replica 3 but the next turn goes to replica 7, where the cache is cold.
Traditional load balancing treats replicas as interchangeable. Inference replicas are not interchangeable. Each replica holds different KV state.
Inference serving systems maintain per-replica state:
KV cache: Each replica holds KV blocks for recently served prefixes. A cache hit requires routing the request to the replica that holds the matching prefix.
Prefix cache: Some systems hash prefixes and store KV blocks by hash. The same prefix hashed on different replicas creates duplicate storage; routing the request to the right replica avoids recomputation.
Session state: Multi-turn conversations benefit from sticky routing to the replica that holds the conversation’s growing context.
Routing policies for inference:
| Policy | Mechanism | When it works | When it fails |
|---|---|---|---|
| Round-robin | Even distribution | Stateless or batch workloads | Destroys cache locality |
| Least-request | Route to least-loaded | When all replicas are cold | Ignores cache state |
| Prefix-aware | Route by prefix hash | High-reuse prefixes | Low-reuse or unique-prefix workloads |
| KV-cache-aware | Route to replica with matching KV | Multi-turn, high cache benefit | Requires cache state visibility |
| Sticky session | Route same session to same replica | Conversational workloads | Creates hotspots if sessions are uneven |
| Fallback | Route to any when target is saturated | Burst handling | Cold start penalty on fallback |
“Use standard HTTP load balancing. It works for web traffic.”
Web server replicas are stateless. Inference replicas are stateful when KV cache, prefix reuse, or session affinity matter. Routing a request to the wrong replica means cache miss → full prefill recomputation → higher TTFT, higher cost, wasted memory on the original replica’s cached state.
Cache-local routing creates two tradeoffs. The first is locality versus load balance: routing to the cache-hot replica improves TTFT and cost but can overload that replica if many requests share the same prefix. The routing layer needs a saturation threshold that routes to the cache-hot replica when it has capacity and spills elsewhere when it does not.
The second is cache diversity versus cache efficiency. If all requests go to the same replica, one replica carries the load and the rest sit idle. If requests are spread evenly, cache hit rates drop. The optimal distribution depends on prefix diversity, traffic shape, and the relative cost of cache misses versus load imbalance.
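A prefix-aware policy with a saturation threshold fits in a few lines. This is a sketch under stated assumptions, not a production router: `pick_replica` is a hypothetical function, load is approximated by in-flight request count, and the home replica is chosen by hashing the prefix modulo the replica count.

```python
import hashlib


def pick_replica(prefix: str, in_flight: list[int], max_in_flight: int) -> int:
    """Route to the prefix's home replica unless it is saturated, then spill."""
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    home = int.from_bytes(digest[:8], "big") % len(in_flight)
    if in_flight[home] < max_in_flight:
        return home  # cache-hot replica still has capacity
    # Saturated: accept a likely cache miss on the least-loaded replica.
    return min(range(len(in_flight)), key=in_flight.__getitem__)


loads = [0] * 8
home = pick_replica("support-system-prompt-v3", loads, max_in_flight=4)

# Saturate the home replica and leave exactly one other replica idle.
loads = [2] * 8
loads[home] = 4
idle = (home + 1) % 8
loads[idle] = 0
spilled = pick_replica("support-system-prompt-v3", loads, max_in_flight=4)
print(f"home replica: {home}, spill target when saturated: {spilled}")
```

Modulo hashing remaps most prefixes whenever the replica count changes, which empties the caches during every scale event. Consistent hashing is the usual way to limit that, at the cost of a more involved lookup.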
Support chatbot: 8 replicas, 1 system prompt + tool prefix (6,000 tokens), 10 concurrent conversations.
| Routing policy | Cache hit rate | Mean TTFT | Effective cost |
|---|---|---|---|
| Round-robin | ~12% (1/8 chance) | 1,200ms | Baseline |
| Prefix-aware | ~85% (routed to prefix holder) | 300ms | ~60% savings on input cost |
| Sticky session | ~90% (conversation state preserved) | 250ms | ~65% savings |
Illustrative; exact hit rates depend on traffic shape, TTL, and replica capacity.
The difference between round-robin and prefix-aware routing is 4x TTFT and 60% cost reduction—not from a better model or cheaper hardware, but from routing the request to the right place.
Measure and report goodput under SLO, not raw throughput or utilization. GPU-hour math without goodput under SLO is spreadsheet fiction.
When running benchmarks, define your SLO (TTFT, TPOT, E2E, quality pass rate) before measuring. Run with your actual prompt/output distribution, not synthetic uniform inputs. Use Poisson arrivals for online serving benchmarks, not closed-loop. Report cost per accepted result, not cost per total request.
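Open-loop Poisson arrivals are straightforward to generate: draw exponentially distributed gaps and send each request at its scheduled time whether or not earlier requests have finished. A minimal sketch using only the standard library; the function name and the fixed seed are illustrative choices.

```python
import random


def poisson_arrival_times(rate_per_s: float, duration_s: float, seed: int = 0) -> list[float]:
    """Send times in seconds for an open-loop load test with Poisson arrivals."""
    if rate_per_s <= 0 or duration_s <= 0:
        raise ValueError("rate_per_s and duration_s must be positive")
    rng = random.Random(seed)  # seeded so a benchmark run can be repeated exactly
    now = 0.0
    times: list[float] = []
    while True:
        now += rng.expovariate(rate_per_s)  # exponential gap with mean 1 / rate
        if now >= duration_s:
            return times
        times.append(now)


schedule = poisson_arrival_times(rate_per_s=50.0, duration_s=100.0)
print(f"{len(schedule)} requests scheduled, first at {schedule[0]:.3f}s")
```

A closed-loop driver waits for each response before sending the next request, so it slows down exactly when the server does and hides queue buildup. Driving from a fixed schedule keeps the offered load independent of server behavior, which is what exposes tail latency under SLO.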
If your workload has reusable prefixes and you use multi-replica serving, routing policy is an economic lever. Implement prefix-aware or KV-cache-aware routing before spending on more hardware. Measure cache hit rate by replica. If hit rates are low but prefix diversity is low (most requests share a few prefixes), routing is the problem.
What To Measure
Goodput: accepted results per second under TTFT, TPOT, and quality gates
Cost per accepted result (all costs in numerator, accepted only in denominator)
Retry rate and retry cost as fraction of total cost
Quality failure rate by failure mode
Speculative decode acceptance rate and net throughput impact
KV eviction rate and recomputation overhead
Cache hit rate per replica
Prefix overlap percentage (how many requests share the top-K prefixes?)
TTFT on cache hit vs cache miss
Routing spillover rate (requests sent to non-ideal replica due to saturation)
Without quality labels, goodput degrades to “latency-constrained throughput”—still useful, but incomplete.
Streaming workloads care about jitter and inter-token latency, not just mean TPOT.
Batch workloads use completion-window SLOs, not interactive TTFT/TPOT.
Multi-turn conversations have different goodput semantics: is the unit a turn or a session?
Low-reuse workloads (unique documents, personalized prompts) do not benefit from prefix-aware routing because every request has a different prefix.
Scaling up replicas dilutes cache locality. Adding replicas to handle traffic can reduce cache hit rate, counterintuitively increasing per-request cost.
Disaggregated prefill/decode complicates routing because the prefill worker and decode worker may be different—and the KV must transfer between them.
Multi-region deployments may not share cache state, so cross-region routing is always a cold start.
Prefix hash collisions, KV eviction, and cache capacity limits mean cache-aware routing is probabilistic, not guaranteed.
Calculator Hook
The goodput calculator takes per-request latency, quality labels, and cost data. It outputs goodput by request rate, cost per accepted unit, and the recommended operating point. The serving physics analysis includes a cache-locality mode that takes prefix diversity, replica count, and traffic shape and reports expected cache hit rate and cost difference by routing policy.
Part 2 Summary
Six mechanisms set inference serving cost and capacity: the hardware roofline and the batch-size frontier, the prefill/decode asymmetry, KV memory economics, prompt caching as a memory hierarchy, MoE topology and disaggregation readiness, and productive capacity under SLO. Each is independently load-bearing; together they are the vocabulary the rest of the book uses. When Part 3 calls a support workload "output-bound," it means the prefill/decode economics apply. When Part 4 says "check KV capacity before choosing dedicated," it means the memory-economics formula. When Part 5 says "measure goodput, not throughput," it means the productive-capacity definition.
The physics explains the economics. The economics determine the decisions.
Evidence Notes for Part 2
| Claim | Type | Source |
|---|---|---|
| Roofline model applies as a bound for inference steps | Public, Derived | Williams et al. 2009 (CACM), adapted for transformer decode |
| Batch amortizes weight fetch but not KV fetch | Derived, Reported | First-principles from serving physics; consistent with Reiner Pope transcript framing |
| Decode is often memory-bandwidth-bound at moderate batch | Derived | Follows from weight-fetch and KV-fetch analysis at typical production batch sizes |
| Prefill is parallel and can be compute-bound | Public | SARATHI (Agrawal et al. 2023), Orca (Yu et al. 2022) |
| Output/input price ratio reflects prefill/decode cost asymmetry | Inferred | Observed across providers; consistent with serving physics but includes margin and product strategy |
| KV cache sizing formula | Derived | Standard transformer architecture formula |
| KV fragmentation causes p99 drift | Reported | Operator proof box 4 (public blog post) |
| KV eviction cascades in mixed short/long workloads | Reported | Operator proof box 6 (public blog post) |
| Prompt cache break-even formula | Derived | From provider-documented write/read pricing |
| Provider cache semantics (TTL, write/read pricing) | Public | Provider docs; exact values in pricing snapshot appendix |
| Prefix caching creates measurement bias | Reported | Operator proof box 10 (public research post) |
| Long context roughly halves concurrent capacity per doubling | Derived | From KV cache sizing formula |
| MoE reduces active compute but adds all-to-all communication | Public, Derived | DeepSeek-V3 report, NVIDIA NVL72 docs, adapted for serving |
| Network DPI can 10x inference latency | Reported | Operator proof box 9 (public blog post) |
| Speculative decoding can be net negative under production batch | Reported | Operator proof box 3 (public blog post) |
| Quantization quality regression in long tail | Reported | Operator proof box 5 (public blog post) |
| Streaming without backpressure is a memory leak | Reported | Operator proof box 8 (public blog post) |
| Hard latency budgets expose what averages hide | Reported | Operator proof box 1 (public blog post) |
| T4 thermal throttling at edge | Reported | Operator proof box 2 (public blog post) |
| Goodput is accepted work per second under SLO | Derived | From SLO definition and standard production metrics |
| GPU utilization can be high while goodput is low | Derived, Reported | First-principles; consistent with multiple operator observations |
| Round-robin routing destroys cache locality | Derived | From prefix cache mechanism; consistent with provider engineering posts |
| Disaggregated P/D requires maturity gates | Public, Opinion | vLLM experimental docs, NVIDIA Dynamo docs; readiness conditions are author’s synthesis |
| Worked examples (7B on A100, KV sizing, cache break-even, goodput comparison) | Synthetic | Illustrative numbers derived from public hardware specs and provider pricing structures |
| Disaggregated P/D sharding: TP=4 prefill, EP=16 decode | Public | Perplexity, “Hosting Qwen on Blackwell,” 2026 |
| MXFP8 works in production; MXFP4 without QAT does not | Public | Perplexity, “Hosting Qwen on Blackwell,” 2026 |
| Speculative decoding positive result with MTP layers | Public | Perplexity, “Hosting Qwen on Blackwell,” 2026 |
| GPU Allocation Utilization commonly 10-20%, majority <70% at peak | Reported | Modal citing State of AI Infrastructure at Scale, 2024 |
| Cold start reduced from tens of minutes to ~50 seconds | Public | Modal, “Truly Serverless GPUs,” 2026 |
All worked examples use synthetic numbers shaped by real hardware specs and provider billing grammar. No numbers are attributed to any specific employer, customer, or deployment. Where operator proof boxes are used, they reference published public blog posts with approved anonymized wording.