Appendix

Calculator Manual And Living Reference

A.1 What The Calculator Does

The calculator is the execution layer for this book. The body teaches the mechanism, the wrong conclusion, and the decision rule. The calculator carries the repeatable math: cost models, sensitivity analysis, break-even gates, and reconciliation views.

The calculator takes four kinds of input:

Traces. Per-request event records from your inference serving system: model, tokens, latency, cache state, quality label, cost.
Invoices. Provider billing data for a matching period: line items, credits, taxes, committed spend, overages.
Evals. Quality labels on outputs: pass, fail, repair needed, human escalation. These set the denominator for accepted work.
Configuration. Workload identity, SLO thresholds, routing policy, pricing snapshot, account revenue. These set the constraints the calculator evaluates against.

The calculator produces views. Each view answers one question. For each view this appendix documents the formula, the inputs, the output, and when not to trust it. The companion app lives at inference-econ.streamlit.app.

Implementation status (which views the app exposes directly vs as templates) lives in the calculator's view registry, not in this prose.

A.2 Calculator Views

The appendix documents fourteen book-facing calculator views. The companion app implements them across thirteen tabs (finance, usage, compliance, and latency share an operating tab). Body prose names views by function; the calculator names each view in view_registry.py for schema stability.

View 1: Workload Profile

Internal name: workload_profile (registry: Workload Profile v1)

Body reference: “workload profile” or “workload identity” (Part 1 Ch 2, Part 3 Ch 10)

What it does: Captures the canonical identity of one workload: class, SLO tier, accepted work unit, token distributions, security constraints, cache eligibility, batch eligibility, owner, and current route.

Inputs:

Field	Source	Required
workload_id	Your naming convention	Yes
workload_class	Taxonomy from Ch 10	Yes
accepted_work_unit	Definition from Ch 2	Yes
slo_class	interactive / near-real-time / background / batch	Yes
ttft_slo_ms	Product requirement	If interactive
e2e_slo_ms	Product requirement	If interactive
input_token_distribution (p50/p95/p99)	Trace analysis	Yes
output_token_distribution (p50/p95/p99)	Trace analysis	Yes
quality_floor	Eval threshold	Yes
security_constraints	Compliance team	Yes

Output: A structured workload record that feeds every other calculator view.

When not to trust it: If token distributions come from fewer than 100 traces, the p95 and p99 are unreliable. If the quality floor has not been validated with human review, the accepted-work denominator is wrong.
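The percentile fields and the small-sample caveat can be sketched as follows. This is a minimal illustration, not the calculator's implementation: the function name, the returned keys, and the `min_traces` parameter are assumptions, and the 100-trace threshold is taken from the caveat above.

```python
import statistics


def token_distribution(token_counts, min_traces=100):
    """Return p50/p95/p99 plus a flag for the small-sample caveat."""
    if len(token_counts) < 2:
        raise ValueError("need at least two traces to estimate percentiles")
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(token_counts, n=100)
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "n_traces": len(token_counts),
        # False means the p95 and p99 should not be trusted yet.
        "tail_reliable": len(token_counts) >= min_traces,
    }
```

Run it once for input tokens and once for output tokens, and carry the `tail_reliable` flag into the workload record rather than dropping it.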

View 2: Trace Event Schema

Internal name: trace_event_schema (registry: Trace Event Schema v1)

Body reference: “trace event schema” or “request event fields” (Part 1 Ch 3, Part 5 Ch 25)

What it does: Defines the minimum fields per inference request needed for economics-grade analysis. This is a schema definition, not a dashboard.

Minimum fields per event:

Field	Type	Purpose
request_id	string	Deduplicate and join
timestamp	ISO 8601	Ordering and window assignment
workload_id	string	Attribution
model_id	string	Pricing lookup
provider	string	Invoice join
input_tokens	int	Cost numerator
output_tokens	int	Cost numerator
cached_input_tokens	int	Cache economics
ttft_ms	float	SLO gate
e2e_ms	float	SLO gate
quality_label	enum	Accepted-work denominator
is_retry	bool	Retry cost allocation
cache_hit	bool	Cache hit rate
route_id	string	Multi-source attribution
cost_usd	float	Trace-derived cost

When not to trust it: If cached_input_tokens is missing or always zero, cache economics calculations will be wrong. If quality_label is missing, the calculator cannot distinguish accepted work from raw throughput.
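One way to hold the schema in code is a dataclass with a small check for the two failure modes above. This is a sketch under assumptions: the field names come from the table, but the class names, the enum spellings (A.1 names the label categories only), and the `schema_warnings` helper are hypothetical and not taken from the calculator's source.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class QualityLabel(Enum):
    # Spellings are assumed; the appendix lists the categories only.
    PASS = "pass"
    FAIL = "fail"
    REPAIR_NEEDED = "repair_needed"
    HUMAN_ESCALATION = "human_escalation"


@dataclass
class TraceEvent:
    request_id: str
    timestamp: str  # ISO 8601
    workload_id: str
    model_id: str
    provider: str
    input_tokens: int
    output_tokens: int
    cached_input_tokens: Optional[int]  # None models a missing field
    ttft_ms: float
    e2e_ms: float
    quality_label: Optional[QualityLabel]  # None models a missing label
    is_retry: bool
    cache_hit: bool
    route_id: str
    cost_usd: float


def schema_warnings(events):
    """Flag the two conditions listed under 'When not to trust it'."""
    warnings = []
    cached = [e.cached_input_tokens for e in events]
    if any(c is None for c in cached) or all(c == 0 for c in cached):
        warnings.append("cached_input_tokens missing or always zero")
    if any(e.quality_label is None for e in events):
        warnings.append("quality_label missing")
    return warnings
```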

View 3: Latency Decomposition

Internal name: latency_decomposition (registry: Latency Decomposition v1)

Body reference: “latency decomposition” or “timing breakdown” (Part 2 Ch 4-7)

What it does: Breaks end-to-end latency into components: queue time, TTFT (prefill), inter-token latency (decode), tool execution, retrieval, and orchestration overhead.

Formula: T_e2e = T_queue + T_ttft + (T_tpot × N_out) + T_tool + T_retrieval + T_orchestration.

Output: Per-request timing breakdown showing which component dominates.

When not to trust it: If the serving engine does not expose queue time separately, the calculator cannot distinguish queueing from prefill. If tool calls are asynchronous, T_tool may overlap with the decode term (T_tpot × N_out), so the components will sum to more than the observed T_e2e.
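The formula above translates directly into a per-request breakdown. The sketch below assumes all timings are in milliseconds and that the components do not overlap; the function name and the example values are illustrative.

```python
def decompose_latency(t_queue, t_ttft, t_tpot, n_out,
                      t_tool=0.0, t_retrieval=0.0, t_orchestration=0.0):
    """Return (T_e2e, dominant component, per-component timings) in ms."""
    components = {
        "queue": t_queue,
        "ttft": t_ttft,
        "decode": t_tpot * n_out,  # T_tpot x N_out
        "tool": t_tool,
        "retrieval": t_retrieval,
        "orchestration": t_orchestration,
    }
    total = sum(components.values())
    dominant = max(components, key=components.get)
    return total, dominant, components


# Hypothetical request: 200 output tokens at 20 ms per token.
total, dominant, parts = decompose_latency(
    t_queue=40.0, t_ttft=300.0, t_tpot=20.0, n_out=200,
    t_tool=500.0, t_retrieval=120.0, t_orchestration=30.0,
)
print(total, dominant)  # 4990.0 decode
```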

View 4: SLO-to-Route Mapping

Internal name: slo_to_route_mapping (registry: SLO-to-Route Mapping v1)

Body reference: “SLO-to-route mapping” or “routing policy” (Part 3 Ch 10, Part 4 Ch 20)

What it does: Maps each workload’s SLO constraints to eligible routes. A route is eligible only if it passes all gates: latency, quality, security, and cost.

Inputs: Workload profile (View 1), route candidates with measured latency and quality, security constraints.

Decision rule: A route is eligible if measured TTFT p95 ≤ ttft_slo_ms, measured E2E p95 ≤ e2e_slo_ms, measured eval pass rate ≥ quality_floor, and the route satisfies every security gate in the workload profile.

Output: Eligible route set per workload, ranked by cost per accepted work.

When not to trust it: If latency was measured under synthetic load and production concurrency is higher, production tail latency (p95 and p99) will be worse than measured. If quality was measured on a different prompt distribution, eval pass rate may not transfer.
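The decision rule can be sketched as a filter followed by a sort. The `RouteMeasurement` class and its field names are hypothetical; the security gates are collapsed into one boolean here, on the assumption that View 11 has already evaluated them.

```python
from dataclasses import dataclass


@dataclass
class RouteMeasurement:
    route_id: str
    ttft_p95_ms: float
    e2e_p95_ms: float
    eval_pass_rate: float
    passes_security: bool          # result of every security gate
    cost_per_accepted_usd: float   # used only for ranking


def eligible_routes(routes, ttft_slo_ms, e2e_slo_ms, quality_floor):
    """Keep routes that pass every gate, cheapest per accepted work first."""
    passing = [
        r for r in routes
        if r.passes_security
        and r.ttft_p95_ms <= ttft_slo_ms
        and r.e2e_p95_ms <= e2e_slo_ms
        and r.eval_pass_rate >= quality_floor
    ]
    return sorted(passing, key=lambda r: r.cost_per_accepted_usd)
```

Note that the cheapest route overall can be excluded: a route that fails the security gate or the TTFT gate never reaches the ranking step.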

View 5: Cost Per Accepted Work (LCPR)

Internal name: cost_per_accepted_work (registry: Cost Per Accepted Work v1; also exposed in the app as LCPR via the Compare tab)

Body reference: LCPR (Part 1 Ch 1-2, every worked example)

What it does: Calculates loaded cost per result, normalized to accepted output.

Formula (Derivation 6): LCPR = (C_inference + C_eval + C_human + C_ops + delta) / A

Where: C_inference is the sum of per-request costs from traces (first attempt + retry + repair, with cache pricing applied); C_eval is the eval-grader cost (LLM grader calls and rubric checks); C_human is human review, escalation, and repair labor; C_ops is amortized operational overhead (monitoring, on-call, deployment) allocated to the workload; delta = C_invoice − C_trace is the discrepancy between trace-derived cost and the provider invoice for the same period; and A is accepted work units (requests passing quality, latency, and reliability gates).

What the result means: LCPR is the real unit cost your business operates on. If LCPR exceeds revenue per accepted unit, the workload loses money regardless of what the token price page says.

When not to trust it: If human escalation cost is excluded, LCPR understates the real cost. If accepted work is defined loosely (no quality gate), LCPR overstates efficiency. If the invoice window does not match the trace window, delta will be noisy.
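A minimal sketch of the Derivation 6 formula, with parameters named after the symbols above. The dollar figures in the usage lines are hypothetical and are not the seed values from the worked examples.

```python
def lcpr(c_inference, c_eval, c_human, c_ops, c_invoice, c_trace, accepted):
    """Loaded cost per result: (C_inference + C_eval + C_human + C_ops + delta) / A."""
    if accepted <= 0:
        raise ValueError("accepted work units must be positive")
    delta = c_invoice - c_trace  # invoice vs trace discrepancy, same period
    return (c_inference + c_eval + c_human + c_ops + delta) / accepted


# Hypothetical daily figures in USD.
unit_cost = lcpr(
    c_inference=500.0, c_eval=100.0, c_human=4000.0, c_ops=600.0,
    c_invoice=520.0, c_trace=500.0, accepted=28_000,
)
print(round(unit_cost, 4))  # 0.1864
```

In this made-up case human review is the largest term, which is why excluding it understates the result so badly.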

View 6: Spend Movement Waterfall

Internal name: spend_movement_waterfall (registry: Spend Movement Waterfall v1)

Body reference: “spend movement waterfall” or “month-over-month spend” (Part 5 Ch 26)

What it does: Decomposes month-over-month spend change into contributing factors: volume, mix, price, cache rate, quality, retry rate, and new workloads.

Output: A waterfall showing how much of the spend change is explained by each factor. The unexplained residual is the investigation target.

When not to trust it: If workload attribution is inconsistent across months, the mix column will absorb real changes as apparent mix shifts.

View 7: Commitment Utilization

Internal name: commitment_utilization (registry: Commitment Utilization v1)

Body reference: “commitment utilization” or “credit coverage” (Part 4 Ch 18, Part 5 Ch 26)

What it does: Tracks committed spend against actual usage. Shows burn rate, projected exhaustion date, overage exposure, and utilization percentage.

When not to trust it: If credits apply to some models but not others, and the calculator does not model credit scope, utilization will be wrong.
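The appendix does not give a formula for this view, so the following is only one plausible sketch. It assumes linear burn from the start of the term and a single commitment with no credit scoping; the function name, parameters, and returned keys are all hypothetical.

```python
from datetime import date, timedelta


def commitment_status(committed_usd, spent_usd, term_start, term_end, as_of):
    """Linear-burn projection; ignores credit scope and seasonality."""
    days_elapsed = (as_of - term_start).days
    if days_elapsed <= 0 or spent_usd <= 0:
        raise ValueError("need positive elapsed days and spend to project burn")
    burn_per_day = spent_usd / days_elapsed
    remaining = max(0.0, committed_usd - spent_usd)
    projected_total = burn_per_day * (term_end - term_start).days
    return {
        "utilization_pct": 100.0 * spent_usd / committed_usd,
        "burn_usd_per_day": burn_per_day,
        "projected_exhaustion": as_of + timedelta(days=remaining / burn_per_day),
        "overage_exposure_usd": max(0.0, projected_total - committed_usd),
    }
```

If credits cover only some models, compute this per credit scope instead of per account; otherwise the caveat above applies.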

View 8: Variance Analysis

Internal name: variance_drilldown (registry: Variance Drilldown v1)

Body reference: “variance analysis” or “cost/latency root cause” (Part 5 Ch 25, 26)

What it does: Identifies the top contributors to cost or latency variance between periods or between expected and actual. Decomposes variance into rate (price change), volume (traffic change), mix (workload shift), efficiency (cache rate, retry rate, output length), and quality (eval pass rate change).

When not to trust it: If traces have inconsistent model aliases across periods, rate variance will appear where none exists.

View 9: Account Margin Model

Internal name: account_margin_model (registry: Account Margin Model v1)

Body reference: “account margin model” or “account economics” (Part 5 Ch 25, 26)

What it does: Joins LCPR, account revenue, support cost, and operational overhead to calculate gross margin per account or per workload.

Formula (Derivation 6 extended): margin = (R_account − LCPR × A − C_support − C_overhead_allocated) / R_account, where R_account is account revenue, A is accepted work units, C_support is account-specific support cost, and C_overhead_allocated is the allocated share of platform overhead.

When not to trust it: If account revenue includes bundled products that are not inference-related, margin attribution is ambiguous.
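The margin formula as a small function, using the symbols defined above. The account figures in the usage lines are hypothetical.

```python
def account_margin(r_account, lcpr, accepted, c_support, c_overhead_allocated):
    """Gross margin as a fraction of account revenue."""
    if r_account <= 0:
        raise ValueError("account revenue must be positive")
    loaded_cost = lcpr * accepted  # LCPR x A
    return (r_account - loaded_cost - c_support - c_overhead_allocated) / r_account


# Hypothetical account: $10,000 revenue, 30,000 accepted units at $0.20 LCPR.
margin = account_margin(
    r_account=10_000.0, lcpr=0.20, accepted=30_000,
    c_support=500.0, c_overhead_allocated=500.0,
)
print(f"{margin:.1%}")  # 30.0%
```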

View 10: Usage Signals

Internal name: usage_signals (registry: Usage Signals v1)

Body reference: “usage signals” or “expansion/risk signals” (Part 5 Ch 26)

What it does: Tracks leading indicators of workload health: volume trend, cache hit rate trend, retry rate trend, p95 latency trend, eval drift, context length growth, and provider error rate.

When not to trust it: If telemetry sampling rate changes between periods, volume trends will reflect sampling, not usage.

View 11: Security and Compliance Filter

Internal name: security_compliance_filter (registry: Security and Compliance Filter v1)

Body reference: “security/compliance route filter” (Part 3 Ch 10, Part 4 Ch 21)

What it does: Evaluates each candidate route against security constraints before any performance or cost comparison. A route that fails any security gate is ineligible regardless of price.

Gates: data residency, zero-data-retention, private networking, model license, logging/audit requirements, regulatory scope (HIPAA, SOC2, etc.).

When not to trust it: If security requirements are assumed rather than confirmed with the compliance team, the filter may be too permissive or too restrictive.

View 12: Cache Break-Even

Internal name: cache_break_even (registry: Cache Policy Gate v1)

Body reference: “cache break-even analysis” or “cache eligibility” (Part 2 Ch 7)

What it does: Calculates how many requests must reuse a cached prefix within TTL for caching to save money.

Formula (Derivation 3): N_break_even = P_write / (P_uncached − P_read), where P_write is the cache write premium per million input tokens, P_uncached is the standard input price per million tokens, and P_read is the cache read price per million tokens. Cache saves money only when the prefix is reused at least N_break_even times within TTL.

Inputs: Cache write price, cache read price, uncached input price, TTL, measured inter-request gap, prefix stability.

What the result means: If your measured reuse count within TTL is below break-even, caching costs more than it saves.

When not to trust it: If volatile content (timestamps, request IDs, retrieval results) appears inside or ahead of the static prefix, the effective hit rate will be lower than measured.
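A sketch of the Derivation 3 gate. It assumes P_write is the premium over the standard input price, as defined above; if a provider quotes a total cache-write price instead, subtract the uncached price before calling. The prices in the usage lines are hypothetical, not snapshot values.

```python
import math


def cache_break_even(p_write_premium, p_uncached, p_read):
    """Reuses within TTL needed before caching saves money (prices per Mtok)."""
    saving_per_read = p_uncached - p_read
    if saving_per_read <= 0:
        return math.inf  # cached reads are no cheaper, so caching never pays
    return p_write_premium / saving_per_read


def caching_saves_money(reuses_within_ttl, p_write_premium, p_uncached, p_read):
    return reuses_within_ttl >= cache_break_even(p_write_premium, p_uncached, p_read)


# Hypothetical prices in USD per million input tokens.
n = cache_break_even(p_write_premium=0.75, p_uncached=3.00, p_read=0.30)
print(round(n, 3))  # 0.278
```

With these example prices a single reuse inside the TTL clears the gate, while a prefix that is written and never read costs the full premium.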

View 13: KV Memory Sizing

Internal name: kv_memory_sizing (registry: KV Capacity Envelope v1)

Body reference: “KV memory sizing” or “context-length capacity” (Part 2 Ch 6)

What it does: Calculates maximum concurrent sequences at a given context length on a given GPU.

Formula (Derivation 2): N_max = (HBM_total − M_weights − M_activations) / (2 × L × H_kv × d_head × ctx × bytes_per_elem), where HBM_total is per-GPU memory, M_weights is model weight memory, M_activations is per-sequence activation reserve, L is the number of transformer layers, H_kv is the number of KV heads (after GQA), d_head is per-head dimension, ctx is sequence context length, and bytes_per_elem reflects KV dtype (fp16 = 2, fp8 = 1).

What the result means: This is the hard ceiling on concurrent conversations for a given model, hardware, and context length. The production-safe limit is lower after accounting for headroom, burst, and preemption avoidance.

When not to trust it: If the model uses MLA, sliding-window attention, or cross-request prefix sharing, the formula overestimates per-token KV cost. If beam search or n-sampling is active, live sequences multiply beyond user-facing concurrency.
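A sketch of the Derivation 2 ceiling. It treats M_activations as one fixed reserve in bytes and floors the result, since a fractional sequence cannot be admitted. The model shape and memory figures in the usage lines are illustrative assumptions, not measurements of any particular deployment.

```python
GIB = 1024 ** 3


def kv_max_sequences(hbm_total, m_weights, m_activations,
                     n_layers, n_kv_heads, d_head, ctx, bytes_per_elem):
    """Hard ceiling on concurrent sequences at full context length (bytes in)."""
    # The factor 2 covers the key tensor and the value tensor.
    kv_bytes_per_seq = 2 * n_layers * n_kv_heads * d_head * ctx * bytes_per_elem
    free_bytes = hbm_total - m_weights - m_activations
    return max(0, free_bytes // kv_bytes_per_seq)


# Illustrative: 80 GiB GPU, 16 GiB weights, 4 GiB reserve, fp16 KV cache.
n_max = kv_max_sequences(
    hbm_total=80 * GIB, m_weights=16 * GIB, m_activations=4 * GIB,
    n_layers=32, n_kv_heads=8, d_head=128, ctx=8192, bytes_per_elem=2,
)
print(n_max)  # 60
```

In this example each sequence needs exactly 1 GiB of KV cache at 8,192 tokens, so doubling the context length halves the ceiling.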

View 14: Dedicated Break-Even

Internal name: dedicated_break_even (registry: Dedicated Utilization Gate v1)

Body reference: “dedicated capacity break-even” (Part 4 Ch 18)

What it does: Calculates the minimum utilization at which dedicated capacity is cheaper than serverless.

Formula (Derivation 4): u_required = (C_dedicated_hourly + C_ops_hourly) / (G_slo × P_serverless × 3600), where C_dedicated_hourly is the dedicated capacity hourly rate, C_ops_hourly is allocated operational overhead per hour, G_slo is goodput in accepted requests per second under the SLO, and P_serverless is the equivalent serverless price per accepted request. Dedicated is cheaper only when actual utilization exceeds u_required.

What the result means: If your actual weekly-average utilization is below u_required, stay serverless or use hybrid (base dedicated + burst serverless).

When not to trust it: If goodput (G_slo) was measured with synthetic traffic that does not match production prompt/output distributions, the break-even will be optimistic. If operational overhead (C_ops) excludes on-call, deployment tooling, and incident response, the gate is too permissive.
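A sketch of the Derivation 4 gate, with parameters named after the symbols above. The rates in the usage lines are hypothetical.

```python
def required_utilization(c_dedicated_hourly, c_ops_hourly,
                         g_slo_rps, p_serverless_per_request):
    """Minimum utilization at which dedicated capacity beats serverless."""
    # Serverless cost of the same hour of work at 100% utilization.
    serverless_hourly_at_full = g_slo_rps * p_serverless_per_request * 3600
    if serverless_hourly_at_full <= 0:
        raise ValueError("goodput and serverless price must be positive")
    return (c_dedicated_hourly + c_ops_hourly) / serverless_hourly_at_full


# Hypothetical: $20/h capacity, $5/h ops, 5 accepted req/s, $0.002 per request.
u_required = required_utilization(20.0, 5.0, 5.0, 0.002)
print(f"{u_required:.0%}")  # 69%
```

A result above 1.0 means dedicated capacity cannot break even at any utilization under these inputs.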

A.3 Running The Worked Examples

The companion repo contains three worked examples. Each has a calculator seed that provides the inputs and a workload identity file that describes the workload. To run an example:

Copy the seed file.
Fill null fields from source-snapshots/2026-05-12/providers.yaml or your own pricing data.
Load the seed into the calculator.
Review the views listed in the seed’s calculator_views field.

Example 1: Support Answer Trace-to-Loaded-Cost

Seed: examples/support-answer.trace-margin.v1/calculator-seed.yaml

Question it answers: What does one accepted support answer actually cost after retries, eval failures, cache misses, and human escalation?

Views exercised: LCPR, trace-to-loaded-cost review, cache policy gate.

What to look for:

- The gap between naive cost (inference cost / total queries) and loaded cost (LCPR). The seed mirrors the opener fixture: a regional bank answer-drafting workload at ~30,247 first-attempt queries per workday, 28,674 accepted answers, $503/day inference, $5,287/day loaded cost, $0.184 LCPR, $0.0166 naive per-query cost, and a roughly 10x loaded-to-naive ratio.
- The cache hit rate delta. The seed shows expected 60% vs actual 35%. The cache break-even view shows whether caching is still net positive at 35%.
- Eval pass rate. Roughly one in four attempts does not produce accepted work, but every attempt costs money.
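The headline figures for this example can be reproduced from the four daily totals quoted above. The variable names are illustrative; the numbers are the ones stated in the text.

```python
first_attempt_queries = 30_247
accepted_answers = 28_674
inference_usd_per_day = 503.0
loaded_usd_per_day = 5_287.0

# Naive view: inference spend spread over every first-attempt query.
naive_cost_per_query = inference_usd_per_day / first_attempt_queries

# Loaded view: all costs spread over accepted answers only.
lcpr = loaded_usd_per_day / accepted_answers

ratio = lcpr / naive_cost_per_query

print(round(naive_cost_per_query, 4))  # 0.0166
print(round(lcpr, 3))                  # 0.184
print(round(ratio, 1))                 # 11.1
```

The ratio comes out slightly above 11, consistent with the "roughly 10x" description.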

Example 2: Coding Agent Task Lifecycle

Seed: examples/coding-agent.lifecycle.v1/calculator-seed.yaml

Question it answers: What does one accepted bug fix cost across an entire agent session with sub-agent calls, tool loops, compaction, and retries?

Views exercised: LCPR, cache policy gate, goodput frontier test.

What to look for:

- Total input tokens processed (~3.5M) vs output tokens (~85K). The input-to-output ratio is roughly 41:1. Input cost dominates unless cache hit rate is high.
- Cache hit rate difference between main agent (82%) and sub-agents (45%). Sub-agents start with fresh context and cannot reuse the main agent’s prefix.
- Compaction events as cost events. Each compaction discards cached context and forces a cache write on the next turn.
- The overall acceptance rate (90%) includes first-pass success (65%) plus repaired tasks (25%). The 10% that require manual takeover have human cost outside the inference bill.

Example 3: Benchmark Audit

Seed: examples/support-rag-answer-drafting.audit.v1/calculator-seed.yaml

Question it answers: Does the benchmark winner change when methodology errors are corrected?

Views exercised: goodput frontier test, routefit matrix, LCPR.

What to look for:

- The naive benchmark uses closed-loop arrivals, cold cache, excluded cold starts, no retries, and mean latency only. The corrected benchmark uses Poisson arrivals, warm cache, included cold starts, retry policy, and p99 latency.
- Route A wins on mean throughput. Route B wins on goodput (accepted work per second under SLO). The missing metadata checklist in the seed documents exactly which methodology gaps created the false winner.
- The eval pass rate gap: 72% (Route A) vs 91% (Route B). Quality is the largest contributor to the cost-per-accepted-work reversal.

A.4 Source Snapshot Schema

Provider prices belong in dated snapshots, not in body prose. The snapshot schema lives in source-snapshots/<YYYY-MM-DD>/providers.yaml (alongside hardware.yaml, cache-semantics.yaml, benchmark-sources.yaml, and model-licenses.yaml).

Required fields per provider entry:

Field	Type	Purpose
provider	string	Provider name
model	string	Model identifier
pricing_tier	string	online / batch / dedicated
input_per_mtok	float	USD per million input tokens
output_per_mtok	float	USD per million output tokens
cache_write_per_mtok	float or null	Cache write premium
cache_read_per_mtok	float or null	Cache read price
cache_ttl	string or null	TTL specification
context_window	int	Maximum context length
batch_discount_pct	float or null	Batch API discount
source_url	string	Official pricing page URL
accessed_date	string	ISO 8601 date of last verification
notes	string or null	Semantic caveats
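A snapshot entry can be checked against the table above before it is committed. This is a sketch, not the repo's validator: the helper name and error strings are hypothetical, and it assumes the YAML has already been loaded into a plain dict with `accessed_date` kept as a string.

```python
from datetime import date

# (field name, accepted Python types, nullable), following the table above.
PROVIDER_FIELDS = [
    ("provider", (str,), False),
    ("model", (str,), False),
    ("pricing_tier", (str,), False),
    ("input_per_mtok", (int, float), False),
    ("output_per_mtok", (int, float), False),
    ("cache_write_per_mtok", (int, float), True),
    ("cache_read_per_mtok", (int, float), True),
    ("cache_ttl", (str,), True),
    ("context_window", (int,), False),
    ("batch_discount_pct", (int, float), True),
    ("source_url", (str,), False),
    ("accessed_date", (str,), False),
    ("notes", (str,), True),
]


def validate_provider_entry(entry):
    """Return a list of problems; an empty list means the entry is well-formed."""
    problems = []
    for name, types, nullable in PROVIDER_FIELDS:
        if name not in entry:
            problems.append(f"missing field: {name}")
        elif entry[name] is None:
            if not nullable:
                problems.append(f"null not allowed: {name}")
        elif not isinstance(entry[name], types):
            problems.append(f"wrong type: {name}")
    if isinstance(entry.get("accessed_date"), str):
        try:
            date.fromisoformat(entry["accessed_date"])
        except ValueError:
            problems.append("accessed_date is not an ISO 8601 date")
    return problems
```

Explicit nulls pass for the nullable fields, but an omitted key does not. That keeps "not offered" distinct from "not recorded".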

Refresh protocol:

Before any publication event, visit each source_url and verify prices.
If a price has changed, create a new snapshot directory with the current date.
Do not overwrite old snapshots. They are the historical record.
Update calculator seeds to reference the new snapshot.

Evidence type for snapshot data: PUBLIC when prices come from official pricing pages. If a price is inferred, modeled, reported by a third party, or unavailable, label it explicitly and add a note explaining the evidence source.

A.5 Formulas Reference

The six derivations in the body are the mathematical spine of the calculator. Each derivation kills a wrong conclusion that survives in naive analysis.

Derivation 1: Batch Amortization / Roofline (Part 2, Ch 4-5)

Wrong conclusion: “GPU-hour divided by peak tokens/sec is my cost.”

Calculator views: goodput frontier test, serving physics lens.

Derivation 2: KV Cache Memory Sizing (Part 2, Ch 6)

Wrong conclusion: “Context length is just a model setting.”

Calculator view: kv capacity envelope.

Derivation 3: Prompt-Cache Break-Even (Part 2, Ch 7)

Wrong conclusion: “Caching is a discount.”

Calculator view: cache policy gate.

Derivation 4: Dedicated Utilization (Part 4, Ch 18)

Wrong conclusion: “Dedicated is cheaper because the hourly rate looks low.”

Calculator view: dedicated utilization gate.

Derivation 5: Goodput Under SLO (Part 5, Ch 24)

Wrong conclusion: “Peak throughput is capacity.”

Calculator view: goodput frontier test.

Derivation 6: Trace-to-Loaded-Cost Reconciliation (Part 5, Ch 25)

Wrong conclusion: “The provider bill is my margin model.”

Calculator view: trace-to-loaded-cost review.

A.6 Glossary

Terms are defined at first use in the body. This glossary collects them for reference.

Accepted work unit. A completed inference result that passes all quality, latency, and reliability gates. The denominator for LCPR. Defined in Part 1 Ch 2.

Batch size. The number of sequences processed simultaneously in one scheduler step. Larger batches amortize weight fetch but increase KV memory pressure and can degrade tail latency. Defined in Part 2 Ch 5.

Cache hit rate. The fraction of input tokens served from prefix cache rather than recomputed. An economic variable, not only a performance metric. Defined in Part 2 Ch 7.

Compaction. Summarizing or truncating an agent’s conversation history to fit within the context window. A cost event (discards cached context, forces cache write) and a quality event (summary may lose information). Defined in Part 3 Ch 12.

Decode. The autoregressive generation phase where output tokens are produced one at a time. Often memory-bandwidth-bound. Defined in Part 2 Ch 5.

Delta. The discrepancy between trace-derived cost and provider invoice for the same period. Investigate if delta exceeds 5%. Defined in Part 5 Ch 25.

E2E latency. End-to-end latency from request arrival to last output token delivered. Defined in Part 2.

Eval pass rate. The fraction of outputs passing the quality gate. Directly affects the accepted-work denominator and therefore LCPR. Defined in Part 5 Ch 23.

Goodput. Accepted requests per unit time, where “accepted” means passing latency, quality, and reliability gates. The correct capacity metric. Defined in Part 5 Ch 24.

GQA (Grouped Query Attention). An attention variant where multiple query heads share fewer KV heads, reducing KV cache memory. Defined in Part 2 Ch 6.

HBM (High Bandwidth Memory). GPU memory. Capacity determines how many weights and KV cache entries fit. Bandwidth determines how fast they can be read. Defined in Part 2 Ch 4.

KV cache. Key and value tensors stored per layer per token for each live sequence. The hidden memory bill of inference. Defined in Part 2 Ch 6.

LCPR (Loaded Cost Per Result). The book’s core economic unit. Total loaded cost (inference + invoice delta + evals + repairs + human escalation + ops overhead) divided by accepted work units. Defined in Part 1 Ch 2.

MoE (Mixture of Experts). A model architecture where only a subset of parameters is active per token, improving compute efficiency but creating routing, all-to-all communication, and memory-capacity tradeoffs. Defined in Part 2 Ch 8.

p99 latency. The latency at the 99th percentile. The tail that SLOs must account for. Averages hide p99. Defined in Part 2 Ch 4.

Prefill. The phase where all input tokens are processed in parallel to populate the KV cache. Compute-heavy. Defined in Part 2 Ch 5.

Prefix caching. Reusing KV cache entries from a previous request that shares the same prompt prefix. An economic mechanism when reuse exceeds break-even. Defined in Part 2 Ch 7.

Roofline. A model that identifies whether a computation is limited by compute throughput or memory bandwidth. Applied to inference to explain why cost changes with batch size. Defined in Part 2 Ch 4.

Route. A specific combination of provider, model, serving mode, and configuration used to serve a workload. Defined in Part 3 Ch 10.

Routefit matrix. The framework for mapping workload identity and SLO constraints to eligible routes, ranked by cost per accepted work. Defined in Part 4 Ch 15.

Serving physics. The hardware and software constraints that determine inference cost and latency: memory bandwidth, compute throughput, KV capacity, batch scheduling, queueing, and topology. Defined in Part 2.

SLO (Service Level Objective). The internal target for latency, quality, and reliability. Not the same as an SLA (contractual commitment to customers). Defined in Part 1.

Snapshot. A dated record of provider prices, hardware specs, or model availability. Snapshots are immutable. New observations create new snapshots. Defined in this appendix.

TPOT (Time Per Output Token). Latency per generated token during decode. TPOT times output length determines decode duration. Defined in Part 2 Ch 5.

Trace. A per-request record containing tokens, latency, cost, quality label, and routing metadata. The atomic unit of inference economics analysis. Defined in Part 1 Ch 3.

Trace autopsy. The diagnostic walkthrough from one trace to root cause. The book’s primary diagnostic framework. Defined in the Trace Autopsy article (see front matter cross-links).

Trace-to-margin review. The framework for joining traces, invoices, evals, and account data to calculate margin. The book’s reconciliation instrument. Defined in Part 5 Ch 25.

TTFT (Time To First Token). Latency from request arrival to first output token. Dominated by prefill computation and queue time. Defined in Part 2 Ch 5.

Utilization. The fraction of provisioned capacity that is doing productive work. High utilization can hide unproductive recomputation, allocation waste, or queueing. The correct metric is goodput, not utilization. Defined in Part 2 Ch 9.

Workload class. A category of inference work where the economics, SLOs, failure modes, and operating decisions differ enough to justify separate treatment. Defined in Part 3 Ch 10.

A.7 What Lives In The Repo

The companion repo extends this appendix with material that does not belong in static pages:

Repo artifact	Purpose	Refresh cadence
`source-snapshots/`	Dated provider pricing snapshots	Before any publication event
`examples/`	Worked example seeds and workload identities	When examples are updated
`calculator/`	Calculator source code and tests	Ongoing
`calculator/view_registry.py`	Canonical app/view inventory and evidence status	When views change
`schemas/`	Planned YAML schemas for workload identity, trace event, calculator seed	When formal schemas are added
`research/`	Research notes, derivation drafts, and claim matrices	Archived after publication

Evidence note [A-1]: All calculator formulas are DERIVED from assumptions shown in the body. All prices in source snapshots are PUBLIC. All worked example data is SYNTHETIC, shaped by public sources but not measured from production systems.

End of Appendix.

Colophon

The Inference Field Guide was set in three typefaces from Google Fonts: Instrument Serif (italic display, chapter titles), Newsreader (body), and JetBrains Mono (numerics, ruled labels, code).

Paper: #faf5e9 — parchment, warm cream
Primary ink (moss): #3A4F2A — headlines, rules, mad-libs chips, primary data, ornaments
Secondary ink (oxblood): #5C2A1E — sidenotes, "Where this breaks" callouts, evidence tags
Display face: Instrument Serif (italic) — chapter titles, landing
Body face: Newsreader (opsz 6–72) — running prose
Mono face: JetBrains Mono — data rows, variable chips, top-rule captions
Built with: Pelican (Python static site generator). Source: github.com/Sohailm25/Sohailm25.github.io.
Companion calculator: sohailmo.ai/book/calculator/ — Marimo notebook hosting the LCPR formulas, sensitivity sweeps, and break-even analysis used throughout this book.
Last revised: 2026-05-18