Calculator Manual And Living Reference
A.1 What The Calculator Does
The calculator is the execution layer for this book. The body teaches the mechanism, wrong conclusion, and decision rule. The calculator carries the repeatable math: cost models, sensitivity analysis, break-even gates, and reconciliation views.
The calculator takes four kinds of input:
Traces. Per-request event records from your inference serving system: model, tokens, latency, cache state, quality label, cost.
Invoices. Provider billing data for a matching period: line items, credits, taxes, committed spend, overages.
Evals. Quality labels on outputs: pass, fail, repair needed, human escalation. These set the denominator for accepted work.
Configuration. Workload identity, SLO thresholds, routing policy, pricing snapshot, account revenue. These set the constraints the calculator evaluates against.
The calculator produces views. Each view answers one question. For each view this appendix documents the formula, the inputs, the output, and when not to trust it. The companion app lives at inference-econ.streamlit.app.
Implementation status (which views the app exposes directly vs as templates) lives in the calculator's view registry, not in this prose.
A.2 Calculator Views
The appendix documents fourteen book-facing calculator views. The
companion app implements them across thirteen tabs (finance, usage,
compliance, and latency share an operating tab). Body prose names views
by function; the calculator names each view in
view_registry.py for schema stability.
View 1: Workload Profile
Internal name: workload_profile (registry: Workload Profile v1)
Body reference: “workload profile” or “workload identity” (Part 1 Ch 2, Part 3 Ch 10)
What it does: Captures the canonical identity of one workload: class, SLO tier, accepted work unit, token distributions, security constraints, cache eligibility, batch eligibility, owner, and current route.
Inputs:
| Field | Source | Required |
|---|---|---|
| workload_id | Your naming convention | Yes |
| workload_class | Taxonomy from Ch 10 | Yes |
| accepted_work_unit | Definition from Ch 2 | Yes |
| slo_class | interactive / near-real-time / background / batch | Yes |
| ttft_slo_ms | Product requirement | If interactive |
| e2e_slo_ms | Product requirement | If interactive |
| input_token_distribution (p50/p95/p99) | Trace analysis | Yes |
| output_token_distribution (p50/p95/p99) | Trace analysis | Yes |
| quality_floor | Eval threshold | Yes |
| security_constraints | Compliance team | Yes |
Output: A structured workload record that feeds every other calculator view.
When not to trust it: If token distributions come from fewer than 100 traces, the p95 and p99 are unreliable. If the quality floor has not been validated with human review, the accepted-work denominator is wrong.
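The 100-trace threshold can be checked at the moment the distributions are computed. The following is a minimal sketch, assuming token counts have already been pulled from traces into a plain list; the names token_percentiles, MIN_TRACES, and reliable_tail are illustrative and are not part of the calculator's API.

```python
from statistics import quantiles

MIN_TRACES = 100  # threshold from the "when not to trust it" note above


def token_percentiles(token_counts: list[int]) -> dict:
    """Return p50/p95/p99 for a token-count sample plus a small-sample flag."""
    if len(token_counts) < 2:
        raise ValueError("need at least two traces to compute percentiles")
    # quantiles(..., n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = quantiles(token_counts, n=100, method="inclusive")
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "reliable_tail": len(token_counts) >= MIN_TRACES,
    }


profile = token_percentiles(list(range(1, 1001)))  # synthetic sample of 1,000 traces
small = token_percentiles([120, 340, 560, 900])    # too few traces for a stable tail
```

With only four traces the function still returns numbers, which is exactly the risk: the flag, not the percentile values, tells you whether the tail estimates deserve trust.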
View 2: Trace Event Schema
Internal name: trace_event_schema (registry: Trace Event Schema v1)
Body reference: “trace event schema” or “request event fields” (Part 1 Ch 3, Part 5 Ch 25)
What it does: Defines the minimum fields per inference request needed for economics-grade analysis. This is a schema definition, not a dashboard.
Minimum fields per event:
| Field | Type | Purpose |
|---|---|---|
| request_id | string | Deduplicate and join |
| timestamp | ISO 8601 | Ordering and window assignment |
| workload_id | string | Attribution |
| model_id | string | Pricing lookup |
| provider | string | Invoice join |
| input_tokens | int | Cost numerator |
| output_tokens | int | Cost numerator |
| cached_input_tokens | int | Cache economics |
| ttft_ms | float | SLO gate |
| e2e_ms | float | SLO gate |
| quality_label | enum | Accepted-work denominator |
| is_retry | bool | Retry cost allocation |
| cache_hit | bool | Cache hit rate |
| route_id | string | Multi-source attribution |
| cost_usd | float | Trace-derived cost |
When not to trust it: If cached_input_tokens is missing or always zero,
cache economics calculations will be wrong. If quality_label is missing, the
calculator cannot distinguish accepted work from raw throughput.
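One way to picture the schema is as a typed record with a small check for the two failure modes just described. This is a sketch under assumptions: the class TraceEvent and the helper schema_warnings are hypothetical names, quality_label is modeled as an optional string rather than a real enum, and the example values are synthetic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TraceEvent:
    """One inference request, mirroring the minimum fields listed above."""
    request_id: str
    timestamp: str                      # ISO 8601
    workload_id: str
    model_id: str
    provider: str
    input_tokens: int
    output_tokens: int
    cached_input_tokens: Optional[int]  # None when the serving layer does not report it
    ttft_ms: float
    e2e_ms: float
    quality_label: Optional[str]        # None when no eval label was attached
    is_retry: bool
    cache_hit: bool
    route_id: str
    cost_usd: float


def schema_warnings(events: list[TraceEvent]) -> list[str]:
    """Flag the two failure modes named in the 'when not to trust it' note."""
    warnings = []
    if events and all(not e.cached_input_tokens for e in events):
        warnings.append("cached_input_tokens is missing or always zero")
    if any(e.quality_label is None for e in events):
        warnings.append("quality_label is missing on at least one event")
    return warnings


# Synthetic event for illustration only.
example = TraceEvent(
    request_id="req-001", timestamp="2026-05-12T09:30:00Z", workload_id="support-answer",
    model_id="model-x", provider="provider-y", input_tokens=1800, output_tokens=240,
    cached_input_tokens=0, ttft_ms=410.0, e2e_ms=2950.0, quality_label=None,
    is_retry=False, cache_hit=False, route_id="route-a", cost_usd=0.0123,
)
found = schema_warnings([example])
```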
View 3: Latency Decomposition
Internal name: latency_decomposition (registry: Latency Decomposition v1)
Body reference: “latency decomposition” or “timing breakdown” (Part 2 Ch 4-7)
What it does: Breaks end-to-end latency into components: queue time, TTFT (prefill), inter-token latency (decode), tool execution, retrieval, and orchestration overhead.
Formula: T_e2e = T_queue + T_ttft + (T_tpot × N_out) + T_tool + T_retrieval + T_orchestration.
Output: Per-request timing breakdown showing which component dominates.
When not to trust it: If the serving engine does not expose queue time separately, the calculator cannot distinguish queueing from prefill. If tool calls are asynchronous, T_tool may overlap with the decode term (T_tpot × N_out), so the components no longer sum to T_e2e.
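The decomposition formula translates directly into code. The sketch below assumes all components are measured in milliseconds and are strictly sequential; the function name decompose_latency and the sample timings are illustrative, not taken from the calculator.

```python
def decompose_latency(queue_ms, ttft_ms, tpot_ms, n_out,
                      tool_ms=0.0, retrieval_ms=0.0, orchestration_ms=0.0):
    """Apply T_e2e = T_queue + T_ttft + (T_tpot x N_out) + T_tool + T_retrieval + T_orchestration."""
    components = {
        "queue": queue_ms,
        "ttft": ttft_ms,
        "decode": tpot_ms * n_out,  # per-token latency times output length
        "tool": tool_ms,
        "retrieval": retrieval_ms,
        "orchestration": orchestration_ms,
    }
    e2e_ms = sum(components.values())
    dominant = max(components, key=components.get)  # component with the largest share
    return e2e_ms, dominant, components


# Synthetic request: 100 output tokens at 20 ms each, plus one 400 ms tool call.
e2e_ms, dominant, parts = decompose_latency(
    queue_ms=50, ttft_ms=300, tpot_ms=20, n_out=100, tool_ms=400
)
```

In this synthetic request decode accounts for 2,000 of 2,750 ms, so shortening the output would move end-to-end latency more than tuning the queue or prefill.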
View 4: SLO-to-Route Mapping
Internal name: slo_to_route_mapping (registry: SLO-to-Route Mapping v1)
Body reference: “SLO-to-route mapping” or “routing policy” (Part 3 Ch 10, Part 4 Ch 20)
What it does: Maps each workload’s SLO constraints to eligible routes. A route is eligible only if it passes all gates: latency, quality, security, and cost.
Inputs: Workload profile (View 1), route candidates with measured latency and quality, security constraints.
Decision rule: A route is eligible if measured TTFT p95 ≤ ttft_slo_ms, measured E2E p95 ≤ e2e_slo_ms, measured eval pass rate ≥ quality_floor, and the route satisfies every security gate in the workload profile.
Output: Eligible route set per workload, ranked by cost per accepted work.
When not to trust it: If latency was measured under synthetic load and production concurrency is higher, production tail latency (p95 and p99) will be worse than measured. If quality was measured on a different prompt distribution, eval pass rate may not transfer.
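The decision rule is a filter followed by a sort. Here is a minimal sketch, assuming each security gate has already been evaluated upstream and collapsed into one boolean; RouteCandidate, eligible_routes, and every number in the candidate list are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteCandidate:
    route_id: str
    ttft_p95_ms: float
    e2e_p95_ms: float
    eval_pass_rate: float
    passes_security: bool          # True only if every security gate passed
    cost_per_accepted_usd: float


def eligible_routes(candidates, ttft_slo_ms, e2e_slo_ms, quality_floor):
    """Keep routes that pass every gate, then rank by cost per accepted work."""
    passing = [
        r for r in candidates
        if r.passes_security
        and r.ttft_p95_ms <= ttft_slo_ms
        and r.e2e_p95_ms <= e2e_slo_ms
        and r.eval_pass_rate >= quality_floor
    ]
    return sorted(passing, key=lambda r: r.cost_per_accepted_usd)


candidates = [
    RouteCandidate("route-a", 350, 2400, 0.72, True, 0.11),   # fails quality floor
    RouteCandidate("route-b", 420, 2800, 0.91, True, 0.16),
    RouteCandidate("route-c", 300, 2000, 0.93, False, 0.09),  # fails security
    RouteCandidate("route-d", 480, 2900, 0.90, True, 0.14),
]
ranked = eligible_routes(candidates, ttft_slo_ms=500, e2e_slo_ms=3000, quality_floor=0.85)
```

The two cheapest candidates drop out before cost is ever compared, which is the point of gating first and ranking second.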
View 5: Cost Per Accepted Work (LCPR)
Internal name: cost_per_accepted_work (registry: Cost Per Accepted Work v1; also exposed in the app as LCPR via the Compare tab)
Body reference: LCPR (Part 1 Ch 1-2, every worked example)
What it does: Calculates loaded cost per result, normalized to accepted output.
Formula (Derivation 6): LCPR = (C_inference + C_eval + C_human + C_ops + delta) / A
Where: C_inference is the sum of per-request costs from traces (first attempt + retry + repair, with cache pricing applied); C_eval is the eval-grader cost (LLM grader calls and rubric checks); C_human is human review, escalation, and repair labor; C_ops is amortized operational overhead (monitoring, on-call, deployment) allocated to the workload; delta = C_invoice − C_trace is the discrepancy between trace-derived cost and the provider invoice for the same period; and A is accepted work units (requests passing quality, latency, and reliability gates).
What the result means: LCPR is the real unit cost your business operates on. If LCPR exceeds revenue per accepted unit, the workload loses money regardless of what the token price page says.
When not to trust it: If human escalation cost is excluded, LCPR understates the real cost. If accepted work is defined loosely (no quality gate), LCPR overstates efficiency. If the invoice window does not match the trace window, delta will be noisy.
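As a worked instance of the LCPR formula, the sketch below plugs in hypothetical daily figures. The function name lcpr and every dollar amount are assumptions chosen for round arithmetic; none of them come from the worked-example seeds.

```python
def lcpr(c_inference, c_eval, c_human, c_ops, c_invoice, c_trace, accepted):
    """LCPR = (C_inference + C_eval + C_human + C_ops + delta) / A, with delta = C_invoice - C_trace."""
    if accepted <= 0:
        raise ValueError("accepted work units must be positive")
    delta = c_invoice - c_trace  # invoice-vs-trace discrepancy for the same period
    return (c_inference + c_eval + c_human + c_ops + delta) / accepted


# Hypothetical daily figures in USD.
cost_per_result = lcpr(
    c_inference=500.0, c_eval=100.0, c_human=3000.0, c_ops=380.0,
    c_invoice=520.0, c_trace=500.0, accepted=20_000,
)
# Same inputs with human escalation cost left out, as a comparison point.
without_human = lcpr(500.0, 100.0, 0.0, 380.0, 520.0, 500.0, 20_000)
```

With these inputs the loaded cost is $0.20 per accepted result; leaving out human escalation would report $0.05, which illustrates the understatement the note above warns about.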
View 6: Spend Movement Waterfall
Internal name: spend_movement_waterfall (registry: Spend Movement Waterfall v1)
Body reference: “spend movement waterfall” or “month-over-month spend” (Part 5 Ch 26)
What it does: Decomposes month-over-month spend change into contributing factors: volume, mix, price, cache rate, quality, retry rate, and new workloads.
Output: A waterfall showing how much of the spend change is explained by each factor. The unexplained residual is the investigation target.
When not to trust it: If workload attribution is inconsistent across months, the mix column will absorb real changes as apparent mix shifts.
View 7: Commitment Utilization
Internal name: commitment_utilization (registry: Commitment Utilization v1)
Body reference: “commitment utilization” or “credit coverage” (Part 4 Ch 18, Part 5 Ch 26)
What it does: Tracks committed spend against actual usage. Shows burn rate, projected exhaustion date, overage exposure, and utilization percentage.
When not to trust it: If credits apply to some models but not others, and the calculator does not model credit scope, utilization will be wrong.
View 8: Variance Analysis
Internal name: variance_drilldown (registry: Variance Drilldown v1)
Body reference: “variance analysis” or “cost/latency root cause” (Part 5 Ch 25, 26)
What it does: Identifies the top contributors to cost or latency variance between periods or between expected and actual. Decomposes variance into rate (price change), volume (traffic change), mix (workload shift), efficiency (cache rate, retry rate, output length), and quality (eval pass rate change).
When not to trust it: If traces have inconsistent model aliases across periods, rate variance will appear where none exists.
View 9: Account Margin Model
Internal name: account_margin_model (registry: Account Margin Model v1)
Body reference: “account margin model” or “account economics” (Part 5 Ch 25, 26)
What it does: Joins LCPR, account revenue, support cost, and operational overhead to calculate gross margin per account or per workload.
Formula (Derivation 6 extended): margin = (R_account − LCPR × A − C_support − C_overhead_allocated) / R_account, where R_account is account revenue, A is accepted work units, C_support is account-specific support cost, and C_overhead_allocated is the allocated share of platform overhead.
When not to trust it: If account revenue includes bundled products that are not inference-related, margin attribution is ambiguous.
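The margin formula can be sketched as a single function. The name account_margin and the sample figures are hypothetical; the calculation follows the formula above term by term.

```python
def account_margin(r_account, lcpr, accepted, c_support, c_overhead_allocated):
    """margin = (R_account - LCPR x A - C_support - C_overhead_allocated) / R_account."""
    if r_account <= 0:
        raise ValueError("account revenue must be positive")
    loaded_cost = lcpr * accepted
    return (r_account - loaded_cost - c_support - c_overhead_allocated) / r_account


# Hypothetical monthly figures in USD.
margin = account_margin(
    r_account=10_000.0, lcpr=0.20, accepted=20_000,
    c_support=1_000.0, c_overhead_allocated=500.0,
)
```

Here the account keeps a 45% gross margin: $4,000 of loaded inference cost plus $1,500 of support and overhead against $10,000 of revenue.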
View 10: Usage Signals
Internal name: usage_signals (registry: Usage Signals v1)
Body reference: “usage signals” or “expansion/risk signals” (Part 5 Ch 26)
What it does: Tracks leading indicators of workload health: volume trend, cache hit rate trend, retry rate trend, p95 latency trend, eval drift, context length growth, and provider error rate.
When not to trust it: If telemetry sampling rate changes between periods, volume trends will reflect sampling, not usage.
View 11: Security and Compliance Filter
Internal name: security_compliance_filter (registry: Security and Compliance Filter v1)
Body reference: “security/compliance route filter” (Part 3 Ch 10, Part 4 Ch 21)
What it does: Evaluates each candidate route against security constraints before any performance or cost comparison. A route that fails any security gate is ineligible regardless of price.
Gates: data residency, zero-data-retention, private networking, model license, logging/audit requirements, regulatory scope (HIPAA, SOC2, etc.).
When not to trust it: If security requirements are assumed rather than confirmed with the compliance team, the filter may be too permissive or too restrictive.
View 12: Cache Break-Even
Internal name: cache_break_even (registry: Cache Policy Gate v1)
Body reference: “cache break-even analysis” or “cache eligibility” (Part 2 Ch 7)
What it does: Calculates how many requests must reuse a cached prefix within TTL for caching to save money.
Formula (Derivation 3): N_break_even = P_write / (P_uncached − P_read), where P_write is the cache write premium per million input tokens, P_uncached is the standard input price per million tokens, and P_read is the cache read price per million tokens. Cache saves money only when the prefix is reused at least N_break_even times within TTL.
Inputs: Cache write price, cache read price, uncached input price, TTL, measured inter-request gap, prefix stability.
What the result means: If your measured reuse count within TTL is below break-even, caching costs more than it saves.
When not to trust it: If the prefix contains any volatile content (timestamps, request IDs, retrieval results before the static prefix), the effective hit rate will be lower than measured.
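The break-even formula reduces to one division plus a guard for the case where cached reads are not cheaper than uncached input. The sketch below uses hypothetical prices per million tokens; the function names and all three prices are assumptions, not values from any source snapshot.

```python
import math


def cache_break_even_reuses(write_premium_per_mtok, uncached_per_mtok, read_per_mtok):
    """N_break_even = P_write / (P_uncached - P_read)."""
    saving_per_reuse = uncached_per_mtok - read_per_mtok
    if saving_per_reuse <= 0:
        return math.inf  # cached reads are not cheaper, so caching never pays back
    return write_premium_per_mtok / saving_per_reuse


def caching_saves_money(measured_reuses_within_ttl, **prices):
    """Compare the measured reuse count within TTL against break-even."""
    return measured_reuses_within_ttl >= cache_break_even_reuses(**prices)


# Hypothetical prices in USD per million input tokens.
prices = dict(write_premium_per_mtok=0.75, uncached_per_mtok=3.00, read_per_mtok=0.30)
n_break_even = cache_break_even_reuses(**prices)
```

At these prices a prefix pays back its write premium after roughly 0.28 reuses, so even a single reuse within TTL is net positive; a smaller read discount or a larger write premium pushes the threshold up quickly.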
View 13: KV Memory Sizing
Internal name: kv_memory_sizing (registry: KV Capacity Envelope v1)
Body reference: “KV memory sizing” or “context-length capacity” (Part 2 Ch 6)
What it does: Calculates maximum concurrent sequences at a given context length on a given GPU.
Formula (Derivation 2): N_max = (HBM_total − M_weights − M_activations) / (2 × L × H_kv × d_head × ctx × bytes_per_elem), where HBM_total is per-GPU memory, M_weights is model weight memory, M_activations is per-sequence activation reserve, L is the number of transformer layers, H_kv is the number of KV heads (after GQA), d_head is per-head dimension, ctx is sequence context length, and bytes_per_elem reflects KV dtype (fp16 = 2, fp8 = 1).
What the result means: This is the hard ceiling on concurrent conversations for a given model, hardware, and context length. The production-safe limit is lower after accounting for headroom, burst, and preemption avoidance.
When not to trust it: If the model uses MLA, sliding-window attention, or cross-request prefix sharing, the formula overestimates per-token KV cost. If beam search or n-sampling is active, live sequences multiply beyond user-facing concurrency.
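A worked instance makes the sizing formula concrete. The model shape and memory figures below are hypothetical (an 80 GiB GPU, 16 GiB of weights, a 4 GiB activation reserve, 32 layers, 8 KV heads, head dimension 128, 8,192-token context) and do not describe any specific model or GPU; the function names are illustrative.

```python
GIB = 2**30


def kv_bytes_per_sequence(n_layers, n_kv_heads, d_head, ctx_tokens, bytes_per_elem):
    """KV footprint of one sequence; the factor of 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * d_head * ctx_tokens * bytes_per_elem


def max_concurrent_sequences(hbm_total, m_weights, m_activations, **kv_shape):
    """N_max = (HBM_total - M_weights - M_activations) / per-sequence KV bytes, rounded down."""
    free = hbm_total - m_weights - m_activations
    if free <= 0:
        return 0
    return free // kv_bytes_per_sequence(**kv_shape)


# Hypothetical model shape with an fp16 KV cache (2 bytes per element).
shape = dict(n_layers=32, n_kv_heads=8, d_head=128, ctx_tokens=8192, bytes_per_elem=2)
n_max_fp16 = max_concurrent_sequences(80 * GIB, 16 * GIB, 4 * GIB, **shape)
n_max_fp8 = max_concurrent_sequences(80 * GIB, 16 * GIB, 4 * GIB,
                                     **{**shape, "bytes_per_elem": 1})
```

Each sequence needs exactly 1 GiB of KV cache in this configuration, so 60 GiB of free memory holds 60 sequences at fp16 and 120 at fp8. These are ceilings before any headroom is reserved.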
View 14: Dedicated Break-Even
Internal name: dedicated_break_even (registry: Dedicated Utilization Gate v1)
Body reference: “dedicated capacity break-even” (Part 4 Ch 18)
What it does: Calculates the minimum utilization at which dedicated capacity is cheaper than serverless.
Formula (Derivation 4): u_required = (C_dedicated_hourly + C_ops_hourly) / (G_slo × P_serverless × 3600), where C_dedicated_hourly is the dedicated capacity hourly rate, C_ops_hourly is allocated operational overhead per hour, G_slo is goodput in accepted requests per second under the SLO, and P_serverless is the equivalent serverless price per accepted request. Dedicated is cheaper only when actual utilization exceeds u_required.
What the result means: If your actual weekly-average utilization is below u_required, stay serverless or use hybrid (base dedicated + burst serverless).
When not to trust it: If goodput (G_slo) was measured with synthetic traffic that does not match production prompt/output distributions, the break-even will be optimistic. If operational overhead (C_ops) excludes on-call, deployment tooling, and incident response, the gate is too permissive.
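The utilization gate can be sketched in a few lines. All inputs below are hypothetical: a $40/hour dedicated rate, $10/hour of allocated operations cost, goodput of 5 accepted requests per second, and a serverless price of $0.005 per accepted request. The function name required_utilization is illustrative.

```python
def required_utilization(c_dedicated_hourly, c_ops_hourly, g_slo_rps, p_serverless_per_request):
    """u_required = (C_dedicated_hourly + C_ops_hourly) / (G_slo x P_serverless x 3600)."""
    serverless_cost_at_full_load = g_slo_rps * p_serverless_per_request * 3600
    if serverless_cost_at_full_load <= 0:
        raise ValueError("goodput and serverless price must be positive")
    return (c_dedicated_hourly + c_ops_hourly) / serverless_cost_at_full_load


u_required = required_utilization(
    c_dedicated_hourly=40.0, c_ops_hourly=10.0,
    g_slo_rps=5.0, p_serverless_per_request=0.005,
)
dedicated_is_cheaper = 0.40 > u_required  # hypothetical measured utilization of 40%
```

At full load the same traffic would cost $90/hour on serverless, so dedicated capacity at $50/hour breaks even near 56% utilization; a workload averaging 40% stays below the gate.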
A.3 Running The Worked Examples
The companion repo contains three worked examples. Each has a calculator seed that provides the inputs and a workload identity file that describes the workload. To run an example:
1. Copy the seed file.
2. Fill null fields from source-snapshots/2026-05-12/providers.yaml or your own pricing data.
3. Load the seed into the calculator.
4. Review the views listed in the seed’s calculator_views field.
Example 1: Support Answer trace-to-loaded-cost
Seed: examples/support-answer.trace-margin.v1/calculator-seed.yaml
Question it answers: What does one accepted support answer actually cost after retries, eval failures, cache misses, and human escalation?
Views exercised: LCPR, trace-to-loaded-cost review, cache policy gate.
What to look for:
- The gap between naive cost (inference cost / total queries) and loaded cost (LCPR). The seed mirrors the opener fixture: a regional bank answer-drafting workload at ~30,247 first-attempt queries per workday, 28,674 accepted answers, $503/day inference, $5,287/day loaded cost, $0.184 LCPR, $0.0166 naive per-query cost, and a roughly 11x loaded-to-naive ratio.
- The cache hit rate delta. The seed shows expected 60% vs actual 35%. The cache break-even view shows whether caching is still net positive at 35%.
- Eval pass rate. Roughly one in four attempts does not produce accepted work, but every attempt costs money.
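The headline figures for this example can be reproduced from the four daily totals quoted in the seed description. The sketch below only restates that arithmetic; the variable names are illustrative.

```python
# Daily totals quoted for the Support Answer seed.
first_attempt_queries = 30_247
accepted_answers = 28_674
inference_usd_per_day = 503.0
loaded_usd_per_day = 5_287.0

# Naive view: inference spend spread over every first-attempt query.
naive_cost_per_query = inference_usd_per_day / first_attempt_queries

# Loaded view: all costs spread over accepted answers only.
loaded_cost_per_result = loaded_usd_per_day / accepted_answers

loaded_to_naive_ratio = loaded_cost_per_result / naive_cost_per_query
```

The naive figure rounds to $0.0166, the loaded figure to $0.184, and the ratio between them comes out near 11.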
Example 2: Coding Agent Task Lifecycle
Seed: examples/coding-agent.lifecycle.v1/calculator-seed.yaml
Question it answers: What does one accepted bug fix cost across an entire agent session with sub-agent calls, tool loops, compaction, and retries?
Views exercised: LCPR, cache policy gate, goodput frontier test.
What to look for:
- Total input tokens processed (~3.5M) vs output tokens (~85K). The input-to-output ratio is roughly 41:1. Input cost dominates unless cache hit rate is high.
- Cache hit rate difference between main agent (82%) and sub-agents (45%). Sub-agents start with fresh context and cannot reuse the main agent’s prefix.
- Compaction events as cost events. Each compaction discards cached context and forces a cache write on the next turn.
- The overall acceptance rate (90%) includes first-pass success (65%) plus repaired tasks (25%). The 10% that require manual takeover have human cost outside the inference bill.
Example 3: Benchmark Audit
Seed: examples/support-rag-answer-drafting.audit.v1/calculator-seed.yaml
Question it answers: Does the benchmark winner change when methodology errors are corrected?
Views exercised: goodput frontier test, routefit matrix, LCPR.
What to look for:
- The naive benchmark uses closed-loop arrivals, cold cache, excluded cold starts, no retries, and mean latency only. The corrected benchmark uses Poisson arrivals, warm cache, included cold starts, retry policy, and p99 latency.
- Route A wins on mean throughput. Route B wins on goodput (accepted work per second under SLO). The missing metadata checklist in the seed documents exactly which methodology gaps created the false winner.
- The eval pass rate gap: 72% (Route A) vs 91% (Route B). Quality is the largest contributor to the cost-per-accepted-work reversal.
A.4 Source Snapshot Schema
Provider prices belong in dated snapshots, not in body prose. The
snapshot schema lives in source-snapshots/<YYYY-MM-DD>/providers.yaml (alongside hardware.yaml, cache-semantics.yaml, benchmark-sources.yaml, and model-licenses.yaml).
Required fields per provider entry:
| Field | Type | Purpose |
|---|---|---|
| provider | string | Provider name |
| model | string | Model identifier |
| pricing_tier | string | online / batch / dedicated |
| input_per_mtok | float | USD per million input tokens |
| output_per_mtok | float | USD per million output tokens |
| cache_write_per_mtok | float or null | Cache write premium |
| cache_read_per_mtok | float or null | Cache read price |
| cache_ttl | string or null | TTL specification |
| context_window | int | Maximum context length |
| batch_discount_pct | float or null | Batch API discount |
| source_url | string | Official pricing page URL |
| accessed_date | string | ISO 8601 date of last verification |
| notes | string or null | Semantic caveats |
Refresh protocol:
1. Before any publication event, visit each source_url and verify prices.
2. If a price has changed, create a new snapshot directory with the current date.
3. Do not overwrite old snapshots. They are the historical record.
4. Update calculator seeds to reference the new snapshot.
Evidence type for snapshot data: PUBLIC when prices come from official pricing pages. If a price is inferred, modeled, reported by a third party, or unavailable, label it explicitly and add a note explaining the evidence source.
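A provider entry can be checked against the field table before it is committed. The sketch below assumes the entry has already been loaded from providers.yaml into a Python dict; validate_entry, the two field maps, and the sample entry (including its provider name, prices, and URL) are hypothetical. It accepts integers where the table says float, since YAML loaders commonly parse whole-number prices as integers.

```python
from datetime import date

NUMBER = (int, float)

# Fields that must be present and non-null, per the table above.
REQUIRED_FIELDS = {
    "provider": str, "model": str, "pricing_tier": str,
    "input_per_mtok": NUMBER, "output_per_mtok": NUMBER,
    "context_window": int, "source_url": str, "accessed_date": str,
}
# Fields the table marks as "or null".
NULLABLE_FIELDS = {
    "cache_write_per_mtok": NUMBER, "cache_read_per_mtok": NUMBER,
    "cache_ttl": str, "batch_discount_pct": NUMBER, "notes": str,
}


def validate_entry(entry: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry matches the schema."""
    problems = []
    for name, kind in REQUIRED_FIELDS.items():
        if not isinstance(entry.get(name), kind):
            problems.append(f"{name}: missing or wrong type")
    for name, kind in NULLABLE_FIELDS.items():
        value = entry.get(name)
        if value is not None and not isinstance(value, kind):
            problems.append(f"{name}: wrong type")
    if isinstance(entry.get("accessed_date"), str):
        try:
            date.fromisoformat(entry["accessed_date"])
        except ValueError:
            problems.append("accessed_date: not an ISO 8601 date")
    return problems


# Hypothetical entry; none of these values come from a real pricing page.
good = {
    "provider": "example-provider", "model": "example-model", "pricing_tier": "online",
    "input_per_mtok": 3.0, "output_per_mtok": 15.0, "cache_write_per_mtok": 0.75,
    "cache_read_per_mtok": 0.30, "cache_ttl": "5m", "context_window": 200_000,
    "batch_discount_pct": None, "source_url": "https://example.com/pricing",
    "accessed_date": "2026-05-12", "notes": None,
}
bad = {**good, "input_per_mtok": None, "accessed_date": "12 May 2026"}
```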
A.5 Formulas Reference
The six derivations in the body are the mathematical spine of the calculator. Each derivation kills a wrong conclusion that survives in naive analysis.
Derivation 1: Batch Amortization / Roofline (Part 2, Ch 4-5)
Wrong conclusion: “GPU-hour divided by peak tokens/sec is my cost.”
Calculator views: goodput frontier test, serving physics lens.
Derivation 2: KV Cache Memory Sizing (Part 2, Ch 6)
Wrong conclusion: “Context length is just a model setting.”
Calculator view: kv capacity envelope.
Derivation 3: Prompt-Cache Break-Even (Part 2, Ch 7)
Wrong conclusion: “Caching is a discount.”
Calculator view: cache policy gate.
Derivation 4: Dedicated Utilization (Part 4, Ch 18)
Wrong conclusion: “Dedicated is cheaper because the hourly rate looks low.”
Calculator view: dedicated utilization gate.
Derivation 5: Goodput Under SLO (Part 5, Ch 24)
Wrong conclusion: “Peak throughput is capacity.”
Calculator view: goodput frontier test.
Derivation 6: Trace-to-Loaded-Cost Reconciliation (Part 5, Ch 25)
Wrong conclusion: “The provider bill is my margin model.”
Calculator view: trace-to-loaded-cost review.
A.6 Glossary
Terms are defined at first use in the body. This glossary collects them for reference.
Accepted work unit. A completed inference result that passes all quality, latency, and reliability gates. The denominator for LCPR. Defined in Part 1 Ch 2.
Batch size. The number of sequences processed simultaneously in one scheduler step. Larger batches amortize weight fetch but increase KV memory pressure and can degrade tail latency. Defined in Part 2 Ch 5.
Cache hit rate. The fraction of input tokens served from prefix cache rather than recomputed. An economic variable, not only a performance metric. Defined in Part 2 Ch 7.
Compaction. Summarizing or truncating an agent’s conversation history to fit within the context window. A cost event (discards cached context, forces cache write) and a quality event (summary may lose information). Defined in Part 3 Ch 12.
Decode. The autoregressive generation phase where output tokens are produced one at a time. Often memory-bandwidth-bound. Defined in Part 2 Ch 5.
Delta. The discrepancy between trace-derived cost and provider invoice for the same period. Investigate if delta exceeds 5%. Defined in Part 5 Ch 25.
E2E latency. End-to-end latency from request arrival to last output token delivered. Defined in Part 2.
Eval pass rate. The fraction of outputs passing the quality gate. Directly affects the accepted-work denominator and therefore LCPR. Defined in Part 5 Ch 23.
Goodput. Accepted requests per unit time, where “accepted” means passing latency, quality, and reliability gates. The correct capacity metric. Defined in Part 5 Ch 24.
GQA (Grouped Query Attention). An attention variant where multiple query heads share fewer KV heads, reducing KV cache memory. Defined in Part 2 Ch 6.
HBM (High Bandwidth Memory). GPU memory. Capacity determines how many weights and KV cache entries fit. Bandwidth determines how fast they can be read. Defined in Part 2 Ch 4.
KV cache. Key and value tensors stored per layer per token for each live sequence. The hidden memory bill of inference. Defined in Part 2 Ch 6.
LCPR (Loaded Cost Per Result). The book’s core economic unit. Total loaded cost (inference + invoice delta + evals + repairs + human escalation + ops overhead) divided by accepted work units. Defined in Part 1 Ch 2.
MoE (Mixture of Experts). A model architecture where only a subset of parameters is active per token, improving compute efficiency but creating routing, all-to-all communication, and memory-capacity tradeoffs. Defined in Part 2 Ch 8.
p99 latency. The latency at the 99th percentile. The tail that SLOs must account for. Averages hide p99. Defined in Part 2 Ch 4.
Prefill. The phase where all input tokens are processed in parallel to populate the KV cache. Compute-heavy. Defined in Part 2 Ch 5.
Prefix caching. Reusing KV cache entries from a previous request that shares the same prompt prefix. An economic mechanism when reuse exceeds break-even. Defined in Part 2 Ch 7.
Roofline. A model that identifies whether a computation is limited by compute throughput or memory bandwidth. Applied to inference to explain why cost changes with batch size. Defined in Part 2 Ch 4.
Route. A specific combination of provider, model, serving mode, and configuration used to serve a workload. Defined in Part 3 Ch 10.
Routefit matrix. The framework for mapping workload identity and SLO constraints to eligible routes, ranked by cost per accepted work. Defined in Part 4 Ch 15.
Serving physics. The hardware and software constraints that determine inference cost and latency: memory bandwidth, compute throughput, KV capacity, batch scheduling, queueing, and topology. Defined in Part 2.
SLO (Service Level Objective). The internal target for latency, quality, and reliability. Not the same as an SLA (contractual commitment to customers). Defined in Part 1.
Snapshot. A dated record of provider prices, hardware specs, or model availability. Snapshots are immutable. New observations create new snapshots. Defined in this appendix.
TPOT (Time Per Output Token). Latency per generated token during decode. TPOT times output length determines decode duration. Defined in Part 2 Ch 5.
Trace. A per-request record containing tokens, latency, cost, quality label, and routing metadata. The atomic unit of inference economics analysis. Defined in Part 1 Ch 3.
Trace autopsy. The diagnostic walkthrough from one trace to root cause. The book’s primary diagnostic framework. Defined in the Trace Autopsy article (see front matter cross-links).
Trace-to-margin review. The framework for joining traces, invoices, evals, and account data to calculate margin. The book’s reconciliation instrument. Defined in Part 5 Ch 25.
TTFT (Time To First Token). Latency from request arrival to first output token. Dominated by prefill computation and queue time. Defined in Part 2 Ch 5.
Utilization. The fraction of provisioned capacity that is doing productive work. High utilization can hide unproductive recomputation, allocation waste, or queueing. The correct metric is goodput, not utilization. Defined in Part 2 Ch 9.
Workload class. A category of inference work where the economics, SLOs, failure modes, and operating decisions differ enough to justify separate treatment. Defined in Part 3 Ch 10.
A.7 What Lives In The Repo
The companion repo extends this appendix with material that does not belong in static pages:
| Repo artifact | Purpose | Refresh cadence |
|---|---|---|
| | Dated provider pricing snapshots | Before any publication event |
| | Worked example seeds and workload identities | When examples are updated |
| | Calculator source code and tests | Ongoing |
| | Canonical app/view inventory and evidence status | When views change |
| | Planned YAML schemas for workload identity, trace event, calculator seed | When formal schemas are added |
| | Research notes, derivation drafts, and claim matrices | Archived after publication |
Evidence note [A-1]: All calculator formulas are DERIVED from assumptions shown in the body. All prices in source snapshots are PUBLIC. All worked example data is SYNTHETIC, shaped by public sources but not measured from production systems.
End of Appendix.
Colophon
The Inference Field Guide was set in three typefaces from Google Fonts: Instrument Serif (italic display, chapter titles), Newsreader (body), and JetBrains Mono (numerics, ruled labels, code).
- Paper: #faf5e9 (parchment, warm cream)
- Primary ink (moss): #3A4F2A (headlines, rules, mad-libs chips, primary data, ornaments)
- Secondary ink (oxblood): #5C2A1E (sidenotes, "Where this breaks" callouts, evidence tags)
- Display face: Instrument Serif (italic), for chapter titles and landing
- Body face: Newsreader (opsz 6–72), for running prose
- Mono face: JetBrains Mono, for data rows, variable chips, top-rule captions
- Built with: Pelican (Python static site generator). Source: github.com/Sohailm25/Sohailm25.github.io.
- Companion calculator: sohailmo.ai/book/calculator/, a Marimo notebook hosting the LCPR formulas, sensitivity sweeps, and break-even analysis used throughout this book.
- Last revised: 2026-05-18