Part 3 produced a routefit matrix: each workload identity mapped to candidate routes, with feasibility gates, estimated LCPR, and primary constraints. Part 4 takes each route column and subjects it to a harder question: what does it actually take to get there, what goes wrong along the way, and when should you turn back?
Migration is not one decision. “Move to open models” is a different decision from “move to dedicated endpoints,” which is a different decision from “self-manage GPUs,” which is a different decision from “run multi-source.” Each surface has its own economics, failure modes, operational costs, and reversion triggers. And the first option, staying where you are, is a decision too.
The chapters proceed roughly in order of operational commitment. The model candidate funnel and routefit matrix filter candidates before any migration begins. Do Nothing requires the least change. Serverless open models require prompt portability and eval work but no infrastructure ownership. Managed dedicated requires capacity planning and utilization discipline. Self-managed GPUs require an engineering team, on-call rotation, and hardware lifecycle management. Multi-source and background agent routing require measurement discipline across providers and latency tiers. The final chapter defines migration gates and reversion signals that apply to all transitions.
Chapter 15: The Model Candidate Funnel And Routefit Matrix
The field problem
Fourteen model-provider combinations. Three months of evaluation. At the end: a spreadsheet of benchmark scores, pricing tiers, and feature matrices. The cheapest option scoring above 80% on MMLU wins the selection. In production, the model fails the data residency requirement (discovered in month two), the output format drifts from the eval schema (discovered in week three of deployment), and the batch discount does not apply to the selected model variant (discovered on the first invoice).
The evaluation compared options that should have been eliminated in the first hour.
The mechanism
The model candidate funnel eliminates infeasible options before comparing cost. Most teams do this backwards: they start with economics and discover feasibility constraints late. The funnel inverts this.
Stage 1: Hard constraints filter. Eliminate candidates that cannot serve the workload regardless of price.
| Constraint | Question | Eliminates |
|---|---|---|
| Context length | Does the model support the workload’s p95 input length? | Models with insufficient context |
| Modality | Does the model handle required input/output types (text, vision, audio, structured)? | Text-only models for vision workloads |
| Data residency | Does the provider offer deployment in the required region? | Providers without regional presence |
| Model rights | Does the license or TOS permit the use case (training data, output ownership, competitive use)? | Models with restrictive licenses |
| Compliance | Does the provider hold required certifications (SOC2, HIPAA BAA, ISO 27001)? | Uncertified providers for regulated workloads |
| Rate limits | Does the provider’s rate limit cover the workload’s p95 request rate? | Providers that throttle the workload |
This stage should take hours, not weeks. The output is a short list, typically 3-6 candidates from the original 10-20.
Stage 2: Capability screen. Run 50-200 production-representative test cases through each surviving candidate. Not a general benchmark. Not a vibes check. Production-representative inputs with known-good outputs and deterministic quality checks where possible.
The capability screen answers: does this model produce acceptable output for this workload at the minimum quality floor? Candidates that fall below the quality floor are eliminated regardless of price. A model that costs 80% less but fails 30% of production cases is not cheaper. It is more expensive after repairs and escalations.
Stage 3: Economic evaluation. For surviving candidates, calculate LCPR per accepted output. Not token price. Not benchmark throughput. LCPR from traces: input tokens, output tokens, cache hit rate, quality gate cost, retry rate, and the accepted-work denominator.
This is where the calculator becomes essential. Run each candidate through the LCPR formula with workload-specific parameters from the workload identity schema (Part 3 Chapter 10).
Stage 4: Route mapping. The surviving candidates become columns in the routefit matrix.
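The stage order can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Candidate` fields, the `run_funnel` helper, the model names, and every number are hypothetical, and in practice the LCPR field would only be measured for candidates that survive Stage 2.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    meets_hard_constraints: bool  # Stage 1: context, modality, residency, rights, compliance, rate limits
    quality_pass_rate: float      # Stage 2: share of production-representative cases passed
    lcpr: float                   # Stage 3: measured USD per accepted output


def run_funnel(candidates, quality_floor):
    """Apply the stages in order; return survivors sorted by LCPR plus per-stage counts."""
    stage1 = [c for c in candidates if c.meets_hard_constraints]
    stage2 = [c for c in stage1 if c.quality_pass_rate >= quality_floor]
    stage3 = sorted(stage2, key=lambda c: c.lcpr)  # economics only for survivors
    counts = {
        "input": len(candidates),
        "after_constraints": len(stage1),
        "after_capability": len(stage2),
    }
    return stage3, counts


# Hypothetical candidates for one workload identity.
candidates = [
    Candidate("model-a", True, 0.93, 0.041),
    Candidate("model-b", False, 0.95, 0.020),  # cheapest on paper, fails a hard constraint
    Candidate("model-c", True, 0.70, 0.015),   # cheap, but below the quality floor
    Candidate("model-d", True, 0.90, 0.031),
]
survivors, counts = run_funnel(candidates, quality_floor=0.85)
print([c.name for c in survivors], counts)
```

The two cheapest candidates never reach the economic comparison, which is the point of running constraints and capability first. The counts dictionary doubles as the "candidate count at each funnel stage" measurement.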
The routefit matrix
The routefit matrix maps workload identities (rows) to serving routes (columns). Each cell is a measured LCPR per accepted output, not a pricing-page estimate. The illustration below shows a three-workload fleet at the end of a funnel pass:
| Workload identity | Closed API (baseline) | Serverless open | Managed dedicated | Self-managed |
|---|---|---|---|---|
| Support resolution (500K/mo, diurnal) | $0.041 | $0.031 | $0.034 | N/A |
| Document extraction (compliance-bound, 80K/mo) | $0.058 | — | — | N/A |
| Background classification (2M/day, stable) | $0.012 | $0.009 | $0.006 | $0.005 |
* N/A means the candidate was eliminated in Stage 1 (volume too low for dedicated). Em-dashes mean the route was eliminated by the funnel (failed feasibility or capability screen). Bold cells are the recommended route. The matrix makes the migration decision visible: which routes are feasible, which are cheapest, and which require operational investment.
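One way to hold the matrix in code is a nested mapping with `None` for eliminated or non-applicable cells. The sketch below uses the illustrative values from the table; the route keys and the `cheapest_feasible` helper are hypothetical names. Picking the lowest measured LCPR is only one possible recommendation rule, since the recommended route also depends on the operational investment a route requires.

```python
# USD per accepted output; None marks an eliminated or non-applicable cell.
routefit = {
    "Support resolution": {
        "closed_api": 0.041, "serverless_open": 0.031,
        "managed_dedicated": 0.034, "self_managed": None,
    },
    "Document extraction": {
        "closed_api": 0.058, "serverless_open": None,
        "managed_dedicated": None, "self_managed": None,
    },
    "Background classification": {
        "closed_api": 0.012, "serverless_open": 0.009,
        "managed_dedicated": 0.006, "self_managed": 0.005,
    },
}


def cheapest_feasible(matrix):
    """Return the lowest-LCPR route per workload, ignoring eliminated cells."""
    picks = {}
    for workload, routes in matrix.items():
        feasible = {route: cost for route, cost in routes.items() if cost is not None}
        picks[workload] = min(feasible, key=feasible.get)
    return picks


picks = cheapest_feasible(routefit)
for workload, route in picks.items():
    print(f"{workload}: {route}")
```

For the compliance-bound extraction workload the only feasible cell is the baseline, so the lookup returns the closed API without any cost comparison taking place.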
The naive answer
“Compare prices on the provider’s pricing page and pick the cheapest.”
Pricing pages show token rates. LCPR includes cache behavior, quality gates, retry rates, and accepted-work denominators. The cheapest token rate is not always the cheapest LCPR. The funnel exists to prevent the team from optimizing the wrong metric.
The better model
Run the funnel before the spreadsheet. The funnel is fast: Stage 1 takes hours (constraint checks), Stage 2 takes days (eval runs), Stage 3 takes a week (instrumented traffic). The routefit matrix is the output. Every migration decision in the chapters that follow uses the matrix as its input.
Do not compare economics until feasibility and capability are confirmed. The funnel order (constraints, capability, economics, route mapping) prevents wasted evaluation effort. Update the routefit matrix quarterly, or when models, pricing, or workload shapes change materially.
What to measure
Candidate count at each funnel stage (how many eliminated where)
Time to complete each funnel stage
Quality floor pass rate by candidate on production-representative evals
LCPR per accepted output by candidate (from traces, not pricing pages)
routefit matrix coverage (which workload-route cells are measured vs estimated)
The funnel breaks when the eval set does not represent production traffic. A model that passes 200 curated test cases may fail on the long tail of real inputs. The funnel also breaks when Stage 1 constraints change after Stage 3 evaluation is complete. A new data residency requirement can invalidate weeks of evaluation work. Run Stage 1 against current and anticipated constraints.
Calculator hook
input template. One row per workload identity, one column per surviving route candidate. Each cell populates from LCPR with workload-specific parameters. The calculator flags cells with estimated rather than measured LCPR.
Chapter 16: Do Nothing Is A Decision
The field problem
Monthly bill: $85,000 across three workloads on a closed API. Back-of-envelope calculation: open models on serverless endpoints would cost $35,000. The migration starts. Three months later, it is incomplete. Two workloads moved successfully. The third, a compliance-sensitive document extraction pipeline, failed the data residency gate, the output format changed between model families, and the eval suite needed 400 new test cases. The migration consumed $120,000 in engineering time. Staying on the closed API for six more months while building the eval suite and resolving the compliance requirement would have been cheaper.
The mechanism
Staying on the current provider is a valid economic outcome if the total cost of migration (engineering time, eval development, prompt rewriting, quality regression risk, operational learning curve, and opportunity cost) exceeds the savings over the planning horizon.
The naive calculation compares token prices. The correct calculation compares:
payback period (months) = migration cost ÷ ((current LCPR − target LCPR) × monthly accepted volume)
If the payback period exceeds the planning horizon, or if the denominator is uncertain because the target LCPR has not been measured, Do Nothing is the rational choice until better data arrives.
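As a rough sketch, the payback calculation reads as migration cost divided by monthly LCPR savings. The function names below are hypothetical. The $120,000 migration cost comes from the field problem; the LCPR values, volume, and six-month horizon are illustrative assumptions.

```python
import math


def payback_months(migration_cost, lcpr_current, lcpr_target, monthly_accepted_volume):
    """Months until cumulative LCPR savings cover the one-time migration cost."""
    monthly_savings = (lcpr_current - lcpr_target) * monthly_accepted_volume
    if monthly_savings <= 0:
        return math.inf  # no LCPR gap, so the migration never pays back
    return migration_cost / monthly_savings


def should_do_nothing(payback, horizon_months, target_lcpr_measured):
    """Stay put if the target LCPR is unmeasured or payback exceeds the horizon."""
    return (not target_lcpr_measured) or payback > horizon_months


# Migration cost from the field problem; LCPR values and volume are hypothetical.
months = payback_months(120_000, lcpr_current=0.050, lcpr_target=0.040,
                        monthly_accepted_volume=1_000_000)
print(round(months, 1), should_do_nothing(months, horizon_months=6, target_lcpr_measured=True))
```

With these inputs the payback is about twelve months against a six-month horizon, so Do Nothing wins even though the LCPR gap is 20%.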
The naive answer
“We’re overpaying. We need to migrate immediately.”
Maybe. But the gap between the current provider’s token price and the candidate’s token price is not the migration savings. The migration savings are the gap between current LCPR and target LCPR, multiplied by accepted volume over the planning horizon, minus the migration cost. If the LCPR gap is 40% but migration consumes a team-quarter (typically 4-12 engineer-weeks depending on workflow complexity, eval maturity, and prompt-pipeline drift), the payback calculation matters more than the token-price comparison.
The better model
Do Nothing is the right decision when any of these conditions hold:
The workload is small. Under about $5,000/month in inference spend, migration engineering time may never pay back. Optimize prompts, enable caching, reduce output length, and revisit when the workload grows.
The eval suite does not exist yet. Without task-specific evals, you cannot measure whether the new route matches the old route’s quality. Migrating without evals is moving blind. Build the eval first; migration becomes a measured experiment rather than a bet.
Compliance or contractual constraints eliminate the target route. Data residency, model training rights, BAA requirements, or customer contracts may make the target route infeasible regardless of price. Verify feasibility before calculating savings.
The planning horizon is short. If the product may pivot, the workload may change shape, or the team may restructure within six months, locking in migration work creates sunk cost against an uncertain future.
The current provider is improving. Model quality, pricing, caching features, and latency improve continuously. The gap you measure today may narrow by the time migration completes.
Before starting a migration, calculate the payback period. If it exceeds 6 months with uncertain LCPR estimates, stay and invest in measurement instead: build the eval suite, instrument the traces, calculate current LCPR, and populate the routefit matrix. Those investments pay off regardless of the migration decision.
What to measure
Current LCPR per accepted output by workload (not token price)
Engineering cost estimate for migration (person-months)
Eval suite coverage and maturity
Compliance gates for candidate routes
Planning horizon for the product and team
Do Nothing becomes wrong when the current provider’s pricing, reliability, or quality degrades and the team has the eval suite and operational capacity to migrate. The risk of Do Nothing is lock-in and complacency: staying because migration seems hard, not because it is uneconomic. Review the decision quarterly. If the routefit matrix shows a feasible, measured, cheaper route, the payback period has shortened, and the team has evals—migrate.
Calculator hook
The do-nothing comparison takes current route vs candidate route, with migration cost amortized over the payback period. The sensitivity analysis shows how engineering cost assumptions change the payback period.
Chapter 17: Move Workloads To Serverless Open Models
The field problem
Token price drops 60% after migrating a support chat workload from a closed API to a serverless open-model endpoint. Two weeks later, customer satisfaction scores drop 8 points. The eval suite—built for the closed model’s output format—does not catch the new failure mode: the open model produces correct answers but in a different conversational style that confuses the downstream state machine. The “60% savings” is now a 60% cost reduction with a quality regression that increases human escalation by 12%.
The mechanism
Serverless open-model endpoints are the default first migration step for most teams because they require no infrastructure ownership, no capacity planning, and no GPU management. The economics change from “pay per token at the closed model’s rate” to “pay per token at the open model’s rate.”
The economic advantage of open models on serverless endpoints comes from three places.
Model size efficiency. A 70B open model may match a frontier closed model on many tasks at lower per-token cost. The smaller model processes fewer FLOPs per token, which translates to lower provider serving cost, which translates to lower token price.
Competition. Multiple providers serve the same open models — Llama, Mistral, Qwen, DeepSeek, GLM, and Kimi families are all multi-host. The catalog of serverless open-model providers in mid-2026 spans Anyscale Endpoints, Cerebras, DeepInfra, Fireworks, Groq, Hyperbolic, Replicate, and Together, plus the hyperscaler open-model surfaces (Bedrock for Llama, Vertex for several open families). Price competition across this set, not any single provider, is what pushes serverless open-model rates below what a sole-source provider would charge.
Caching and batch availability. Open-model serverless endpoints usually support the same caching and batch features as closed APIs. The Part 2 chapters on caching and batch economics apply unchanged.
The risks are:
Prompt portability. Prompts engineered for one model family may not transfer to another. System prompts, tool definitions, output format instructions, and few-shot examples often need rewriting. This is engineering work, not a configuration change.
Quality regression. The open model may score well on general benchmarks but fail on the specific task distribution. The model candidate funnel from Chapter 15 exists to catch this, but only if the eval set represents production traffic.
Eval coverage. Moving from a closed model to an open model changes the failure-mode distribution. New failure modes may not be covered by the existing eval suite.
The naive answer
“Open models are cheaper. Switch everything.”
Open models are cheaper per token. Whether they are cheaper per accepted output depends on quality match, repair rate, and human escalation rate. A model that costs 60% less per token but produces 20% more repair work may cost more per accepted output than the original.
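A small sketch shows how a lower token price can still lose per accepted output. The cost model here is a deliberately simplified stand-in, not the full LCPR formula: each request pays for one call, a fraction pays for a repair call, a fraction is escalated to a human, and total spend is divided by the accept rate. Every figure, including the $0.50 human handling cost, is hypothetical.

```python
def cost_per_accepted(call_cost, repair_rate, escalation_rate, human_cost, accept_rate):
    """Simplified USD per accepted output: calls plus repairs plus escalations."""
    spend_per_request = call_cost * (1 + repair_rate) + escalation_rate * human_cost
    return spend_per_request / accept_rate


# Hypothetical closed-API baseline.
closed = cost_per_accepted(call_cost=0.010, repair_rate=0.05,
                           escalation_rate=0.03, human_cost=0.50, accept_rate=0.95)

# Hypothetical open model: 60% cheaper per call, more repairs and escalations.
open_model = cost_per_accepted(call_cost=0.004, repair_rate=0.25,
                               escalation_rate=0.06, human_cost=0.50, accept_rate=0.90)

print(f"closed: ${closed:.4f}  open: ${open_model:.4f}")
```

In this illustration the escalation term dominates: doubling the share of requests that reach a human outweighs the per-call discount.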
The better model
Use the model candidate funnel from Chapter 15:
Feasibility filter: Does the open model support the required context length, modalities, output format, and license terms?
Capability screen: Does the open model pass the minimum quality floor on 50-200 production-representative test cases?
Economic evaluation: What is the LCPR per accepted output on the serverless open-model endpoint, including cache hit rate, retry rate, and quality gate cost?
Prompt portability assessment: How many engineering hours does prompt rewriting require? Include this in the migration cost.
The migration is economic when:
(LCPR_current − LCPR_serverless_open) × monthly_volume × payback_horizon > migration_cost
And the quality gate confirms that the quality floor is met.
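The two conditions can be combined into a single gate. This is a sketch under stated assumptions: the function name is hypothetical, and the six-month horizon and 0.85 quality floor are illustrative. The LCPR, volume, rewrite cost, and pass-rate inputs are the ones reported in this chapter's worked example.

```python
def migration_is_economic(lcpr_current, lcpr_candidate, monthly_volume,
                          payback_horizon_months, migration_cost,
                          quality_pass_rate, quality_floor):
    """True only if savings over the horizon exceed cost AND the quality floor is met."""
    savings = (lcpr_current - lcpr_candidate) * monthly_volume * payback_horizon_months
    return savings > migration_cost and quality_pass_rate >= quality_floor


decision = migration_is_economic(
    lcpr_current=0.041, lcpr_candidate=0.031, monthly_volume=500_000,
    payback_horizon_months=6,  # assumed horizon
    migration_cost=11_000,
    quality_pass_rate=0.86,
    quality_floor=0.85,        # assumed floor
)
print(decision)
```

Raising the assumed floor to 0.90 flips the result regardless of the savings, which is why the quality gate is evaluated alongside the economics rather than after them.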
Worked example
A 500K-resolution-per-month support workload migrated from a closed Sonnet-class API to a serverless open-model endpoint on a 70B Llama-class model. Token price gap: roughly 78%. Below are sample results from the 4-week canary at 35% traffic.
| Metric | Closed (baseline) | Serverless open (canary) |
|---|---|---|
| Cache hit rate | 0.71 | 0.79 (different prefix handling, simpler tool schema) |
| Quality pass rate | 0.91 | 0.86 |
| Repair rate | 5.2% | 9.8% |
| Repair success rate | 87% | 94% (the new model accepted repair instructions more cleanly) |
| Escalation rate | 3.1% | 4.9% |
| LCPR per accepted resolution | $0.041 | $0.031 |
| Prompt rewrite cost | — | ~$11K (lower than estimated; existing tool schemas ported cleanly) |
| Payback period at full ramp | — | ~2.2 months ($11K rewrite ÷ $5K/mo full-ramp savings; add 0.5-1 month of ramp-up before savings reach steady state) |
Two side-findings the migration deck did not predict: cache hit rate improved on the new endpoint (longer effective TTL plus a simpler tool schema), and repair success improved (the new model handled “your previous answer missed X” prompts more cleanly than the baseline). The token-price gap was misleading. The LCPR gap was real. The headline savings landed at about 24% ($0.041 to $0.031 per accepted resolution), not 78%. After payback the savings compound. The lesson the deck buried is that two of the dimensions the team was sure would degrade actually got better, which is what real migrations look like.
Serverless open models are economic when the LCPR gap (not the token price gap) exceeds migration cost within an acceptable payback period, and the quality floor is met. Always measure LCPR on the candidate route before committing. The model candidate funnel is the gate, not the pricing page.
What to measure
LCPR per accepted output on both routes (before and after migration)
Quality pass rate on production-representative evals
Prompt portability effort (engineering hours)
Cache hit rate on the new route (may differ due to different prefix handling)
Repair rate and escalation rate delta
Serverless open-model economics break when the workload needs capabilities the open model lacks: specific tool-call behavior, vision modalities not yet supported, output format adherence the model was not trained for, or compliance certifications the provider does not hold. Check the feasibility filter before the economics.
Also: serverless pricing changes. The competitive pressure that makes open-model serving cheap today may not persist if providers consolidate or if the model ecosystem shifts. Do not lock in a multi-year architecture assumption on today’s serverless prices.
Calculator hook
route comparison with migration cost amortization. Input: current route LCPR, candidate route LCPR, migration cost components. Output: payback period, monthly savings after payback, sensitivity to quality pass rate.
Chapter 18: Use Managed Dedicated
The field problem
Two million requests per day through a serverless open-model endpoint. Monthly bill: $42,000. An inference platform engineer's recommendation: price the same model on managed dedicated — same provider, same weights, but on reserved GPUs at a representative H100 reserved rate around $3.20/GPU-hour (mid-2026 reserved rates run roughly $2.95-$4.30 across CoreWeave, Lambda, RunPod, and Together). At 800 accepted requests per hour per GPU, the projection is $28,000/month. A 33% savings. Two GPUs are provisioned.
Three months later, the bill is $50,400/month, higher than serverless. The GPUs run 24/7 ($5,040/month each), but traffic is concentrated in 10 hours per day. During the other 14 hours, utilization averages 8%. The team is paying for idle hardware. The original projection assumed uniform utilization. Production has a pronounced diurnal pattern.
The mechanism
Managed dedicated endpoints convert token economics into capacity economics. Instead of paying per token, you pay per GPU-hour (or GPU-minute) for reserved hardware running your model. The provider manages the infrastructure; you manage the utilization.
The economic advantage: at sufficient utilization, the effective per-token cost is lower than serverless because you are buying wholesale capacity rather than retail tokens.
The economic risk: you pay for the GPU whether it is serving requests or sitting idle. Minimum replicas, warm-pool costs, and diurnal traffic patterns create utilization gaps that serverless does not have.
Derivation 4: The dedicated utilization gate
The wrong conclusion this derivation kills:
“Dedicated is cheaper because the hourly rate looks low.”
Dedicated capacity is cheaper only after utilization, SLO-compliant goodput, idle time, and operational overhead clear the threshold.
| Symbol | Unit | Meaning |
|---|---|---|
| C_hw | USD/hour | Hardware/endpoint cost per hour |
| R_min | replicas | Minimum running replicas (cannot scale to zero) |
| C_ops | USD/hour | Amortized operational overhead (monitoring, on-call, deployment) |
| G_slo | accepted/hour | Measured SLO-compliant goodput per replica |
| C_serverless | USD/accepted | Current serverless cost per accepted unit |
| u_required | fraction | Minimum utilization for dedicated to break even |
Dedicated is cheaper only when u_observed × G_slo × C_serverless > (C_hw × R_min) + C_ops, and only when the realized utilization leaves enough headroom for p99 latency. Equivalently, the break-even utilization u_required = ((C_hw × R_min) + C_ops) / (G_slo × C_serverless); if your trough utilization is below u_required for sustained windows, dedicated is the wrong shape.
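The gate translates directly into code. The sketch below follows the inequality and the break-even expression exactly as stated, with hypothetical function names. The inputs are the ones used in this chapter's worked example. It does not check p99 headroom, which has to be measured separately.

```python
def break_even_utilization(c_hw, r_min, c_ops, g_slo, c_serverless):
    """u_required = ((C_hw * R_min) + C_ops) / (G_slo * C_serverless)."""
    return (c_hw * r_min + c_ops) / (g_slo * c_serverless)


def dedicated_is_cheaper(u_observed, c_hw, r_min, c_ops, g_slo, c_serverless):
    """True when the serverless value of work served per hour exceeds the hourly dedicated cost."""
    return u_observed * g_slo * c_serverless > c_hw * r_min + c_ops


params = dict(c_hw=3.20, r_min=1, c_ops=0.50, g_slo=800, c_serverless=0.012)
u_required = break_even_utilization(**params)
print(f"break-even utilization: {u_required:.1%}")
print(dedicated_is_cheaper(0.60, **params), dedicated_is_cheaper(0.29, **params))
```

At 60% observed utilization the gate passes; at 29% it fails, which matches the diurnal pattern in the field problem.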
Worked example
Setup: Support answer drafting, considering dedicated vs serverless.
| Parameter | Serverless | Dedicated |
|---|---|---|
| Cost per accepted answer | $0.012 | — |
| GPU-hour cost | — | $3.20/hr (illustrative H100 reserved midpoint) |
| Min replicas | — | 1 |
| Ops overhead | — | $0.50/hr (monitoring, alerting) |
| Measured goodput | — | 800 accepted answers/hr under SLO |
At a blended cost of $3.70/hr (GPU plus ops overhead), the break-even utilization is $3.70 ÷ (800 × $0.012) = $3.70 ÷ $9.60, or roughly 39%. Dedicated monthly cost is fixed at $3.70/hr × 720 hr = $2,664. Serverless monthly cost = monthly volume × $0.012. Monthly volume = goodput × utilization-during-active-hours × active hours/month. The shape of the gate is what matters; with the same $0.50/hr ops overhead, the threshold drops to about 36% against a $2.95/hr reserved rate and rises to 50% against a $4.30/hr on-demand rate. Run the calculation against your contracted rate, not against any illustrative midpoint.
| Traffic pattern | Active hours/mo | Volume/mo | Avg utilization | Dedicated | Serverless | Winner |
|---|---|---|---|---|---|---|
| Stable 24/7 at 60% | 720 | 345,600 | 60% | $2,664 | $4,147 | Dedicated |
| 10hr/day at 70%, 14hr idle | 300 | 168,000 | 29% | $2,664 | $2,016 | Serverless |
| 8hr/day at 80%, 16hr idle | 240 | 153,600 | 27% | $2,664 | $1,843 | Serverless |
| Bursty with 4x peaks | Varies | Varies | Varies | $2,664 + burst | Elastic | Serverless |
The stable 24/7 workload saves roughly 36% on dedicated. The diurnal workloads cost 32-45% more on dedicated because the GPU runs 24/7 while traffic fills only 8-10 of 24 hours. The bursty workload is worse on dedicated because burst handling requires either over-provisioning or hybrid serverless fallback.
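The first three rows of the comparison can be reproduced with a short script, which makes it easy to substitute your own traffic shape. The helper name is hypothetical. The defaults are the worked example's parameters: 800 accepted answers/hr, $0.012 per accepted answer, $3.70/hr blended dedicated cost, and a 720-hour month.

```python
HOURS_PER_MONTH = 720


def monthly_comparison(active_hours, active_utilization, goodput=800,
                       serverless_cost=0.012, dedicated_hourly=3.70):
    """Return (volume, average utilization, dedicated USD, serverless USD) for one month."""
    volume = goodput * active_utilization * active_hours
    avg_utilization = active_utilization * active_hours / HOURS_PER_MONTH
    dedicated = dedicated_hourly * HOURS_PER_MONTH  # paid whether busy or idle
    serverless = volume * serverless_cost           # paid only for accepted work
    return volume, avg_utilization, dedicated, serverless


patterns = {
    "Stable 24/7 at 60%": (720, 0.60),
    "10hr/day at 70%": (300, 0.70),
    "8hr/day at 80%": (240, 0.80),
}
for name, (hours, util) in patterns.items():
    volume, avg_u, dedicated, serverless = monthly_comparison(hours, util)
    winner = "Dedicated" if dedicated < serverless else "Serverless"
    print(f"{name}: {volume:,.0f}/mo, {avg_u:.0%} avg, "
          f"${dedicated:,.0f} vs ${serverless:,.0f} -> {winner}")
```

Only the average utilization changes the outcome; the dedicated cost stays flat across all three patterns.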
The naive answer
“The GPU hourly rate is half the serverless cost per token. Dedicated is obviously cheaper.”
The GPU hourly rate is a wholesale price. Wholesale is cheaper only when you consume enough volume to fill the wholesale capacity. A restaurant supply store is cheaper per unit than a grocery store, but only if you use 50 pounds of flour before it expires. Dedicated endpoints are the restaurant supply store.
The better model
The dedicated decision has three layers:
1. Utilization gate. Calculate u_required = (GPU_hour_cost + ops_overhead) / (serverless_LCPR × measured_goodput). If your trough utilization (nights, weekends) is below u_required, pure dedicated costs more. Consider hybrid: base dedicated capacity for the floor, serverless burst for peaks.
2. Goodput measurement. Goodput (accepted outputs per GPU-hour under SLO) must be measured on your workload, not estimated from provider benchmarks. Provider benchmarks use synthetic prompts at optimal batch sizes. Your workload has a specific prompt/output distribution, cache hit rate, and quality gate pass rate. Measure goodput with your traffic.
3. Operational cost. Dedicated endpoints need monitoring, deployment pipelines, model updates, capacity changes, and incident response. If the team does not have these capabilities, operational overhead is higher than the $0.50/hour used in the example. If the provider manages everything, ops overhead is lower but the GPU-hour price is higher.
Calculate u_required from your actual serverless LCPR and your measured dedicated goodput. Plot your utilization by hour across a full week including nights and weekends. If trough utilization exceeds u_required with headroom for p99 latency, dedicated is economic. If trough utilization is below u_required, stay serverless or use a hybrid approach with a dedicated base and serverless overflow.
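The hybrid option can be explored with a simple search over the number of dedicated replicas. This is a sketch under several assumptions: the hourly demand profile is hypothetical, ops overhead is folded into a per-replica hourly cost of $3.70, replicas are assumed to serve up to 100% of measured goodput (no p99 headroom), and all overflow goes to serverless at $0.012 per accepted answer.

```python
def hybrid_cost(hourly_demand, n_replicas, goodput=800,
                dedicated_hourly=3.70, serverless_cost=0.012):
    """Cost of n always-on replicas plus serverless overflow for a demand profile."""
    capacity = n_replicas * goodput
    overflow = sum(max(0, demand - capacity) for demand in hourly_demand)
    return n_replicas * dedicated_hourly * len(hourly_demand) + overflow * serverless_cost


def best_split(hourly_demand, max_replicas, **kwargs):
    """Replica count (0 means pure serverless) that minimizes total cost."""
    return min(range(max_replicas + 1),
               key=lambda n: hybrid_cost(hourly_demand, n, **kwargs))


# Hypothetical day: 14 quiet hours at 400 req/hr, 10 busy hours at 2,000 req/hr.
demand = [400] * 14 + [2000] * 10
for n in range(4):
    print(n, round(hybrid_cost(demand, n), 2))
print("best replica count:", best_split(demand, max_replicas=4))
```

In this profile the second replica pays for itself during the ten busy hours, while a third would serve too little overflow to cover its 24-hour cost. A production version would cap usable capacity below measured goodput to protect tail latency.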
What to measure
Utilization by hour across a full week (not just business hours)
Goodput under SLO on the dedicated endpoint with your real traffic shape
Serverless LCPR per accepted unit from traces (not from pricing page arithmetic)
Operational overhead: monitoring, deployment, on-call, incident time
p99 latency at different utilization levels (high utilization can degrade tail latency)
Managed dedicated endpoints may have minimum billing windows (5 minutes for some Bedrock custom model units). Short bursts that spin up and tear down frequently can accumulate billing-window waste.
Cold start latency is not just a user-experience problem—it is a cost problem. If scaling from zero to one replica takes ten minutes, that capacity arrives after the traffic spike has passed, and the GPU-seconds spent warming up produce no accepted results. Techniques like CUDA context checkpointing and lazy container filesystem loading have reduced cold start from tens of minutes to under sixty seconds for models up to a few tens of gigabytes. [PUBLIC: Modal, “Truly Serverless GPUs,” 2026] The decision rule: if your cold start latency exceeds the duration of your typical traffic spike, serverless autoscaling cannot help you, and you must pay for always-on warm-pool capacity—which erodes the utilization advantage that motivated the dedicated deployment in the first place.
Provider-managed dedicated endpoints abstract the GPU but may not expose the tuning controls (batch size, scheduler policy, KV cache configuration) needed to achieve the goodput measured in your benchmark.
GPU-hour prices vary by provider and commitment duration. On H100 in May 2026, on-demand single-GPU rates run roughly $3.99/hr (Together), $4.29/hr (Lambda 1x), and $6-7/hr (Fireworks, CoreWeave), with reserved 91-180 day rates falling to roughly $2.95-3.50/hr at Together, CoreWeave, and several neo-clouds; Lambda and AWS require custom-negotiated reserved terms (provider pricing pages, accessed 2026-05-13). The utilization gate threshold moves with the rate: at the low end of the reserved range, dedicated wins on lower utilization; at the on-demand end, it requires sustained utilization above 50%. Plug in your contracted rate; do not run the calculation against any single posted price.
Calculator hook
The dedicated capacity break-even analysis takes serverless LCPR, dedicated GPU-hour cost, measured goodput, operational overhead, and hourly utilization pattern. It outputs the break-even utilization, monthly cost comparison, and the hybrid split point where base dedicated plus burst serverless is cheapest.
Chapter 19: Fine-Tuning and Post-Training as Cost Levers
The field problem
A support-automation team running a 70B Llama-class model on managed dedicated noticed that 38% of resolutions were going through a two-call repair loop: the first call produced an answer in the wrong tone, a second call rewrote it. The token-cost line was real. The deeper line was harder to see: every repair extended end-to-end latency by 1.8 seconds and added a quality-gate pass against the rewritten output, which itself failed 12% of the time. The team's first instinct was to switch to a larger model. The actual fix was a $4,200 LoRA adapter, trained on three months of approved-style production answers, that brought the first-call accept rate from 62% to 89%. Inference cost dropped 31%; latency dropped 1.4 seconds at p90.
The chapters so far have treated the route (serverless, managed dedicated, self-managed) as the cost lever. Model customization is a parallel lever that often does more.
The mechanism
Fine-tuning and post-training change the model itself rather than the route it runs on. A LoRA adapter — a small set of low-rank weights trained on top of a base model — can ship the same workload through a smaller base than the workload would otherwise need, or through fewer calls than the un-customized version requires. Both translate to lower cost per accepted output.
There are three training approaches worth distinguishing. Supervised fine-tuning (SFT) trains the model on labeled input/output pairs and is the default starting point: cheapest, most predictable, lands accuracy and format compliance gains. Direct Preference Optimization (DPO) trains on preference pairs (which output is better) and is the right lever when the workload has stylistic or judgment dimensions that labels cannot fully express. Group Relative Policy Optimization (GRPO) and similar reinforcement-from-verifier methods train against an automated grader and are useful when the verifier itself is reliable and the workload has multi-step reasoning or tool-use components that SFT/DPO cannot reach. SFT is cheap and predictable; DPO costs roughly 2-3x SFT for the same training run; GRPO is more expensive still and is the most likely to surprise on the downside if the verifier is wrong.
Multi-LoRA hosting changes the deployment economics. A single base model loaded on a managed dedicated endpoint can host dozens of LoRA adapters at the base-model serving price, swapping the adapter in per request. The fixed cost (GPU capacity) is paid once; the variable cost (per-adapter training and storage) is small. This is how a platform team supports 40 product teams' custom models without paying for 40 endpoints.
The naive answer
"Fine-tune everything. The token economics will get better."
Fine-tuning has an upfront cost, an iteration cost, and a sustained operating cost. Many workloads do not earn back the upfront cost. The teams that succeed at fine-tuning treat it as a cost lever for workloads that already have clear evals, a known failure mode, and enough volume for the training cost to amortize. The teams that fail at fine-tuning treat it as a substitute for prompt engineering or model selection.
The better model
Fine-tuning is a cost lever for accuracy-bound workloads and an accuracy lever for cost-bound workloads. Concretely:
If the workload is repair-rate or escalation-rate bound on a large base model, fine-tuning a smaller base model on the workload's distribution often closes the accuracy gap. The cost lever: serve the smaller fine-tuned model at lower per-token cost with similar accept rate.
If the workload is throughput-bound on a managed dedicated endpoint, a LoRA adapter that reduces output length or eliminates a repair loop directly reduces capacity demand. The cost lever: same GPU footprint, more accepted work per hour.
If the workload is one of many on a shared platform, multi-LoRA hosting is the deployment unlock. A 100+ adapter fleet on one base-model endpoint preserves base-model economics while letting each tenant tune for its own data distribution.
LoRA economics
Training cost is the most-quoted number and usually the least important. Posted LoRA training rates on managed providers run roughly $0.48 per million training tokens for base models up to 16B (Together's published tier), with 17-69B at roughly $1.50/M and 70-100B at roughly $2.90/M. Concretely: a 50K-example dataset at 1,500 tokens per example is 75M training tokens, which at $0.48/M trains a 7B-class LoRA for under $40 (about $36). The training cost is rarely the constraint.
The amortization math matters more. At a representative ~$30 training cost on an 8B LoRA, payback at 10K requests/day is under one day if the adapter reduces cost per accepted output by even 5%. At 1K requests/day payback stretches to a week or two. Below ~200 requests/day on the target workload, the training cost is small but the iteration cost (data curation, eval runs, ongoing maintenance) often exceeds the savings.
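The training-cost and amortization arithmetic can be sketched as two small functions. The function names are hypothetical, the training run is assumed to be a single pass over the data, and the $0.08 baseline cost per accepted output is an illustrative assumption (the text does not state one). The dataset size, token rate, ~$30 training cost, request volumes, and 5% reduction come from the paragraphs above.

```python
def lora_training_cost(examples, tokens_per_example, usd_per_million_tokens, epochs=1):
    """Training cost in USD at a posted per-million-token rate."""
    total_tokens = examples * tokens_per_example * epochs
    return total_tokens * usd_per_million_tokens / 1_000_000


def payback_days(training_cost, requests_per_day, baseline_cost_per_accepted,
                 relative_reduction):
    """Days for the daily inference savings to cover the training cost."""
    daily_savings = requests_per_day * baseline_cost_per_accepted * relative_reduction
    return training_cost / daily_savings


cost = lora_training_cost(50_000, 1_500, usd_per_million_tokens=0.48)
print(f"training cost: ${cost:.2f}")

# Hypothetical $0.08 baseline cost per accepted output, 5% reduction from the adapter.
for volume in (10_000, 1_000):
    days = payback_days(30, volume, baseline_cost_per_accepted=0.08,
                        relative_reduction=0.05)
    print(f"{volume:>6} req/day -> payback in {days:.2f} days")
```

The calculation covers only the training run. Data curation, eval runs, and ongoing maintenance sit outside it, which is why low-volume workloads can fail to pay back even when this number looks trivial.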
DPO costs roughly 2-3x the equivalent SFT run because the per-step compute is higher and the preference pairs typically require more curation. GRPO is expensive to a degree that depends heavily on the verifier — a fast reliable verifier makes it tractable; a slow or noisy verifier makes it open-ended.
Readiness scorecard
The 90-9 gap is the visible result of teams attempting fine-tuning without the pre-conditions. About 90% of teams believe fine-tuning will help; about 9% actually deploy it (Menlo Ventures, 2024). The gap is not skepticism; it is missing infrastructure. Before greenlighting a fine-tune, confirm:
Eval suite exists and is representative of production traffic. Without evals, the team cannot measure whether the fine-tune helped.
The workload has a clear, named failure mode. "The model gets answers wrong" is not a failure mode; "the model uses the wrong title format for clinical-prior-auth requests in California Medicaid contracts" is.
Labeled or preference-pair data exists at scale, or can be generated from production traces with reasonable curation. Most teams under-budget data preparation by 2-5x.
The base model's license permits the fine-tune and the resulting weights' intended use. Several open-model licenses constrain commercial use of derivatives.
Volume is high enough that the per-month operating-cost reduction exceeds the per-month iteration cost (data refresh, eval re-runs, drift monitoring, periodic retraining as the base model evolves).
Hidden costs
The training-run cost is the visible line. The hidden costs add up to 2-5x the compute cost for most production fine-tuning programs:
Data preparation: cleaning, deduplication, format normalization, sensitive-data scrubbing. For most teams this is the largest single cost line.
Eval infrastructure: building production-representative test sets, automated grading, regression tracking across fine-tune iterations.
Drift maintenance: a fine-tune trained on Q1 traffic may degrade on Q3 traffic if the workload shifts. Plan for quarterly retraining on most production fine-tunes.
Base-model refresh: when the provider releases a new base model (Llama 3.1 to Llama 4, GLM-5 to GLM-5.1), the adapter often needs retraining. Weight distributions shift; the LoRA does not transfer cleanly across model versions. Budget retraining time when planning around base-model release calendars.
Eval portability: if the fine-tune is hosted on a different provider than the closed-API baseline, the eval suite needs to be portable across both. Many teams have one-provider eval harnesses that break when they introduce a second.
Failure modes when starting too early
The pattern that surfaces most often: a team fine-tunes before they have evals, sees a 4% improvement on a curated test set, ships it, and discovers two months later that the production failure mode they were targeting was actually a prompt issue. The fine-tune masked the real problem. The cure costs three months: rewrite the prompt, retrain the LoRA on cleaner data, repeat the evals.
Other common failures: training on labels generated by the same model that will serve the adapter (the model learns its own biases); training a LoRA on a base model that the provider deprecates within six months (no transfer to the successor); under-budgeting data prep and producing a fine-tune that improves on the curated set but degrades on the long tail of production traffic; treating DPO as a stronger SFT (preference pairs from non-expert raters can encode style noise rather than the intended quality dimension).
Use fine-tuning when the workload has named failure modes the team can measure, an eval suite that catches them, enough volume that operating-cost reductions amortize the iteration cost (above roughly 1K requests/day on the target workload as a rough floor), and a deployment story (managed dedicated with multi-LoRA, or a self-managed fleet) that does not double the cost line the fine-tune is supposed to reduce. Default to SFT first; reach for DPO only when stylistic or judgment dimensions need it; reach for GRPO only when SFT/DPO have hit a ceiling and a reliable verifier exists.
Do not fine-tune as a substitute for the funnel from Chapter 15. A fine-tune is a refinement of a model that already passed feasibility, capability, and economic evaluation. Tuning a wrong-fit base model produces a tuned wrong-fit model.
What to measure
Cost per accepted output, before and after fine-tune, on production-representative traffic (not the curated training set)
Quality gate pass rate, repair rate, and escalation rate delta
Training cost amortization period at current volume
Data preparation hours and ongoing curation cost
Eval suite coverage on the fine-tuned model's failure modes
Drift indicators (eval pass rate over time on a held-out monthly slice)
Fine-tuning breaks in a few recurring ways:
The provider deprecates the base model and the LoRA does not transfer to the successor. Plan for retraining; keep the training data portable.
The fine-tune improves on the eval set but degrades on the long tail. Symptoms: escalation rate stays flat, quality complaints rise on edge cases. Treatment: expand the eval set, retrain.
Multi-LoRA scheduling under load. Adapter swapping has overhead; at high QPS with many adapters in rotation, latency variance increases. Measure under realistic adapter-mix conditions before committing to multi-LoRA at scale.
The base model improves faster than the fine-tune retrains. By the time the team has shipped a polished SFT on the v1 base, the provider has shipped v2 that matches the tuned-v1 model unfine-tuned. The fine-tune was right at training time and wrong at deployment time.
Use fine-tuning when: evals exist, the failure mode is named, the volume justifies it, and the deployment story is multi-LoRA on managed dedicated or a self-managed fleet that already pays its operational tax for other reasons. Skip it when: the team is still arguing about what "good" looks like, evals are aspirational, or the workload is below roughly 1K requests per day on the target task.
Calculator hook
The fine-tune amortization analysis takes: base-model serving cost, fine-tune training cost, expected cost-per-accepted-output reduction, monthly volume on the target workload, and ongoing iteration cost. It outputs payback period, sensitivity to volume changes, and the break-even monthly volume below which the fine-tune does not earn its keep.
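A minimal sketch of that analysis, assuming the simplest possible model: the saving per accepted output is the baseline cost times the expected reduction, and iteration cost is a flat monthly figure. The function names and every number in the example are hypothetical placeholders, not values from the calculator.

```python
def saving_per_output(base_cost_per_output: float, reduction: float) -> float:
    """USD saved on each accepted output after the fine-tune."""
    return base_cost_per_output * reduction


def break_even_monthly_volume(base_cost_per_output: float, reduction: float,
                              monthly_iteration_cost: float) -> float:
    """Monthly volume below which savings do not cover ongoing iteration cost."""
    return monthly_iteration_cost / saving_per_output(base_cost_per_output, reduction)


def payback_months(training_cost: float, monthly_volume: float,
                   base_cost_per_output: float, reduction: float,
                   monthly_iteration_cost: float) -> float:
    """Months to recover the training cost; infinite if net saving is not positive."""
    net = (monthly_volume * saving_per_output(base_cost_per_output, reduction)
           - monthly_iteration_cost)
    return training_cost / net if net > 0 else float("inf")


# ASSUMPTIONS (all hypothetical): $0.08 baseline, 20% reduction,
# $480/month iteration cost, $30 one-time training cost.
BASE, REDUCTION, ITERATION, TRAINING = 0.08, 0.20, 480.0, 30.0

print(f"break-even volume: {break_even_monthly_volume(BASE, REDUCTION, ITERATION):,.0f}/month")
for volume in (20_000, 60_000, 300_000):  # sensitivity to volume
    months = payback_months(TRAINING, volume, BASE, REDUCTION, ITERATION)
    print(f"{volume:>7,} outputs/month -> payback {months:.3f} months")
```

With these placeholder inputs the iteration cost, not the training cost, sets the break-even volume, which matches the argument earlier in the chapter.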
Chapter 20: Self-Manage GPUs
The field problem
Eight H100 GPUs, leased from a cloud provider at $25,500/month on a 3-month reservation. vLLM deployed, fine-tuned 70B model loaded, traffic flowing. The first week goes well. Then:
A vLLM update changes the default scheduler policy. p99 latency doubles. The team rolls back, pins the version, and adds the update to a manual review queue.
A KV cache fragmentation bug causes OOM at 36 hours. The team adds rolling restarts to a cron job.
A customer reports quality regression on rare entity names after the team enables FP8 KV quantization. The team adds continuous log analysis.
The on-call engineer gets paged at 2am for a GPU that dropped from the NVLink domain. The team writes a health check script.
Each issue is solvable. The aggregate is an operating-model commitment that the cost comparison did not include.
The mechanism
Self-managing GPUs converts inference from a service purchase to an infrastructure operation. The economic promise is clear: GPU rental at $2-4/GPU-hour is cheaper than managed endpoints or serverless APIs per token at high utilization. The economic reality includes labor, tooling, reliability engineering, and opportunity cost.
The loaded cost of self-managed inference is the GPU bill plus all of those lines.
The naive answer
“GPU rental is $2.50/hour. At 1,000 tokens/sec, that is about $0.69/MTok. Cheaper than the closed API.”
The $0.69/MTok is raw GPU cost divided by peak throughput ($2.50 buys 3.6 million tokens in an hour at 1,000 tokens/sec). It does not include: the serving stack, the deployment pipeline, the monitoring, the on-call rotation, the KV cache management, the model updates, the capacity planning, the networking, the redundancy, the incident response, or the quality regression detection.
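One way to make the gap between raw and loaded cost concrete is to divide both by the tokens the fleet actually delivers rather than by peak throughput. The sketch below uses the 8-GPU lease from the field problem and the low end of the engineering floor this chapter estimates; the goodput and utilization figures, and the function names, are assumptions chosen for illustration.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def monthly_mtok(gpus: int, tokens_per_second_per_gpu: float, utilization: float) -> float:
    """Millions of tokens the fleet delivers per month at the given average utilization."""
    seconds = HOURS_PER_MONTH * 3600
    return gpus * tokens_per_second_per_gpu * utilization * seconds / 1_000_000


def usd_per_mtok(monthly_cost_usd: float, mtok: float) -> float:
    """Monthly cost spread over delivered tokens."""
    return monthly_cost_usd / mtok


GPU_LEASE = 25_500.0    # 8x H100 lease from the field problem, USD/month
ENGINEERING = 60_000.0  # low end of the loaded engineering floor, USD/month
# ASSUMPTIONS (hypothetical): 2,000 tok/s of goodput per GPU, 50% average utilization.
delivered = monthly_mtok(gpus=8, tokens_per_second_per_gpu=2_000, utilization=0.5)

raw = usd_per_mtok(GPU_LEASE, delivered)
loaded = usd_per_mtok(GPU_LEASE + ENGINEERING, delivered)
print(f"raw GPU cost: ${raw:.2f}/MTok, loaded cost: ${loaded:.2f}/MTok")
```

Networking, storage, redundancy, and incident cost would add further lines to the numerator; they are omitted here only to keep the sketch short.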
The better model
Self-managed GPUs are a valid choice when three conditions are met:
1. Volume justifies the fixed cost. An engineering team capable of operating self-managed inference at production quality is a fixed cost that scales sub-linearly with fleet size. The realistic floor is 2-3 loaded FTE: an on-call rotation needs at least two engineers (one alone burns out in a quarter), the deployment pipeline needs an owner through serving-engine upgrades, KV / scheduler / quantization tuning is its own specialization, and quality-regression detection (continuous log review) is a recurring process. At FAANG-comparable rates that is roughly $60,000-$90,000/month of loaded engineering cost regardless of whether the fleet is 8 GPUs or 80. At 10 GPUs leased at the field-problem rate (about $3,200 per GPU-month, or roughly $32,000/month in total), that engineering cost is two to three times the GPU spend, which erases the cost advantage over managed dedicated. At 100 GPUs the same team can still cover the fleet and the tax falls to roughly 20-30% of GPU spend. The fleet scale at which self-managed economics start to work is around 50 GPUs sustained, not 10. Below 50, managed dedicated almost always wins on total cost.
2. The team has the operational capability. Running inference at production quality requires: a serving engine (vLLM, SGLang, TensorRT-LLM, or equivalent), a deployment pipeline, model loading and rollback, health checks, autoscaling, KV cache monitoring, latency monitoring, quality monitoring, and incident response. If the team does not have these capabilities, the learning curve is part of the migration cost.
3. The workload benefits from control. Self-managed GPUs allow tuning that managed endpoints may not expose: scheduler policy, batch size limits, KV cache configuration, prefix cache TTL, tensor parallelism strategy, quantization settings, custom model merges, and hardware selection. If the workload needs this control to meet quality or latency targets, self-management is not just cheaper—it is necessary.
Do not self-manage GPUs to save money on a small workload. Self-manage when the workload is large enough to amortize the engineering overhead, the team has operational infrastructure experience, and the workload benefits from serving-stack control that managed endpoints do not provide. Use the dedicated utilization gate from Chapter 18 as a prerequisite: if the workload does not clear the managed dedicated break-even, it will not clear the self-managed break-even either (self-managed has higher operational overhead even if the GPU-hour price is lower).
What to measure
Total GPU fleet cost including networking, storage, and redundancy
Engineering time on inference operations (deployment, monitoring, incidents, upgrades)
Goodput under SLO per GPU (same measurement as managed dedicated)
Incident frequency and mean time to resolve
Model update frequency and rollback rate
Idle GPU hours and capacity utilization by hour
Self-managed breaks when the team underestimates operational burden. Common failure modes:
Serving-engine upgrades that change behavior (scheduler, memory management, API surface)
GPU hardware failures that require vendor coordination and spare capacity
Multi-tenant workloads that need isolation, quotas, and priority scheduling
Compliance requirements that need audit logs, access controls, and data handling documentation
Scaling events that outpace the team’s capacity planning
The reversion path from self-managed to managed dedicated or serverless is easier than the forward path. Keep the eval suite and prompt configurations portable. If self-management becomes more expensive than the alternatives—because the team is too small, the workload shrinks, or managed offerings improve—revert.
When self-managed stops earning its keep
Self-management is not a permanent state. The conditions that made it economic in one quarter can erode in the next. Treat the following as devolution signals—evidence that the fleet should move back to managed dedicated (or, for the bursty slice, to serverless):
Sustained fleet below ~50 GPUs. The engineering floor of 2-3 loaded FTE does not shrink when the fleet does. If the workload contracts and the fleet runs below ~50 GPUs for two consecutive quarters, the operational tax climbs back above 30% of GPU spend and managed dedicated wins on total cost.
On-call burnout or attrition. If the on-call rotation drops below two engineers, or page volume drives sustained attrition, the team is no longer running infrastructure—it is being run by it. This is a leading indicator of incidents, not a cost problem.
Specialist headcount cannot be filled. If a serving-engine, scheduler, or quantization specialist seat sits open for more than a quarter, the team is accumulating tuning debt. Managed dedicated absorbs that specialization for you.
Serving-engine drift two majors behind. If the deployed vLLM, SGLang, or TensorRT-LLM version falls two major releases behind upstream, the gap between your performance and what a managed provider can offer on the same hardware widens every quarter. At two-majors-behind, the migration cost back to managed is usually less than the goodput delta.
Quality-regression detection stops keeping up. If continuous log review or eval-on-production drifts to weekly (or monthly) cadence because the team is firefighting, you have lost the quality monitoring that justified self-management in the first place.
When two or more of these fire, run the dedicated utilization gate from Chapter 18 against managed dedicated pricing on the same fleet. Devolution is not failure—it is the same evidence-based decision that justified self-management in the first place, applied in the other direction.
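The "two or more signals" rule can be expressed as a small check that runs on a quarterly cadence. This is a sketch under stated assumptions: the `FleetStatus` fields and the exact thresholds are one hypothetical encoding of the signals above, and attrition driven by page volume is left out because it does not reduce to a single number.

```python
from dataclasses import dataclass


@dataclass
class FleetStatus:
    gpus_by_quarter: list              # sustained fleet size, most recent quarter last
    on_call_engineers: int
    specialist_seat_open_quarters: float
    engine_majors_behind: int
    quality_review_interval_days: int


def devolution_signals(s: FleetStatus) -> list:
    """Return a description of every devolution signal that is currently firing."""
    fired = []
    if len(s.gpus_by_quarter) >= 2 and all(g < 50 for g in s.gpus_by_quarter[-2:]):
        fired.append("fleet below ~50 GPUs for two consecutive quarters")
    if s.on_call_engineers < 2:
        fired.append("on-call rotation below two engineers")
    if s.specialist_seat_open_quarters > 1:
        fired.append("specialist seat open for more than a quarter")
    if s.engine_majors_behind >= 2:
        fired.append("serving engine two major releases behind")
    if s.quality_review_interval_days >= 7:
        fired.append("quality review slipped to weekly or slower")
    return fired


status = FleetStatus([64, 44, 40], on_call_engineers=2,
                     specialist_seat_open_quarters=1.5,
                     engine_majors_behind=1, quality_review_interval_days=1)
fired = devolution_signals(status)
rerun_utilization_gate = len(fired) >= 2
print(fired, rerun_utilization_gate)
```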
Calculator hook
Self-managed variant inputs: GPU fleet cost, engineering overhead allocation, measured goodput, and incident cost. Sensitivity: engineering FTE cost, GPU utilization, and goodput per GPU.
Chapter 21: Multi-Source And Background Agent Routing
The field problem
Two providers serve the same open model. The routing layer sends 70% of traffic to the cheaper provider and 30% to the faster provider. The reported cost reduction: 25%. Six months later, an audit reveals:
The cheaper provider has a 4% higher error rate, causing silent retries that the routing layer handles but the cost model does not attribute.
The two providers return slightly different outputs for the same prompt because they run different quantization and serving configurations. The eval suite, built against one provider’s outputs, does not catch the quality delta.
Cache hit rate dropped from 72% to 45% because traffic is split across two providers, each with independent prefix caches.
Incident response is harder because each provider has different monitoring APIs, different error codes, and different support channels.
The 25% “savings” is 10% savings with 15% hidden cost.
The mechanism
Multi-source architecture routes workloads across multiple providers or deployment surfaces. The promise is cost optimization, reliability through redundancy, and reduced vendor lock-in. The reality includes measurement complexity, cache fragmentation, prompt-portability overhead, and incident surface area.
Multi-source works when the measurement discipline is strong enough to catch the hidden costs. It fails when the team optimizes the visible cost (token price) while ignoring the invisible costs (error rate delta, cache fragmentation, eval gap, operational overhead).
The naive answer
“Use two providers. Route to the cheapest one. Fail over to the other.”
This works for failover. It does not work for cost optimization unless the cost measurement includes: per-provider error rate, per-provider quality gate pass rate, per-provider cache hit rate, per-provider latency distribution, and the routing layer’s own cost and complexity.
The better model
Multi-source adds value in three scenarios:
1. Failover. A secondary provider handles traffic when the primary is down. This is reliability engineering, not cost optimization. The secondary provider may cost more per token. That is acceptable if the alternative is zero availability. Measure: failover activation rate, failover quality delta, failover latency delta.
2. Workload-level routing. Different workloads go to different providers based on the routefit matrix. Support chat goes to Provider A. Batch extraction goes to Provider B. Each workload stays on one provider, preserving cache locality and eval consistency. Measure: per-workload LCPR on the assigned provider.
3. Cost arbitrage within one workload. This is the hardest to do correctly. Traffic for a single workload splits across providers. Requires: identical model (or model family with validated quality parity), per-provider quality measurement, per-provider cache strategy, and per-provider incident response. Measure: LCPR by provider for the same workload, including all hidden costs.
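To see why the cheaper token price does not settle the arbitrage question, it helps to fold error rate, cache hit rate, and quality gate pass rate into one per-provider number. The sketch below is a simplified stand-in for a per-provider LCPR, not the book's full definition: it treats all tokens as cacheable input, assumes failed calls are retried independently until one succeeds, and uses invented provider figures throughout.

```python
from dataclasses import dataclass


@dataclass
class ProviderStats:
    name: str
    usd_per_mtok_uncached: float
    usd_per_mtok_cached: float
    cache_hit_rate: float   # share of tokens served from the prefix cache
    error_rate: float       # share of calls that fail and are retried
    pass_rate: float        # quality gate pass rate on successful calls


def cost_per_accepted(p: ProviderStats, mtok_per_request: float) -> float:
    """Expected USD spent per output that passes the quality gate."""
    blended = (p.cache_hit_rate * p.usd_per_mtok_cached
               + (1 - p.cache_hit_rate) * p.usd_per_mtok_uncached)
    attempts = 1 / (1 - p.error_rate)  # expected calls per successful response
    return blended * mtok_per_request * attempts / p.pass_rate


# ASSUMPTION: every figure below is hypothetical.
cheap = ProviderStats("cheaper", 0.40, 0.10, cache_hit_rate=0.45, error_rate=0.06, pass_rate=0.92)
fast = ProviderStats("faster", 0.50, 0.125, cache_hit_rate=0.72, error_rate=0.02, pass_rate=0.95)

for p in (cheap, fast):
    print(f"{p.name}: ${cost_per_accepted(p, mtok_per_request=0.002):.6f} per accepted output")
```

In this invented example the provider with the lower list price ends up more expensive per accepted output once its lower cache hit rate, higher error rate, and lower pass rate are counted.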
Background agents and memory-hierarchy routing
Multi-source routing has a specific variant that applies to agent workloads: routing different steps of the same agent trajectory to different serving surfaces based on latency tolerance.
A typical case: coding agents and an interactive chat product share the same infrastructure. The chat product needs 200ms TTFT and streaming at 60 tokens per second. The coding agents run tasks that take 5-30 minutes, make 20-80 model calls per task, and do not need streaming. Both workloads compete for the same GPU capacity, the same rate limits, and the same prefix cache. Chat latency degrades during peak agent activity. The agents are paying premium real-time prices for work that could tolerate 10x higher latency.
The distinction between answer inference and agentic inference from Part 3 Chapter 12 has an architectural consequence. Background agents—coding assistants, research pipelines, document processing workflows, data enrichment tasks—can operate on a different cost-latency frontier than interactive chat:
1. Cheaper compute tier. Background agents can use batch APIs, off-peak scheduling, or lower-priority processing tiers. Google’s Flex tier, Anthropic’s Batch API, and similar surfaces offer 50% discounts for async processing (see vendor pricing pages, May 2026). The agent does not need real-time response.
2. Memory-hierarchy optimization. Agent state—KV cache, conversation history, tool outputs, intermediate results—can be stored across a memory hierarchy rather than held in expensive GPU HBM. Active KV in HBM for the current inference call. Recently used context in host memory or SSD for fast retrieval. Historical context in object storage or databases. The cost per bit drops by orders of magnitude across tiers.
3. Checkpointing and resumability. A 30-minute agent task should not restart from scratch if interrupted. Checkpointing intermediate state allows the task to resume from the last checkpoint rather than replaying the entire trajectory. This trades storage cost for inference cost.
4. Queue-based execution. Agent tasks can be queued, prioritized, and executed when capacity is available rather than demanding immediate GPU allocation. This smooths demand, improves fleet utilization, and enables cost-aware scheduling.
Separate the serving surfaces by latency requirement:
| Surface | Latency requirement | Billing mode | Use for |
|---|---|---|---|
| Real-time streaming | TTFT < 500ms | Per-token, standard | Interactive chat, support |
| Standard async | E2E < 60s | Per-token, standard | Single-call agent steps needing fast iteration |
| Batch | Completion within 24hr | Per-token, 50% discount | Eval runs, extraction, enrichment |
| Background queue | Best-effort, priority-based | Per-token or per-GPU-hour | Multi-step agent tasks, research pipelines |
For a coding agent: the interactive planning step where the human reviews the approach may need streaming. The file reading, test execution, and code writing steps do not. The repair loop after a test failure does not. Routing these steps to cheaper compute surfaces can reduce task lifecycle cost by 30-50% without affecting user experience, because the user is not watching each intermediate step.
Multi-source for failover is straightforward and recommended for critical workloads. Multi-source for workload-level routing follows naturally from the routefit matrix. Multi-source for cost arbitrage within a single workload requires per-provider LCPR measurement including error rates, cache hit rates, quality gates, and operational overhead. If you cannot measure per-provider LCPR, you cannot prove the arbitrage works.
For agent workloads specifically, decompose the task trajectory into steps. Route human-facing steps to real-time endpoints. Route machine-facing steps (tool execution, file operations, eval checks, repair loops) to batch or async endpoints where the latency tolerance allows. The savings come from using the right cost-latency tier for each step, not from making everything fast or everything cheap.
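The step-level routing rule can be sketched as a function from a step's latency tolerance to one of the four surfaces in the table above. The thresholds mirror the table; the function names, the `max_wait_seconds` parameter, and the assumption that every discounted call earns exactly the 50% batch discount are illustrative simplifications.

```python
from enum import Enum


class Surface(Enum):
    REAL_TIME = "Real-time streaming"
    STANDARD_ASYNC = "Standard async"
    BATCH = "Batch"
    BACKGROUND_QUEUE = "Background queue"


def choose_surface(human_facing: bool, max_wait_seconds: float) -> Surface:
    """Pick the cheapest surface whose latency behaviour the step can tolerate."""
    if human_facing:
        return Surface.REAL_TIME
    if max_wait_seconds <= 60:
        return Surface.STANDARD_ASYNC
    if max_wait_seconds >= 24 * 3600:
        return Surface.BATCH          # batch only promises completion within 24 hours
    return Surface.BACKGROUND_QUEUE   # best-effort, priority-based


def lifecycle_saving(discounted_share: float, discount: float = 0.5) -> float:
    """Fractional reduction in task cost if that share of spend earns the discount."""
    return discounted_share * discount


steps = {
    "plan review with human": (True, 1),
    "repair loop after test failure": (False, 45),
    "bulk file summarisation": (False, 900),
    "nightly eval check": (False, 24 * 3600),
}
for name, (human_facing, wait) in steps.items():
    print(f"{name}: {choose_surface(human_facing, wait).value}")
print(f"saving at 60% discounted share: {lifecycle_saving(0.6):.0%}")
```

Under that flat-discount simplification, a 30-50% reduction in task lifecycle cost corresponds to moving 60-100% of spend to a 50%-discount tier.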
What to measure
LCPR by provider per workload (not blended)
Error rate delta between providers
Cache hit rate by provider (split traffic fragments caches)
Quality gate pass rate by provider
Incident frequency and resolution time by provider
Routing layer latency and cost overhead
Task lifecycle cost by step type (human-facing vs machine-facing)
Batch-eligible share of agent model calls
Queue depth and wait time for background tasks
Checkpointing overhead vs restart cost
Multi-source becomes harmful when providers run different quantizations or serving configurations of the “same” model, cache fragmentation costs more than the per-token savings, the routing layer adds latency that violates SLOs, or the team cannot maintain eval suites covering both providers’ output distributions.
Background agent architectures break when the user expects real-time visibility into every agent step, task state is too large to checkpoint efficiently, the latency tolerance assumption is wrong, or rate limits on batch APIs prevent the agent from making progress.
Calculator hook
Multi-source variant inputs: per-provider cost, per-provider quality gates, cache fragmentation estimate, and routing overhead. Agentic variant with step-level routing: per-step latency tier, batch discount eligibility, checkpoint cost, and queue economics. Sensitivity: provider split ratio, batch-eligible share of model calls, and latency tolerance by step.
Chapter 22: Migration Gates And Reversion Signals
The field problem
Four months to migrate from a closed API to self-managed GPUs. Six months after that, the serving engine vendor releases a breaking change, two GPUs fail in the same week, and the on-call engineer quits. Nobody defined what conditions would trigger reverting to a managed service. The decision to stay self-managed is made by inertia, not analysis.
The mechanism
Migration gates are preconditions that must be met before a migration begins. Reversion signals are conditions that, when triggered, indicate the migration should be partially or fully reversed. Both should be defined before the migration starts—not during an incident.
Migration gates
Every migration should pass these gates before proceeding:
| Gate | Question | Evidence required |
|---|---|---|
| Eval readiness | Does a production-representative eval suite exist for the workload? | Eval set with coverage analysis |
| Quality floor | Does the candidate route meet the minimum quality threshold? | Eval results on candidate route |
| LCPR measurement | Has LCPR been measured (not estimated) on the candidate route? | Trace-derived LCPR with evidence label |
| Compliance clearance | Does the candidate route meet data residency, model rights, and contractual requirements? | Compliance checklist with owner sign-off |
| Operational readiness | Does the team have monitoring, deployment, and incident response for the candidate route? | Runbook, alerting, and on-call coverage |
| Rollback path | Can the workload revert to the previous route within a defined time window? | Tested rollback procedure |
| Traffic ramp plan | Is there a staged traffic ramp with quality checkpoints at each stage? | Ramp plan with stop criteria |
Missing any gate does not prohibit migration. It increases risk. The team should decide explicitly which gates to skip and what additional monitoring compensates.
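Making the skip decision explicit is easier when the checklist is a data structure rather than a meeting note. The sketch below is one hypothetical way to record it: a gate either has evidence, or is skipped with named compensating monitoring, or is unresolved and still needs a decision. The class and field names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Gate:
    name: str
    evidence: str = ""      # pointer to the artifact; empty means the gate is not met
    compensation: str = ""  # extra monitoring agreed on if the gate is skipped


def review(gates):
    """Split gates into passed, explicitly skipped, and unresolved (by name)."""
    passed = [g.name for g in gates if g.evidence]
    skipped = [g.name for g in gates if not g.evidence and g.compensation]
    unresolved = [g.name for g in gates if not g.evidence and not g.compensation]
    return passed, skipped, unresolved


gates = [
    Gate("Eval readiness", evidence="eval set with coverage analysis"),
    Gate("LCPR measurement", compensation="daily LCPR report during the ramp"),
    Gate("Rollback path"),
]
passed, skipped, unresolved = review(gates)
print("passed:", passed)
print("skipped with compensation:", skipped)
print("needs an explicit decision:", unresolved)
```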
Reversion signals
Define these before migration. When any signal fires, evaluate whether to revert:
Quality signals:
Eval pass rate drops below the quality floor for 2 consecutive measurement windows
Human escalation rate increases by more than 5 percentage points
New failure modes appear that the eval suite does not cover (catch through continuous log review)
Cost signals:
LCPR on the new route exceeds LCPR on the old route for 2 consecutive weeks
Cache hit rate drops below the break-even threshold (from Part 2 Chapter 7)
Utilization falls below `u_required` for dedicated endpoints (from Chapter 18)
Operational signals:
Incident frequency on the new route exceeds the old route by 2x
Mean time to resolve incidents exceeds the team’s SLO
On-call burden increases to the point of team attrition risk
Latency signals:
p99 TTFT or TPOT exceeds SLO for more than 1% of measurement windows
Latency variance (p99/p50 ratio) increases by more than 50%
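Several of these signals share a shape: a threshold breached for N consecutive measurement windows. A minimal sketch of that check follows, with a separate helper for the latency-variance signal. The function names, the 0.90 quality floor, and the sample series are hypothetical.

```python
def fires_consecutively(values, is_breach, windows=2):
    """True if is_breach holds for `windows` consecutive entries anywhere in values."""
    run = 0
    for value in values:
        run = run + 1 if is_breach(value) else 0
        if run >= windows:
            return True
    return False


def variance_signal(baseline_ratio, current_ratio, max_increase=0.50):
    """True if the p99/p50 ratio grew by more than max_increase over baseline."""
    return current_ratio > baseline_ratio * (1 + max_increase)


QUALITY_FLOOR = 0.90  # ASSUMPTION: hypothetical floor
eval_pass_rates = [0.93, 0.89, 0.91, 0.88, 0.87]  # one entry per measurement window
quality_fired = fires_consecutively(eval_pass_rates, lambda r: r < QUALITY_FLOOR)

# Weekly (new route, old route) LCPR pairs, hypothetical values.
weekly_lcpr = [(1.10, 1.20), (1.25, 1.20), (1.18, 1.20)]
cost_fired = fires_consecutively(weekly_lcpr, lambda pair: pair[0] > pair[1])

print(quality_fired, cost_fired, variance_signal(3.0, 4.8))
```

Requiring consecutive breaches is what keeps a single noisy window from triggering a reversion review.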
The naive answer
“We migrated. It’s done. Why would we go back?”
Migration is not a one-way door. The conditions that made migration economic can change: the old provider cuts prices, the new route’s quality degrades, the operational burden exceeds estimates, the team shrinks, the workload changes shape. Reversion is not failure. Reversion is evidence-based decision-making applied to the migration itself.
The better model
Treat migration as a staged experiment with explicit stopping criteria:
Stage 1: Shadow traffic (5-10%). Route a small share of traffic to the new route. Compare quality, latency, and LCPR against the baseline. Do not expose to users. Duration: 1-2 weeks.
Stage 2: Canary traffic (10-25%). Expose a controlled share of users to the new route. Monitor quality signals and user feedback. Duration: 2-4 weeks.
Stage 3: Majority traffic (50-90%). If canary passes, ramp to majority. The old route remains available as fallback. Monitor all reversion signals. Duration: 4-8 weeks.
Stage 4: Full migration (100%). Decommission the old route only after the reversion signal monitoring has been clean for the full Stage 3 duration. Keep the rollback path tested for at least one more quarter.
Each stage has stop criteria. If any reversion signal fires during a stage, hold the ramp. If the signal persists, revert to the previous stage. Document the reason. Revisit when the condition is resolved.
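The hold, revert, and advance rules above fit in a small decision function. This is a sketch, not a rollout controller: the stage names follow the four stages described, while the function name and its boolean inputs (whether a signal fired, whether it persisted, whether the stage's minimum duration has elapsed) are assumptions about how a team might feed it.

```python
STAGES = ["shadow", "canary", "majority", "full"]


def next_step(stage, signal_fired, signal_persists, min_duration_met):
    """Return (action, stage to run next) for the current ramp stage."""
    i = STAGES.index(stage)
    if signal_fired and signal_persists:
        # Persisting signal: step back one stage (shadow has nowhere to go).
        return "revert", STAGES[max(i - 1, 0)]
    if signal_fired:
        # First occurrence: hold the ramp and keep watching.
        return "hold", stage
    if min_duration_met and i < len(STAGES) - 1:
        return "advance", STAGES[i + 1]
    return "hold", stage


print(next_step("canary", False, False, True))
print(next_step("majority", True, False, True))
print(next_step("majority", True, True, True))
```

Documenting the reason for each hold or revert stays a human task; the function only makes the decision rule unambiguous.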
Define migration gates and reversion signals before starting the migration. Write them down. Assign owners to each signal. Review them at each stage transition. Reversion is not a political decision—it is a measurement outcome. The team should feel as comfortable reverting as advancing, because both are evidence-driven.
What to measure
All migration gate evidence (eval coverage, quality floor, LCPR, compliance, operational readiness)
All reversion signals (quality, cost, operational, latency) at defined measurement cadence
Traffic split and ramp stage
Time at each stage
Reversion events and root causes
Migration gates and reversion signals break when:
The measurement infrastructure is not in place. You cannot monitor reversion signals you cannot measure. Invest in measurement before migration.
Political pressure overrides evidence. A VP who championed the migration may resist reversion even when the signals are clear. Define the decision process before the migration starts.
The old route is decommissioned too early. If the rollback path is dismantled during Stage 3, reversion becomes a new migration instead of a switch.
The signals are too sensitive or too insensitive. Too many false alarms cause signal fatigue. Too few miss real regressions. Calibrate thresholds against the old route’s baseline variance.
Involuntary migration: when the provider moves first
Most of this part has treated migration as a decision the team initiates. The other category — the one teams underprepare for — is when the provider deprecates a model mid-contract. The deprecation notice can be 90 days, 180 days, or in two specific cases observed in 2025, less than 30 days for a model the team had treated as production-stable for two years. Reverting to a prior version is not always available; sometimes the prior version is also being deprecated on the same notice.
Three things to negotiate before a deprecation hits, not after:
Minimum deprecation notice period. Industry baseline is 90 days; some enterprise contracts negotiate 180 or 365. Without this clause, the provider can default to whatever their public terms say, which is usually the shortest interval the provider can defend.
Behavioral stability for the immediate successor. A successor model with a different output distribution is functionally a migration even when the API surface is identical. Contractually, "successor" should mean a model that passes the team's eval suite at a threshold the team specifies, not whatever the provider names next.
Right to test the successor on real traffic for a stated window before the deprecation date. Three weeks is the minimum to catch quality regressions; six weeks lets the team run an eval on a representative production sample and tune prompts if needed.
Operational preparation matters as much as the contract. The most expensive part of an unplanned migration is the prompt rewrite, which is not paid for by the provider and is not budgeted by the team. Keep a portable prompt format that can target two providers without translation. Keep a recent eval baseline (less than 60 days old) on every production workload, so the successor can be tested against the same bar the incumbent passes. Keep a rollback path that does not depend on the deprecated model continuing to be available — a route that has not been tested in six months is functionally not a route.
An anonymized case: a Series-B SaaS team running answer-drafting on a frontier model received a 70-day deprecation notice for the model version that anchored their carefully tuned prompts. The successor existed but produced a 14% lower accept rate on the team's eval suite. The team had two options: rewrite 28 prompt templates to recover the accuracy gap (estimated 6 engineer-weeks, with a risk of new regressions), or switch to a competing provider whose model passed the eval at a similar accept rate (estimated 4 engineer-weeks for the integration plus a one-time eval cost of ~$11K). They picked the competitor. The side-finding: the team had been planning for six months to add a second provider for resilience and had never prioritized it. The forced migration converted a deferred infrastructure project into a delivered one, and the prompt portability work paid for itself within the next quarter on a separate workload.
Calculator hook
The migration gate checklist is a prerequisite to the calculator’s route comparison. The calculator should flag when inputs lack evidence labels or when the LCPR estimate is based on pricing-page arithmetic rather than measured traces. The reversion signal dashboard tracks post-migration LCPR, quality, and operational metrics against the pre-migration baseline.
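The flagging behaviour described here can be sketched as a pass over labeled inputs. The labels PUBLIC, DERIVED, and OPINION are the ones used in this part's evidence notes; the `MEASURED` label, the `lcpr` naming convention, and the function itself are hypothetical and stand in for whatever the calculator actually uses.

```python
MEASURED = "MEASURED"  # ASSUMPTION: hypothetical label for trace-derived values
KNOWN_LABELS = {MEASURED, "PUBLIC", "DERIVED", "OPINION"}


def flag_inputs(inputs):
    """inputs maps name -> (value, evidence label or None); returns warning strings."""
    warnings = []
    for name, (value, label) in inputs.items():
        if label not in KNOWN_LABELS:
            warnings.append(f"{name}: missing or unknown evidence label")
        elif name.startswith("lcpr") and label != MEASURED:
            warnings.append(f"{name}: LCPR is {label}, not measured from traces")
    return warnings


inputs = {
    "lcpr_candidate": (1.40, "DERIVED"),   # pricing-page arithmetic
    "gpu_hourly_rate": (2.95, "PUBLIC"),
    "cache_hit_rate": (0.60, None),        # no label recorded
}
for warning in flag_inputs(inputs):
    print(warning)
```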
Part 4 Summary
Part 4 established that migration is not one decision—it is a sequence of increasingly committed choices, each with its own economics, failure modes, and reversion paths.
| Concept | Where defined | What it does |
|---|---|---|
| Model candidate funnel | Chapter 15 | Eliminates infeasible options before comparing cost |
| routefit matrix | Chapter 15 | Maps workload identity to feasible routes with measured LCPR |
| Do Nothing payback | Chapter 16 | Proves migration cost exceeds savings within the planning horizon |
| Serverless open migration | Chapter 17 | First step: prompt portability, eval work, no infra ownership |
| Dedicated utilization gate | Chapter 18 | Confirms utilization clears `u_required`, the managed dedicated break-even |
| Fine-tuning and post-training as cost levers | Chapter 19 | Smaller fine-tuned models can clear the quality floor at lower LCPR; treat fine-tuning as a route option, not a model upgrade |
| Self-managed operating model | Chapter 20 | GPU ownership requires engineering team, on-call, tooling |
| Multi-source and agent routing | Chapter 21 | Per-provider LCPR, cache fragmentation, step-level latency-tier routing |
| Migration gates and reversion | Chapter 22 | Preconditions before migration; signals that trigger rollback |
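The progression follows a commitment gradient, from least to most operational commitment: Do Nothing, serverless open, managed dedicated, self-managed, multi-source.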
Each step requires the previous step’s capabilities plus new ones. Serverless open requires prompt portability and evals. Managed dedicated requires utilization discipline. Self-managed requires operational capability. Multi-source requires per-provider measurement. Skipping a step does not skip its requirements.
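Taking the cumulative rule literally, the requirements for any route are its own plus those of every earlier step. The sketch below encodes that; the route identifiers and function names are illustrative, and the capability strings are taken from the paragraph above.

```python
from typing import List, Set, Tuple

# Ordered from least to most operational commitment.
# Each entry lists only the capabilities that step adds.
COMMITMENT_GRADIENT: List[Tuple[str, List[str]]] = [
    ("do_nothing", []),
    ("serverless_open", ["prompt portability", "evals"]),
    ("managed_dedicated", ["utilization discipline"]),
    ("self_managed", ["operational capability"]),
    ("multi_source", ["per-provider measurement"]),
]


def required_capabilities(route: str) -> List[str]:
    """Capabilities needed for `route`: its own plus every earlier step's."""
    needed: List[str] = []
    for name, new_caps in COMMITMENT_GRADIENT:
        needed.extend(new_caps)
        if name == route:
            return needed
    raise ValueError(f"unknown route: {route}")


def missing_capabilities(route: str, have: Set[str]) -> List[str]:
    """Capabilities still to build before committing to `route`."""
    return [cap for cap in required_capabilities(route) if cap not in have]


# A team that jumps straight to self-managed GPUs with only evals and ops in place:
gaps = missing_capabilities("self_managed", {"evals", "operational capability"})
print(gaps)
```

In this example the skipped steps surface as gaps: prompt portability and utilization discipline are still owed even though the team never ran on serverless or managed dedicated capacity.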
Part 5 takes these deployed workloads and asks the operating question: how do you know the decision is still correct? Baselines, evals, benchmarks, observability, trace-to-loaded-cost reconciliation, incidents, review packs, and forecasting.
Evidence Notes for Part 4
| # | Claim | Label | Source | Chapter |
|---|---|---|---|---|
| 1 | Together H100 $3.99/hr on-demand, H200 $5.49/hr, B200 $9.95/hr | PUBLIC | Together AI pricing page, accessed 2026-05-12 | 18, 20 |
| 2 | Together reserved pricing: H100 91-180 day $2.95/hr, H200 $3.29/hr | PUBLIC | Together AI pricing page, accessed 2026-05-12 | 18 |
| 3 | Lambda H100 on-demand $4.29/hr (1x GPU); reserved pricing requires custom negotiation | PUBLIC | Lambda pricing page, accessed 2026-05-13 | 18 |
| 4 | Fireworks H100 $7/hr, H200 $7/hr, B200 $10/hr | PUBLIC | Fireworks AI pricing page, accessed 2026-05-12 | 20 |
| 5 | Bedrock Custom Model Unit v1.0 $0.05718/min with 5-min minimum billing | PUBLIC | AWS Bedrock pricing page, accessed 2026-05-12 | 18 |
| 6 | Batch APIs ~50% discount across OpenAI, Anthropic, Google, Fireworks, Groq | PUBLIC | Official pricing pages, accessed 2026-05-12 | 21 |
| 7 | Google Flex/Batch tiers offer discounted processing | PUBLIC | Google Gemini pricing page, accessed 2026-05-12 | 21 |
| 8 | Dedicated utilization gate formula | DERIVED | From LCPR normalized to capacity economics | 18 |
| 9 | Migration gate checklist | OPINION | Author synthesis from production migration patterns | 22 |
| 10 | Reversion signal categories | OPINION | Author synthesis from production incident patterns | 22 |
| 11 | Payback period calculation for Do Nothing | DERIVED | Standard ROI framework applied to inference migration | 16 |
| 12 | Background agent latency-tier decomposition | DERIVED | From Chapter 12 answer-vs-agentic distinction applied to serving surfaces | 21 |
| 13 | Multi-source cache fragmentation risk | DERIVED | From Part 2 cache-locality analysis applied to multi-provider routing | 21 |
| 14 | Self-managed operational overhead categories | OPINION | Author synthesis from infrastructure operations | 20 |
| 15 | Model candidate funnel stages | DERIVED | Standard vendor evaluation applied to inference model selection | 15 |
| 16 | routefit matrix design | DERIVED | From workload identity fields mapped to serving surfaces | 15 |
| 17 | All worked example numbers | SYNTHETIC | Illustrative, shaped by real billing semantics and pricing snapshots | 17, 18 |
| 18 | Cold start reduced from tens of minutes to ~50 seconds via CUDA C/R | PUBLIC | Modal, “Truly Serverless GPUs,” 2026 | 18 |