<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://kraghavan.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://kraghavan.github.io/" rel="alternate" type="text/html" /><updated>2026-04-22T20:35:38+00:00</updated><id>https://kraghavan.github.io/feed.xml</id><title type="html">Karthika Raghavan</title><subtitle>Engineering blog on distributed systems, LLM infrastructure, and observability</subtitle><author><name>Karthika Raghavan</name></author><entry><title type="html">llm-d in Action — EPP Prefix Cache Routing and What It Actually Means</title><link href="https://kraghavan.github.io/llm-infrastructure/inference/2026/04/19/llm-d-epp-prefix-cache-results.html" rel="alternate" type="text/html" title="llm-d in Action — EPP Prefix Cache Routing and What It Actually Means" /><published>2026-04-19T00:00:00+00:00</published><updated>2026-04-19T00:00:00+00:00</updated><id>https://kraghavan.github.io/llm-infrastructure/inference/2026/04/19/llm-d-epp-prefix-cache-results</id><content type="html" xml:base="https://kraghavan.github.io/llm-infrastructure/inference/2026/04/19/llm-d-epp-prefix-cache-results.html"><![CDATA[<p>In the <a href="/llm-infrastructure/inference/2026/04/17/vllm-llm-d-nvidia-gh200-experiment.html">previous post</a> I documented everything that broke during the llm-d deployment on a Lambda Labs GH200. Ten gotchas, two false starts, one ticking hourly bill.</p>

<p>This post is the payoff — and more than that, it’s an attempt to say something useful beyond “look at these numbers.” The numbers are good. But the more interesting question is what they reveal about inference architecture that applies regardless of which GPU you’re running on or which serving framework you’ve chosen.</p>

<p><strong>Hardware:</strong> Lambda Labs GH200 480GB, ARM64<br />
<strong>Model:</strong> Qwen/Qwen3-0.6B<br />
<strong>Stack:</strong> llm-d v0.4.0, vllm/vllm-openai:latest, Istio gateway, kube-prometheus-stack<br />
<strong>Load testing:</strong> Locust with tenant simulation<br />
<strong>Observability:</strong> Prometheus + Grafana (llm-d Performance Dashboard, llm-d vLLM Overview)</p>

<hr />

<h2 id="the-hardware-deserves-more-than-a-footnote">The Hardware Deserves More Than a Footnote</h2>

<p>Most inference benchmarks treat hardware as a footnote — “tested on an A100” — without explaining why the hardware choice matters architecturally. The GH200 is worth understanding properly because it represents a design direction that the industry is converging on, and it changes some assumptions about what’s possible in inference.</p>

<p>One quick note before the specs: the GH200 is technically a unified memory architecture — just like the M4 Mac Mini from the previous post. CPU and GPU share the same address space without copying data between them. The Mac Mini has 16GB at roughly 200 GB/s. The GH200 has 576GB at up to 4,000 GB/s.</p>

<p>One costs $799. The other costs $2.29 per hour and will make you feel considerably better about your Mac Mini purchase. Same architectural principle, considerably different ambitions — and as the numbers in this post will show, considerably different outcomes.</p>

<h3 id="grace-hopper--one-package-two-chips-one-memory-bus">Grace Hopper — One Package, Two Chips, One Memory Bus</h3>

<p>The GH200 is not a GPU. It’s a <strong>Grace Hopper Superchip</strong> — NVIDIA’s Grace CPU (72 ARM Neoverse V2 cores) and an H100 Hopper GPU die, connected on the same package via <strong>NVLink-C2C</strong> (chip-to-chip).</p>

<figure style="max-width:720px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/grace-hopper-chip-nvidia.png" alt="Architecture of the NVIDIA GH200 Grace Hopper Superchip showing Grace CPU and Hopper GPU connected via NVLink-C2C" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    NVIDIA GH200 Grace Hopper Superchip architecture. The Grace CPU (72 ARM Neoverse V2 cores, 480GB LPDDR5X)
    and Hopper GPU (H100, 96GB HBM3e) are connected on the same package via NVLink-C2C at 900 GB/s bidirectional —
    7× the bandwidth of PCIe Gen5. This chip-to-chip interconnect is what makes unified CPU+GPU memory a practical
    reality rather than a marketing claim.
    Source: <a href="https://www.nvidia.com/en-us/data-center/grace-hopper-superchip/">NVIDIA GH200 product page</a>.
  </figcaption>
</figure>

<p>The critical number is the <strong>NVLink-C2C bandwidth: 900 GB/s bidirectional</strong>. For context:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Interconnect              Bandwidth
──────────────────────────────────────────────
PCIe 5.0 x16 (discrete)   128 GB/s
NVLink-C2C (GH200)         900 GB/s    ← 7× faster than PCIe Gen5
HBM3e on-chip (GH200)    4,000 GB/s
LPDDR5X (Grace CPU)        512 GB/s
</code></pre></div></div>

<p><em>Source: <a href="https://www.nvidia.com/en-us/data-center/grace-hopper-superchip/">NVIDIA GH200 Grace Hopper Superchip product page</a> and <a href="https://www.amax.com/content/files/2023/12/NVIDIA_Grace_Hopper_Superchip_Architecture_Overview_Whitepaper.pdf">GH200 architecture whitepaper</a>. The 7× PCIe Gen5 figure is NVIDIA’s own stated specification — “NVLink-C2C delivers up to 900 GB/s total bandwidth. This is 7x higher bandwidth than x16 PCIe Gen5 lanes.”</em></p>

<p>On a conventional discrete GPU system — an A100 or H100 in a PCIe slot — the CPU and GPU have separate memory pools connected by a 128 GB/s bus. Moving data between them is expensive. The GPU can’t efficiently use CPU DRAM for KV cache overflow because PCIe bandwidth is 31× lower than the GPU’s on-chip HBM bandwidth (4,000 ÷ 128 = 31.25×). Any spill to CPU memory becomes a bottleneck.</p>

<p>On the GH200, the Grace CPU’s 480GB of LPDDR5X memory is accessible to the Hopper GPU at 900 GB/s over NVLink-C2C. That’s not as fast as on-chip HBM3e, but it’s fast enough to be genuinely useful. The result is a unified addressable memory space — the GPU sees up to 576GB total (96GB HBM3e + 480GB LPDDR5X) — at bandwidths that make overflow to CPU DRAM a viable architectural choice rather than a last resort.</p>
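<p>A quick sanity check on what those bandwidth figures mean in practice. The sketch below computes the time to move a hypothetical 10 GB batch of KV blocks across each link, using the peak numbers quoted above (real transfers achieve somewhat less):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Peak link bandwidths from the table above, in GB/s (GB = 1e9 bytes).
BANDWIDTH_GBPS = {
    "PCIe 5.0 x16": 128,
    "NVLink-C2C":   900,
    "HBM3e":        4000,
}

def transfer_ms(n_bytes: float, gbps: float) -> float:
    """Milliseconds to move n_bytes at gbps GB/s."""
    return n_bytes / (gbps * 1e9) * 1e3

spill = 10e9  # hypothetical 10 GB of KV blocks spilled to CPU DRAM
for link, bw in BANDWIDTH_GBPS.items():
    print(f"{link:14s} {transfer_ms(spill, bw):7.1f} ms")
</code></pre></div></div>

<p>At roughly 78 ms per 10 GB, PCIe turns every spill into a visible stall; at roughly 11 ms, NVLink-C2C keeps the same spill in the territory of a few decode steps.</p>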

<figure style="max-width:720px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/grace-hopper-chip-nvidia2.png" alt="Memory architecture comparison for LLM inference — discrete GPU PCIe vs GH200 NVLink-C2C" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    Left: discrete GPU (A100/H100 PCIe) — CPU and GPU have separate memory pools connected by 128 GB/s PCIe Gen5.
    KV cache overflow to CPU DRAM is 31× slower than on-chip HBM, making it unusable in practice.
    Right: GH200 Grace Hopper — CPU and GPU share a unified 576GB address space (96GB HBM3e + 480GB LPDDR5X)
    connected at 900 GB/s via NVLink-C2C. KV cache tiering to CPU memory becomes a viable architectural option,
    not a last resort. This is why llm-d's tiered KV offloading feature maps directly onto GH200 hardware.
  </figcaption>
</figure>

<h3 id="why-this-matters-for-llm-inference">Why This Matters for LLM Inference</h3>

<p>At scale, inference systems are typically <strong>memory-bandwidth bound</strong>, not compute bound. This isn’t universal — small models at low concurrency can be compute-bound during prefill, and some workloads shift the bottleneck elsewhere. But for production multi-tenant serving, the binding constraint is almost always memory: how fast the hardware can load model weights and KV cache tensors for each forward pass. Every decode step reads the full KV cache. On a 7B model with a 4K context window and 32 concurrent requests, that’s dozens of gigabytes moving through the memory subsystem per second — and the rate of that movement, not the number of CUDA cores, determines your throughput ceiling.</p>
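<p>The back-of-envelope arithmetic is worth doing once. A sketch, assuming Llama-2-7B-like shapes (32 layers, 32 KV heads, head dimension 128, fp16; GQA or quantization would shrink these numbers):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Assumed 7B model shapes -- adjust for the actual architecture.
layers, kv_heads, head_dim, dtype_bytes = 32, 32, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V planes

context, batch = 4096, 32
kv_read_per_step = kv_per_token * context * batch  # bytes touched per forward pass
weights = 7e9 * dtype_bytes                        # ~14 GB of fp16 weights, also read

hbm_bytes_per_s = 3.35e12  # H100 SXM HBM3 peak
ceiling = hbm_bytes_per_s / (kv_read_per_step + weights)  # decode steps/s

print(f"KV per token:      {kv_per_token / 1e6:.2f} MB")
print(f"KV read per step:  {kv_read_per_step / 1e9:.1f} GB")
print(f"bandwidth ceiling: {ceiling:.0f} steps/s, ~{ceiling * batch:.0f} tok/s")
</code></pre></div></div>

<p>Nearly 70 GB touched per decode step before a single FLOP matters. That ratio, not CUDA core count, is what sets the throughput ceiling.</p>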

<p>The GH200’s NVLink-C2C doesn’t solve this — HBM3e is still the primary memory bandwidth for active inference. But it changes the economics of KV cache management. Tiered KV storage (hot blocks in HBM, warm blocks in LPDDR5X, cold blocks evicted to NVMe) becomes viable in a way it isn’t on PCIe-connected systems. llm-d’s architecture diagram from the previous post already shows tiered KV cache offloading as a first-class feature — the GH200 is the hardware that makes that design practical.</p>
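<p>To make “tiered” concrete: an illustrative LRU placement policy, a sketch of the idea only and not llm-d’s actual offloading implementation, with toy tier capacities:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import OrderedDict

# Capacity in blocks per tier -- toy numbers for illustration.
TIERS = [("HBM3e", 4), ("LPDDR5X", 8), ("NVMe", 1_000_000)]

def place_blocks(access_order):
    """Assign each KV block to a tier: most recently used blocks go fastest."""
    lru = OrderedDict()
    for b in access_order:          # replay accesses; newest ends up last
        lru.pop(b, None)
        lru[b] = True
    ranked = list(reversed(lru))    # most recently used first
    placement, i = {}, 0
    for tier, cap in TIERS:
        for b in ranked[i:i + cap]:
            placement[b] = tier
        i += cap
    return placement

p = place_blocks([f"blk{i}" for i in range(14)] + ["blk0"])  # blk0 re-touched last
print(p["blk0"], p["blk13"], p["blk1"])  # blk0 stays hot; blk1 ages out to NVMe
</code></pre></div></div>

<p>Hot blocks stay in HBM, recently used ones ride in LPDDR5X at 900 GB/s, and cold ones age out to NVMe. On a PCIe system the middle tier is too slow to be worth the trip.</p>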

<p>For these experiments, we didn’t exercise KV cache tiering. The model is small and the cache pressure is low. But the architecture is there, and it’s why the GH200 was the right choice for running the full llm-d stack rather than a conventional discrete GPU setup.</p>

<hr />

<h2 id="one-mental-model-before-the-numbers">One Mental Model Before the Numbers</h2>

<p>Before looking at a single metric, here is the frame I’d encourage you to carry into any inference system analysis:</p>

<p><strong>LLM inference is a memory scheduling problem that looks like a compute problem.</strong></p>

<p>Every team I’ve seen optimise inference focuses first on compute — GPU utilization, batch size, model quantization. These matter. But the real leverage is in memory: how much of the KV cache is warm, how often prefill recomputation is avoided, how well the scheduler keeps the right data in the right tier. The system that wins at scale is the system that does the least unnecessary work — and unnecessary prefill recomputation is the biggest source of waste in multi-tenant inference.</p>

<p>This is what EPP prefix cache routing is solving. Not “make the GPU faster.” Make the system do less work.</p>

<hr />

<h2 id="what-these-experiments-actually-test">What These Experiments Actually Test</h2>

<p>Experiments 1–4 use the <code class="language-plaintext highlighter-rouge">inference-scheduling</code> guide from the llm-d repo. This deploys <strong>a single decode pod that handles both prefill and decode</strong> — aggregated serving. The values.yaml is unambiguous:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">prefill</span><span class="pi">:</span>
  <span class="na">create</span><span class="pi">:</span> <span class="no">false</span>   <span class="c1"># one pod does everything</span>
</code></pre></div></div>

<p>The architecture is:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Every request →  Istio Gateway
                      │
                      ▼
                 EPP (Endpoint Picker Plugin)
                 ┌────────────────────────────────┐
                 │  prefix-cache-scorer           │
                 │  queue-depth-scorer            │
                 │  kv-utilization-scorer         │
                 └────────────────────────────────┘
                      │
                      ▼
               Single vLLM decode pod
               (prefill + decode, one process)
               KV cache lives here
</code></pre></div></div>

<p>With one pod, there is no cross-pod routing decision to make. The EPP’s job here is narrower: route requests with matching prefix hashes back to this pod consistently, so its KV cache stays warm. Round-robin with one pod is also consistent routing — but a naive load balancer doesn’t know about prefix hashes, so it can’t make cache-aware decisions when you add a second pod later.</p>
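<p>The mechanics behind “prefix hashes” are worth sketching. This is illustrative only, not the actual EPP scorer code, but it captures the idea: hash the prompt in fixed-size token blocks, chain the hashes so each one identifies an entire prefix, and score a pod by how many leading blocks it already holds warm:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import hashlib

BLOCK = 64  # tokens per hash block, vLLM-style prefix-caching granularity

def block_hashes(tokens):
    """Chained hash per full block -- each hash identifies the entire prefix."""
    hashes, prev = [], b""
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h = hashlib.sha256(prev + repr(tokens[i:i + BLOCK]).encode()).hexdigest()
        hashes.append(h)
        prev = h.encode()
    return hashes

def prefix_score(prompt_tokens, pod_cache):
    """How many leading blocks the pod already holds warm."""
    score = 0
    for h in block_hashes(prompt_tokens):
        if h not in pod_cache:
            break
        score += 1
    return score

tenant = list(range(200))          # stands in for a tokenized system prompt
pod_a = set(block_hashes(tenant))  # pod A has served this tenant before
print(prefix_score(tenant + [7, 8, 9], pod_a))    # all full blocks warm
print(prefix_score(list(range(500, 700)), pod_a)) # cold prompt, no match
</code></pre></div></div>

<p>The same tenant’s system prompt always yields the same leading hashes, which is why sticky tenant sessions translate directly into cache hits, and why a balancer that never computes these hashes can’t preserve them once a second pod appears.</p>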

<p><strong>What this experiment answers:</strong> Does EPP prefix-cache-aware routing actually maintain high cache utilization under realistic multi-tenant load? Does the system hold TTFT stable as concurrency grows?</p>

<p><strong>What it doesn’t answer yet:</strong> What happens when you separate prefill and decode onto dedicated pods. That’s the next post.</p>

<p>This single-pod result is the baseline. Everything in the P/D disaggregation post gets measured against what we establish here.</p>

<hr />

<h2 id="the-load-test--simulating-multi-tenant-traffic">The Load Test — Simulating Multi-Tenant Traffic</h2>

<p>The Locust script simulates two traffic types in a 4:1 ratio:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># llmd-locust.py
</span><span class="n">MODEL</span> <span class="o">=</span> <span class="s">"Qwen/Qwen3-0.6B"</span>

<span class="c1"># Three tenant personas — each with a distinct long system prompt
</span><span class="n">TENANTS</span> <span class="o">=</span> <span class="p">[</span>
    <span class="s">"You are a financial analyst assistant specializing in market data. "</span> <span class="o">*</span> <span class="mi">5</span><span class="p">,</span>
    <span class="s">"You are a DevOps engineer assistant specializing in Kubernetes and CI/CD. "</span> <span class="o">*</span> <span class="mi">5</span><span class="p">,</span>
    <span class="s">"You are a data scientist assistant specializing in ML pipelines. "</span> <span class="o">*</span> <span class="mi">5</span><span class="p">,</span>
<span class="p">]</span>

<span class="k">class</span> <span class="nc">LLMDUser</span><span class="p">(</span><span class="n">HttpUser</span><span class="p">):</span>
    <span class="n">wait_time</span> <span class="o">=</span> <span class="n">between</span><span class="p">(</span><span class="mf">0.5</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">on_start</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">tenant_prompt</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">TENANTS</span><span class="p">)</span>  <span class="c1"># sticky for session
</span>
    <span class="o">@</span><span class="n">task</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">tenant_request</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="c1"># Repeat system prompt → cache hit opportunity
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">client</span><span class="p">.</span><span class="n">post</span><span class="p">(</span><span class="s">"/v1/chat/completions"</span><span class="p">,</span> <span class="n">json</span><span class="o">=</span><span class="p">{</span>
            <span class="s">"model"</span><span class="p">:</span> <span class="n">MODEL</span><span class="p">,</span>
            <span class="s">"messages"</span><span class="p">:</span> <span class="p">[</span>
                <span class="p">{</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"system"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">tenant_prompt</span><span class="p">},</span>
                <span class="p">{</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"user"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">([</span>
                    <span class="s">"Summarize the key points."</span><span class="p">,</span>
                    <span class="s">"What should I focus on?"</span><span class="p">,</span>
                    <span class="s">"Give me 3 recommendations."</span><span class="p">,</span>
                <span class="p">])}</span>
            <span class="p">],</span>
            <span class="s">"max_tokens"</span><span class="p">:</span> <span class="mi">80</span><span class="p">,</span>
        <span class="p">},</span> <span class="n">name</span><span class="o">=</span><span class="s">"tenant_session"</span><span class="p">)</span>

    <span class="o">@</span><span class="n">task</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">cold_request</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="c1"># No system prompt → cold cache, no prefix benefit
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">client</span><span class="p">.</span><span class="n">post</span><span class="p">(</span><span class="s">"/v1/chat/completions"</span><span class="p">,</span> <span class="n">json</span><span class="o">=</span><span class="p">{</span>
            <span class="s">"model"</span><span class="p">:</span> <span class="n">MODEL</span><span class="p">,</span>
            <span class="s">"messages"</span><span class="p">:</span> <span class="p">[{</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"user"</span><span class="p">,</span>
                          <span class="s">"content"</span><span class="p">:</span> <span class="sa">f</span><span class="s">"Question </span><span class="si">{</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1000</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">}],</span>
            <span class="s">"max_tokens"</span><span class="p">:</span> <span class="mi">50</span><span class="p">,</span>
        <span class="p">},</span> <span class="n">name</span><span class="o">=</span><span class="s">"cold_request"</span><span class="p">)</span>
</code></pre></div></div>

<p>The design is deliberate. Each simulated user picks a tenant persona at startup and sticks with it — the same ~200-token system prompt on every request. This is representative of real multi-tenant SaaS inference: each customer has a system prompt that defines their product’s persona, and it’s identical on every call. The cold_request tasks (1 in 5) have no system prompt and get no cache benefit — they’re the baseline comparison within the same run.</p>

<p>Requests travel the full path: Mac → SSH tunnel → <code class="language-plaintext highlighter-rouge">kubectl port-forward</code> → Istio gateway → EPP → vLLM pod. No shortcuts.</p>

<hr />

<h2 id="the-results">The Results</h2>

<h3 id="locust--26826-requests-zero-failures">Locust — 26,826 Requests, Zero Failures</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Hardware:  GH200 480GB (single decode pod, aggregated serving)
Model:     Qwen/Qwen3-0.6B
Rate:      32 req/s sustained

Task              Requests    Failures   p50    p95    p99    avg
───────────────────────────────────────────────────────────────────
cold_request      5,450       0 (0%)     200ms  270ms  280ms  207ms
tenant_session    21,376      0 (0%)     260ms  330ms  350ms  273ms
───────────────────────────────────────────────────────────────────
Aggregated        26,826      0 (0%)     260ms  330ms  350ms  260ms
</code></pre></div></div>

<p>Zero failures across 26,826 requests. p99 at 350ms. The system didn’t flinch.</p>

<p><strong>The comparison that matters</strong> — same Locust script structure, same traffic intent, from <a href="/llm-infrastructure/inference/2026/04/16/vllm-ollama-apple-silicon-experiment2.html">the Mac Mini post</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Mac Mini vLLM (5 users, 3:1 short/long):
  short prompt p50:   1,900ms
  long prompt  p50:  31,000ms
  TTFT P99:          ~7,000ms under load

llm-d on GH200 (tenant simulation, 4:1 tenant/cold):
  tenant_session p50:   260ms    ← 7.3× faster at median
  cold_request   p50:   200ms
  TTFT P99:              34ms    ← ~200× better tail latency
</code></pre></div></div>

<p>The gap is real, but it deserves honest attribution. Part of it is raw hardware — a GH200 is not an M4 Mac Mini. Part of it is the serving stack — llm-d with EPP routing vs vanilla vLLM. And part of it is the traffic shape — these aren’t identical experiments. What the numbers establish is a clear direction: hardware matters, but the serving architecture amplifies or squanders what the hardware can do.</p>

<hr />

<h2 id="what-grafana-showed">What Grafana Showed</h2>

<h3 id="the-performance-dashboard--read-this-one-first">The Performance Dashboard — Read This One First</h3>

<figure style="max-width:900px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/g3-llm-d-performance-dashboard.png" alt="llm-d Performance Dashboard — TTFT p50 15ms, p95 19ms, KV Cache Hit Rate 80.6%, Request Throughput 28.8 req/s, ITL p50 5ms" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    llm-d Performance Dashboard during the Locust run. Top-left: TTFT p50 at 15ms, p95 at 19ms — flat throughout.
    Top-right: ITL p50 at 5ms, p95 at 9.5ms — decode speed is stable.
    Middle: KV Cache Hit Rate at <strong>80.6%</strong>, Per-Pod at 80.6%.
    Bottom: Request Throughput 28.8 req/s, Request Queue showing active batching without buildup.
  </figcaption>
</figure>

<p>Three numbers worth pausing on:</p>

<p><strong>TTFT p50 at 15ms, flat under load.</strong> The Mac Mini showed TTFT climbing past 2 seconds at p50 and 7 seconds at p99 under comparable concurrency. Here it stays at 15ms throughout. This is not just hardware — it’s what happens when 80% of your requests skip prefill entirely because their KV blocks are already warm. The system is doing less work per request on average, which is why latency stays stable as concurrency grows.</p>

<p>A common mistake in inference optimisation is treating TTFT stability as a tuning parameter — something you achieve by adjusting batch sizes, memory utilization settings, or scheduler parameters. It isn’t. <strong>TTFT stability under load is an architectural property.</strong> It follows from having enough cache hit rate to keep the average prefill cost low. Once cache hit rate drops below ~50%, no amount of tuning recovers the tail latency. The right intervention is upstream: better routing, larger KV cache budgets, or separation of prefill and decode workloads.</p>

<p><strong>KV Cache Hit Rate at 80.6%.</strong> This is the single most actionable metric in multi-tenant inference. It tells you how much work the system is not doing. At 80.6%, roughly 4 in 5 tenant requests reuse cached KV blocks and skip prefill recomputation. The inverse is the expensive number: 19.4% of requests are cold — those are full prefill operations. In a system whose traffic would cost 100 GPU-hours of prefill per day against a cold cache, an 80% hit rate leaves 20 GPU-hours of executed prefill; raising it to 90% halves that to 10. That’s not a performance metric. That’s a cost metric.</p>
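<p>The arithmetic, reading the 100 GPU-hours as the prefill cost the day’s traffic would incur against a fully cold cache (an assumed baseline, for illustration):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Assumed: the day's traffic would cost 100 GPU-hours of prefill at 0% hit rate.
FULL_PREFILL_GPU_HOURS = 100.0

def prefill_hours(hit_rate: float) -> float:
    """GPU-hours of prefill actually executed -- only cache misses pay."""
    return FULL_PREFILL_GPU_HOURS * (1 - hit_rate)

for hr in (0.80, 0.90):
    print(f"hit rate {hr:.0%}: {prefill_hours(hr):.0f} GPU-hours of prefill executed")
</code></pre></div></div>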

<p><strong>ITL at 5ms.</strong> Inter-Token Latency is the gap between consecutive tokens during decode — a direct measure of decode throughput. At 5ms per token, that’s 200 tokens per second per request. More importantly it’s flat — it doesn’t increase as the test runs, which confirms there’s no memory pressure or scheduler thrashing affecting the decode phase. When ITL climbs under load, it’s usually a sign that the KV cache is filling and evictions are occurring. Here it’s not.</p>

<hr />

<h3 id="prefix-cache-hit-rate--the-number-behind-the-number">Prefix Cache Hit Rate — The Number Behind the Number</h3>

<figure style="max-width:900px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/g2-llm-d-diagnostic-prefix-cache.png" alt="llm-d Diagnostic Drill-Down — Prefix Cache Hit Rate 81.1%, Per-Instance Hit Rate holding steady at 80%, Decode Worker Utilization, Token Distribution" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    Diagnostic Drill-Down. Middle: Prefix Cache Hit Rate at <strong>81.1%</strong>, Per-Instance line steady at 80%
    throughout the test — not a burst artifact, sustained. Top: Routing section showing near-zero Idle GPU Time
    (GPU is fully utilised) and consistent Token Distribution. Bottom: P/D Disaggregation section showing
    Decode Worker Utilization — the single pod handling all prefill and decode work.
  </figcaption>
</figure>

<p>The 81.1% gauge is the aggregate. The Per-Instance time series at 80% is the more meaningful signal — it shows the cache was warm within the first few minutes and held that level for the entire test duration. This is what stable prefix cache routing looks like: not a spike that decays, but a plateau that holds.</p>

<p><strong>What this means at scale:</strong> At three tenant profiles with ~200-token system prompts, 81% cache hit rate is achievable on a single pod with a small model. As you scale to 50 tenants, 500, or 5000, the picture changes. The working set of system prompts grows beyond what a single pod’s KV cache can hold. Cache hit rate degrades. TTFT rises. This isn’t a failure of EPP routing — it’s a KV cache capacity problem, and the architectural response is either larger KV budgets per pod, more pods with affinity-based routing, or tiered KV offloading. The GH200’s NVLink-C2C is exactly the hardware that makes that third option viable. This experiment doesn’t exercise it. The architecture supports it.</p>
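<p>A rough scaling sketch, assuming Qwen3-0.6B-like KV shapes (28 layers, 8 KV heads, head dimension 128, fp16) and a hypothetical 20 GB per-pod KV budget:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Assumed model shapes and budget -- illustrative, not measured from the cluster.
layers, kv_heads, head_dim, dtype_bytes = 28, 8, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # bytes, K and V

SYSTEM_PROMPT_TOKENS = 200
KV_BUDGET_GB = 20  # hypothetical per-pod KV cache budget

def tenants_that_fit(budget_gb):
    """How many tenants' system-prompt prefixes stay resident at once."""
    return int(budget_gb * 1e9 // (SYSTEM_PROMPT_TOKENS * kv_per_token))

print(f"KV per tenant prefix: {SYSTEM_PROMPT_TOKENS * kv_per_token / 1e6:.1f} MB")
print(f"tenants resident in {KV_BUDGET_GB} GB: {tenants_that_fit(KV_BUDGET_GB)}")
</code></pre></div></div>

<p>Three tenant prefixes are nowhere near the cliff; a few hundred are. Past that point the prefixes start evicting each other and the hit rate decays no matter how well the router scores.</p>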

<hr />

<h3 id="e2e-latency-and-scheduler-state">E2E Latency and Scheduler State</h3>

<figure style="max-width:900px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/g4-llm-d-vllm-overview-e2e.png" alt="llm-d vLLM Overview — E2E p50 150ms, Token Throughput 2500 tps, ITL flat, TTFT 15-33ms, Scheduler State stable, Cache Utilization 0.15%" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    vLLM Overview dashboard. Top-left: E2E Request Latency p50 at 150ms, p95 at 260ms, p99 at 300ms, all flat.
    Top-right: Token Throughput at ~2,500 tps combined (prompt + generation).
    Middle-right: Scheduler State — Num Running and Num Waiting both stable, no queue buildup.
    Bottom-right: Cache Utilization at 0.15% — the 0.6B model leaves the entire KV pool available.
  </figcaption>
</figure>

<p>The Scheduler State panel deserves attention. <code class="language-plaintext highlighter-rouge">Num Running</code> is the active batch size — how many sequences share a GPU forward pass. <code class="language-plaintext highlighter-rouge">Num Waiting</code> is the queue depth. Both staying low and flat means continuous batching is absorbing load without queuing. Compare this to the Mac Mini Locust test where <code class="language-plaintext highlighter-rouge">Num Waiting</code> spiked to 5 and TTFT degraded proportionally.</p>

<p>Cache Utilization at 0.15% is a function of model size. Qwen3-0.6B at this concurrency level barely touches the KV pool. The same test with Llama-3-8B would show a fundamentally different curve — and that’s where the GH200’s memory architecture starts mattering in ways that go beyond raw numbers.</p>

<hr />

<h3 id="prefill-vs-decode--the-two-phase-separation-made-visible">Prefill vs Decode — The Two-Phase Separation Made Visible</h3>

<figure style="max-width:900px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/g5-llm-d-vllm-overview-prefill-decode.png" alt="llm-d vLLM Overview scrolled — Requests Prefill and Decode Time showing prefill dropping as cache warms, decode flat throughout, Request Prompt Length heatmap, Queue Time near zero" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    vLLM Overview lower panels. Bottom-left: <strong>Requests Prefill and Decode Time</strong> — yellow is prefill,
    green is decode. Prefill drops as the cache warms during the first minutes of the test, then stabilises.
    Decode stays flat throughout. Queue Time near zero. Top panels: Request Prompt Length heatmap showing
    two clusters — short cold requests and longer tenant prompts.
  </figcaption>
</figure>

<p>This panel is the most architecturally revealing one in the dashboard. The yellow prefill line and the green decode line on the same chart, from the same pod, tell you everything about why P/D disaggregation exists.</p>

<p>Prefill is compute-bound and bursty. Its cost scales with prompt length. It spikes when a cold request arrives. Decode is memory-bandwidth-bound and steady. Its cost scales with the number of tokens being generated. On the same pod, every prefill spike steals GPU time from active decode sequences — those sequences stall mid-generation while the prefill runs.</p>

<p>In a disaggregated setup, the yellow line comes from a prefill pod pool and the green line from a decode pod pool. Prefill spikes are isolated. Decode runs uninterrupted. TTFT for long prompts no longer delays short prompts waiting in the decode queue.</p>
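<p>A toy timeline makes the interference concrete. The numbers are illustrative rather than measured, and real vLLM schedulers soften the effect with techniques like chunked prefill, but the shape of the problem survives:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def token_times(decode_ms=5.0, n_tokens=20, prefill_at=10, prefill_ms=120.0,
                shared=True):
    """Emission time of each decode token; on a shared pod an arriving
    request's prefill preempts the in-flight stream."""
    t, times = 0.0, []
    for i in range(n_tokens):
        if shared and i == prefill_at:
            t += prefill_ms  # pod runs the newcomer's prefill first
        t += decode_ms
        times.append(t)
    return times

def worst_itl(times):
    """Largest gap between consecutive tokens."""
    return max(b - a for a, b in zip(times, times[1:]))

print(f"worst ITL, shared pod:  {worst_itl(token_times(shared=True)):.0f} ms")
print(f"worst ITL, decode-only: {worst_itl(token_times(shared=False)):.0f} ms")
</code></pre></div></div>

<p>With these toy numbers, the stream stalls for 125 ms mid-generation on the shared pod and never exceeds 5 ms on the decode-only pod. That spike in an otherwise flat stream is what the yellow line stealing from the green line looks like from the user’s side of the connection.</p>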

<p>Here those lines share a pod. The system works well at this scale and concurrency — the numbers prove it. But the architectural tension is visible in the chart. That’s what the next post is about.</p>

<hr />

<h3 id="ttft-p99--the-warmup-signature">TTFT P99 — The Warmup Signature</h3>

<figure style="max-width:900px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/g6-llm-d-ttft-p99.png" alt="llm-d Failure and Saturation — TTFT P99 starting at 39ms, dropping to 32ms minimum, stabilising at 34ms for the remainder" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    TTFT P99 over 30 minutes from the Failure and Saturation dashboard. Starts at 39ms (cold cache),
    drops to 32ms minimum as KV blocks accumulate, stabilises at ~34ms for the remainder of the test.
    P99 settling — not drifting upward — is the signal of a system that has reached steady state.
    For context: the Mac Mini TTFT P99 was climbing past 7 seconds under similar concurrent load.
  </figcaption>
</figure>

<p>TTFT P99 dropping from 39ms to 32ms and then holding at 34ms is the cache warmup signature in the tail. The first requests from each tenant session are cold — full prefill. As sessions accumulate, KV blocks warm, and the P99 reflects that. The key signal is the plateau: P99 stops dropping once the cache is warm and doesn’t creep upward under sustained load. This is a system that has found equilibrium.</p>

<hr />

<h3 id="model-throughput-and-queue-state">Model Throughput and Queue State</h3>

<figure style="max-width:900px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/g1-llm-d-diagnostic-throughput.png" alt="llm-d Diagnostic Drill-Down — Model Throughput 4000-5000 tps, Request Queue near zero, KV Cache Utilization 0.15%, Queue Utilization brief spikes draining immediately" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    Diagnostic Drill-Down model serving panels. Model Throughput at 4,000–5,000 tps during the Locust run.
    Request Queue Lengths near zero — requests are served without queuing. Queue Utilization shows brief spikes
    during burst arrivals that drain immediately — continuous batching absorbing load as designed.
    KV Cache Utilization at 0.15%.
  </figcaption>
</figure>

<p>Model Throughput at 4,000–5,000 tokens per second is what a GH200 looks like under moderate load with a small model. The brief Queue Utilization spikes that drain immediately are continuous batching doing its job — burst arrivals get absorbed into the current decode step rather than queuing. This is the architectural behaviour that separates vLLM from naive sequential servers.</p>

<hr />

<h2 id="the-complete-benchmark-reference">The Complete Benchmark Reference</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Hardware:  Lambda Labs GH200 480GB (Grace Hopper)
Model:     Qwen/Qwen3-0.6B
Stack:     llm-d v0.4.0, single decode pod (prefill: create: false)
           EPP: prefix-cache-scorer + queue-scorer + kv-utilization-scorer

─── Locust ───────────────────────────────────────────────────────────
  Total requests:   26,826     Failures:    0 (0%)
  Sustained rate:   32 req/s

  Task             p50    p95    p99    avg    req/s
  tenant_session:  260ms  330ms  350ms  273ms  25.5
  cold_request:    200ms  270ms  280ms  207ms   6.5
  Aggregated:      260ms  330ms  350ms  260ms  32.0

─── Grafana — Performance Dashboard ──────────────────────────────────
  TTFT p50:                 15ms
  TTFT p95:                 19ms
  TTFT P99 (settled):       ~34ms
  ITL p50:                  5ms   (≈ 200 tok/s)
  ITL p95:                  9.5ms
  KV Cache Hit Rate:        80.6% (gauge) / 81.1% (drill-down)
  Request Throughput:       28.8 req/s
  Model Throughput:         4,000–5,000 tps peak
  Cache Utilization:        0.15%

─── vs Mac Mini (Post 2, same script structure) ──────────────────────
  tenant p50:  Mac Mini 1,900ms  →  GH200 260ms   (7.3× faster)
  TTFT P99:    Mac Mini ~7,000ms →  GH200 34ms    (~200× better)
</code></pre></div></div>

<hr />

<h2 id="what-this-proves--and-what-it-doesnt">What This Proves — and What It Doesn’t</h2>

<p><strong>What it proves:</strong></p>

<p>EPP prefix cache routing achieves 81.1% cache hit rate under realistic multi-tenant load. That’s not a configuration artifact — it’s the result of the EPP scoring prefix hashes and routing tenant sessions to the pod holding their warm KV blocks. The system holds TTFT at 15ms p50 and 34ms p99 under 32 req/s sustained, with zero failures.</p>
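The routing decision itself is a weighted combination of per-pod signals. A toy scorer, with invented weights and signal values (the real logic lives in the EPP's prefix-cache, queue, and kv-utilization scorer plugins):

```shell
# Toy endpoint scorer, not the llm-d EPP implementation. Higher is better:
# reward a prefix-cache hit, penalise queue depth and KV utilization.
score() {
  awk -v hit="$1" -v queue="$2" -v kv="$3" \
    'BEGIN { printf "%.2f", 2.0 * hit - 0.5 * queue - 1.0 * kv }'
}
A=$(score 1.0 2 0.15)   # pod A: holds the tenant's warm prefix, short queue
B=$(score 0.0 0 0.05)   # pod B: idle but cold
echo "pod A: $A, pod B: $B -> route to pod A"
```

The point the sketch makes: a pod with a warm prefix wins even against an idle pod, because skipping prefill is worth more than an empty queue.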

<p>More broadly: in multi-tenant inference systems with repeated system prompts, <strong>cache hit rate is typically the highest-leverage optimisation variable.</strong> GPU utilization, batch size, and quantization all matter — but they reduce the cost of work the system is already doing. Cache hit rate determines how much work gets skipped entirely. At 81%, the system is doing roughly 5× less prefill work than a round-robin deployment that scatters the same tenant’s requests across pods with cold caches. That ratio is hard to match through hardware improvements alone.</p>

<p><strong>What it doesn’t prove:</strong></p>

<p>This is a single pod, a small model, and three tenant profiles. The working set fits comfortably in the KV cache — hence 0.15% utilization. Real multi-tenant deployments have hundreds or thousands of tenant profiles. As the working set grows, cache hit rate degrades. The EPP routing remains correct, but the pod’s KV cache can’t hold every tenant’s prefix simultaneously. The responses to that problem — larger KV allocations, more pods with affinity routing, tiered KV offloading to the GH200’s LPDDR5X — are architectural decisions that require knowing the working set size and access pattern distribution. This experiment establishes that the routing mechanism works. Capacity planning is a separate problem.</p>

<hr />

<h2 id="what-most-teams-get-wrong">What Most Teams Get Wrong</h2>

<p>Most inference optimisation work focuses on the <strong>supply side</strong> — faster hardware, more efficient models, better batching. This is necessary but insufficient. The demand side — <strong>how requests are shaped and routed before they reach the GPU</strong> — is where the real leverage lives at scale.</p>

<p>The dominant optimisation lever depends on where your system actually sits. Before reaching for more hardware, diagnose the bottleneck:</p>

<table>
  <thead>
    <tr>
      <th>Scenario</th>
      <th>Primary Bottleneck</th>
      <th>Highest-Leverage Fix</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Low cache hit rate (&lt;50%)</td>
      <td>Prefill recomputation</td>
      <td>Routing + request shaping</td>
    </tr>
    <tr>
      <td>High cache hit, low utilization</td>
      <td>Scheduling inefficiency</td>
      <td>Continuous batching config</td>
    </tr>
    <tr>
      <td>High utilization, rising latency</td>
      <td>Memory bandwidth</td>
      <td>Better hardware / parallelism</td>
    </tr>
    <tr>
      <td>High working set (many tenants)</td>
      <td>KV cache capacity</td>
      <td>Tiered cache / pod sharding</td>
    </tr>
  </tbody>
</table>

<p>Most teams misdiagnose their bottleneck and optimise the wrong layer. A team with a 30% cache hit rate buying faster GPUs is solving the wrong problem — they’ll run twice as fast through twice as much unnecessary prefill work. The 81% cache hit rate in these experiments means the system is doing roughly 5× less prefill than a naive round-robin deployment on identical hardware. No GPU upgrade achieves that ratio. Routing does.</p>

<p>The practical entry point: instrument your cache hit rate first. If it’s below 50%, the fix is upstream — consistent system prompts, session-affinity routing, and a scheduler that knows about prefix hashes. Hardware comes after you’ve exhausted the routing lever.</p>
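Instrumenting the hit rate is just dividing two counters. vLLM exposes prefix-cache queries and hits as Prometheus counters (exact metric names vary by version); the arithmetic, using illustrative totals chosen to reproduce this run's figure:

```shell
# Hit rate from two counters. These totals are illustrative stand-ins;
# on a live system they come from vLLM's Prometheus metrics endpoint.
queries_total=26826
hits_total=21620
hit_rate=$(awk -v h="$hits_total" -v q="$queries_total" \
  'BEGIN { printf "%.1f", 100 * h / q }')
echo "prefix cache hit rate: ${hit_rate}%"   # below ~50%? fix routing first
```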

<p><strong>A non-obvious failure mode: high aggregate cache hit rate masking per-tenant unfairness.</strong></p>

<p>A subtle failure appears when cache hit rate is high but unevenly distributed across tenants. If a small number of tenants dominate traffic, their KV blocks stay hot while long-tail tenants constantly miss the cache. The system reports a healthy aggregate hit rate — 80%, 81% — but tail latency degrades because cold tenants always pay full prefill cost.</p>

<p>This leads to a misleading conclusion: “the cache is working.” In reality, the system is biased toward high-frequency tenants. The aggregate metric looks healthy precisely because the popular tenants pull it up — while every new or low-frequency tenant experiences the system as if caching doesn’t exist.</p>

<p>In these experiments, three tenant profiles at similar frequencies produced clean aggregate numbers. Production systems with hundreds of tenants and a power-law access distribution will not. Fixing uneven cache efficiency requires either per-tenant cache accounting to surface the distribution, or explicit isolation of long-tail traffic to prevent it from competing with high-frequency prefixes for the same KV blocks. Without this, cache efficiency improves averages while silently degrading fairness — which is the kind of problem that shows up in p99 SLA breaches, not in dashboard summaries.</p>
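A sketch of what per-tenant accounting surfaces, on invented log data: the aggregate number looks healthy while one tenant never hits the cache.

```shell
# Invented request log: tenant id and cache outcome per request.
log='tenantA hit
tenantA hit
tenantA hit
tenantA hit
tenantB hit
tenantB hit
tenantB hit
tenantC miss
tenantC miss'
report=$(echo "$log" | awk '
  { total[$1]++; all++ }
  $2 == "hit" { hits[$1]++; allhits++ }
  END {
    printf "aggregate: %.0f%%\n", 100 * allhits / all
    for (t in total) printf "%s: %.0f%%\n", t, 100 * (hits[t] + 0) / total[t]
  }' | sort)
echo "$report"   # aggregate looks fine; tenantC pays full prefill every time
```

A dashboard showing only the aggregate line would report a healthy cache while tenantC's p99 quietly breaches SLA.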

<hr />

<h2 id="connecting-the-dots-across-this-series">Connecting the Dots Across This Series</h2>

<p>This series started on an M4 Mac Mini with a 400MB model and a Python script measuring TTFT. It’s worth pausing to trace what the numbers across all three posts are actually saying — because they’re saying the same thing at different scales.</p>

<p><strong>Post 1 established the theory:</strong> decode is memory-bandwidth bound. Prefill competes with decode for the same resources. KV cache management is the central architectural problem in inference. These aren’t vLLM-specific observations — they’re properties of the transformer architecture itself. They hold on any hardware, any framework.</p>

<p><strong>Post 2 ran two experiments on identical hardware.</strong> Ollama vs vLLM on the same M4 chip, same unified memory, same model family. The result — 14,062ms vs 6,543ms p50, <strong>2.15× faster</strong> — had nothing to do with the hardware. Ollama processes requests sequentially. vLLM uses continuous batching. Same silicon, 2× throughput difference from a software architecture decision. That result matters because it isolates the variable: serving architecture, not hardware.</p>

<p>Then under load, even vLLM on the Mac Mini hit the ceiling. Long prompts drove TTFT to 31 seconds at p50. The Mac Mini’s 16GB unified memory pool — shared between model weights, KV cache, and OS — ran out of headroom. The principle from Post 1 materialised as a real number: when the KV cache competes for the same memory as the model weights, concurrency suffers.</p>

<p><strong>This post added two more variables: real GPU hardware and intelligent routing.</strong> The GH200 with NVLink-C2C at 900 GB/s changes the memory economics — the GPU can address 480GB at viable bandwidth, not just 16GB. EPP prefix cache routing adds the third variable: the system avoids prefill work entirely for 81% of requests by keeping KV blocks warm.</p>

<p>The result is 260ms p50 and 34ms P99 at 32 req/s. But attributing that entirely to the GH200 would be wrong — and that’s the point. The Ollama experiment on the Mac Mini already proved that hardware alone doesn’t determine the outcome. The GH200 sets a much higher ceiling. EPP routing determines how close you get to it.</p>

<p><strong>The through-line:</strong> hardware sets the ceiling. Serving architecture determines how close you get to it. This has been true at every scale in this series — a $0 Mac Mini, a $2.29/hr GH200, and everything in between. The engineers who understand this spend their optimisation budget on routing and request shaping first, and on hardware second. The engineers who don’t buy more GPUs and wonder why the numbers don’t improve proportionally.</p>

<hr />

<h2 id="what-comes-next">What Comes Next</h2>

<p>The prefill vs decode time panel in this post showed two lines on the same chart — prefill varying with cache state, decode flat throughout. On this single pod they share a GPU. A prefill spike for a long-prompt request steals compute from ongoing decode sequences. The system handles it at this concurrency. It wouldn’t at 10×.</p>

<p>The next post deploys the P/D disaggregation guide on a second GH200 instance — separate prefill and decode pods, NIXL sidecar for KV cache transfer between them. The setup, the results, and the honest account of what P/D disaggregation actually requires in terms of hardware are all in one post.</p>

<p>The question it answers: given the baseline established here — 81.1% cache hit rate, 260ms p50, 34ms P99 — does separating prefill and decode onto dedicated pods move those numbers, or have we already captured most of the available gain on a single aggregated pod? The answer is more nuanced than either “yes it’s better” or “no it’s not worth it” — and the hardware constraint that makes it nuanced is one the GH200’s architecture makes visible in a way discrete GPU systems don’t.</p>

<hr />

<p><em>Experiments run on Lambda Labs GH200 480GB, llm-d v0.4.0, Qwen3-0.6B, vllm/vllm-openai:latest. Platform engineer with 11+ years in distributed systems going deep on LLM serving infrastructure.</em></p>

<p><em><a href="https://github.com/kraghavan">GitHub</a> · <a href="https://linkedin.com/in/karthikaraghavan">LinkedIn</a></em></p>]]></content><author><name>Karthika Raghavan</name></author><category term="llm-infrastructure" /><category term="inference" /><category term="llm-d" /><category term="epp" /><category term="prefix-cache" /><category term="kubernetes" /><category term="vllm" /><category term="gpu" /><category term="gh200" /><category term="grace-hopper" /><category term="locust" /><category term="prometheus" /><category term="grafana" /><category term="benchmarks" /><category term="inference-optimization" /><summary type="html"><![CDATA[The stack is deployed. Now let's see what it actually does. EPP prefix cache routing, 81.1% KV cache hit rate, TTFT at 15ms p50, and what those numbers mean for teams building multi-tenant inference at scale.]]></summary></entry><entry><title type="html">Deploying llm-d on a Cloud GPU — The 10 Things Nobody Tells You</title><link href="https://kraghavan.github.io/llm-infrastructure/inference/2026/04/17/vllm-llm-d-nvidia-gh200-experiment.html" rel="alternate" type="text/html" title="Deploying llm-d on a Cloud GPU — The 10 Things Nobody Tells You" /><published>2026-04-17T00:00:00+00:00</published><updated>2026-04-17T00:00:00+00:00</updated><id>https://kraghavan.github.io/llm-infrastructure/inference/2026/04/17/vllm-llm-d-nvidia-gh200-experiment</id><content type="html" xml:base="https://kraghavan.github.io/llm-infrastructure/inference/2026/04/17/vllm-llm-d-nvidia-gh200-experiment.html"><![CDATA[<p>Let me set expectations before we start.</p>

<p>I had done <a href="/llm-infrastructure/inference/2026/04/16/vllm-ollama-apple-silicon-experiment2.html">the Mac Mini experiments</a>. I had real benchmark numbers. I understood the theory. I knew what prefill and decode were, I could explain PagedAttention at a whiteboard, and I had wired up Prometheus and Grafana from scratch without breaking anything important.</p>

<p>Then I tried to deploy llm-d on a cloud GPU and spent the better part of a weekend staring at <code class="language-plaintext highlighter-rouge">permission denied</code>, <code class="language-plaintext highlighter-rouge">No such file or directory</code>, <code class="language-plaintext highlighter-rouge">image pull failed</code>, and other messages that are technically informative and emotionally frustrating. I’ll be candid — I walked away from the terminal twice while the instance meter kept running. GPU rental has a way of focusing the mind.</p>

<p>This post is the deployment war story. Every broken thing, in roughly the order it broke. I’m not sugarcoating the experience or offering a polished “here are the steps” — that promise already exists in the official docs, which assume a level of environmental cooperation and prior experience that I did not have. What the docs don’t tell you is what I’m writing here.</p>

<p>If you’re trying to deploy llm-d yourself, this post will save you several hours and possibly your sanity. If you’re reading this for entertainment, welcome — the GH200 and I had quite a journey.</p>

<hr />

<h2 id="what-is-llm-d-one-diagram-then-we-move-on">What Is llm-d? (One Diagram, Then We Move On)</h2>

<p>llm-d is a Kubernetes-native inference scheduling layer that sits on top of vLLM. It doesn’t replace vLLM — vLLM still runs inside each pod doing exactly what it always does. What llm-d adds is an <strong>EPP (Endpoint Picker Plugin)</strong> — a custom Kubernetes scheduler that routes each incoming request to the <em>right</em> vLLM pod based on KV cache state, queue depth, and prefix cache hit probability.</p>

<p>The pitch: instead of dumb round-robin load balancing, llm-d routes your request to the decode pod that already has your system prompt cached. Lower TTFT, better GPU utilization, independently scalable prefill and decode pools.</p>

<p>The official architecture diagram shows the contrast clearly:</p>

<figure style="max-width:900px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/llm-d-Vs-NGINX.png" alt="llm-d vs legacy NGINX routing — showing prefix-cache-aware routing, P/D disaggregation, tiered KV cache, and SLO-aware autoscaling" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    Left: legacy round-robin — no prefix cache awareness, no prefill/decode split, generic QPS autoscaling.
    Right: llm-d architecture — EPP prefix-cache-aware routing, dedicated prefill and decode pools connected via RDMA,
    tiered KV cache, SLO-aware autoscaling of each pool independently.
    Source: <a href="https://llm-d.ai/docs/architecture">llm-d.ai/docs/architecture</a>
  </figcaption>
</figure>

<p>That’s what you’re deploying. Now let’s talk about what it takes to actually run it.</p>

<hr />

<h2 id="the-hardware">The Hardware</h2>

<p><strong>Lambda Labs GH200</strong>, single instance, Grace Hopper Superchip.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GPU:          NVIDIA GH200 480GB (unified CPU+GPU memory)
Architecture: ARM64 (aarch64)
OS:           Ubuntu 22.04 LTS
Storage:      1.4TB local SSD
</code></pre></div></div>

<p>The ARM64 part is important — it comes up multiple times in this post in ways that will make you want to scream. Lambda’s GH200 instances come pre-installed with K3s, GPU drivers, and a collection of tools that seem helpful until they quietly conflict with everything you’re trying to do.</p>

<figure style="max-width:900px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/llm-d-on-GH200-lambda-labs.png" alt="Architecture of the llm-d deployment on a single GH200 node: Istio ingress gateway, EPP endpoint picker with scoring plugins, vLLM inference engine, and observability stack" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
  An architectural overview of the llm-d deployment on a single GH200 node: the complete request lifecycle, from the Istio ingress gateway through the EPP (Endpoint Picker) with its scoring plugins, down to the vLLM inference engine and the observability stack.
  </figcaption>
</figure>

<hr />

<h2 id="before-you-even-start-getting-the-instance">Before You Even Start: Getting the Instance</h2>

<p>There is a step zero that nobody writes about: actually getting the GPU.</p>

<p>Cloud GPU availability — especially for high-end hardware like the GH200 — is genuinely constrained. Lambda Labs operates on a first-come, first-served basis for on-demand instances. When you log in and open the Launch Instance dialog, you will see a list of available GPU types. Some will be available. Some will say “Out of capacity.” The GH200 in particular comes and goes.</p>

<figure style="max-width:800px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/lambda-labs.png" alt="Lambda Labs Launch Instance dialog showing available GPU types — GH200 at $2.29/hr highlighted, B200 instances showing Out of capacity" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    Lambda Labs instance selector. The 1x GH200 (96GB), pairing an ARM64 Grace CPU with a Hopper-class GPU, at $2.29/hr was the instance used for these experiments.
    Note the B200 instances showing "Out of capacity" — high-end GPU availability is genuinely constrained and changes throughout the day.
    The GH200 is listed as 96GB here but the actual unified memory pool is 480GB — the 96GB refers to the HBM portion.
  </figcaption>
</figure>

<p><strong>Getting set up on Lambda Labs:</strong></p>

<p>Creating an account takes about 5 minutes. Go to <a href="https://lambda.ai">lambda.ai</a>, sign up, add a payment method, and add your SSH public key under SSH Keys in the dashboard. That last step is easy to forget and will stop you from connecting to any instance you launch.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Your SSH public key — paste this into Lambda Labs SSH Keys dashboard</span>
<span class="nb">cat</span> ~/.ssh/id_&lt;my-key&gt;.pub
</code></pre></div></div>

<p>Once your account is ready, the practical strategy for getting a GH200 on-demand is:</p>

<ul>
  <li><strong>Check availability in the morning</strong> — instances turn over as other users terminate their sessions overnight</li>
  <li><strong>Have your SSH key already added</strong> — you want to be able to launch immediately when a slot opens</li>
  <li><strong>Don’t launch and walk away</strong> — you’re paying per hour, so have your setup commands ready to run</li>
  <li><strong>Set a budget alert</strong> — Lambda Labs doesn’t do this automatically; track your usage manually</li>
</ul>

<p><strong>What these experiments actually cost:</strong></p>

<p>I ran two separate sessions, on two separate GH200 instances. Being honest about the numbers:</p>

<table>
  <thead>
    <tr>
      <th>Session</th>
      <th>What ran</th>
      <th>Cost</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Instance <code class="language-plaintext highlighter-rouge">192.222.50.71</code></td>
      <td>Experiments 1–4 (inference-scheduling, EPP routing, Locust load tests)</td>
      <td><strong>$17.47</strong></td>
    </tr>
    <tr>
      <td>Instance <code class="language-plaintext highlighter-rouge">192.222.57.186</code></td>
      <td>Experiments 5–6 (P/D disaggregation, prefill + decode pods, sustained Locust run)</td>
      <td><strong>$18.31</strong></td>
    </tr>
    <tr>
      <td><strong>Total</strong></td>
      <td> </td>
      <td><strong>$35.78</strong></td>
    </tr>
  </tbody>
</table>

<p>The second session was slightly more expensive because the P/D disaggregation setup took longer to get right — more iteration time on a running instance. The first session was faster in wall-clock time but I was also slower at debugging, which explains the comparable cost.</p>

<p>For context: $35.78 for two full days of hands-on GPU infrastructure experiments is genuinely reasonable. A single A100 hour on AWS is $3.50+. Lambda Labs on-demand pricing is competitive precisely because availability isn’t guaranteed — you trade reliability for cost.</p>

<p><strong>One important note for LLMOps researchers:</strong> I’m running these experiments to gain production-like experience, which means I’m deliberate about GPU spend. On-demand instances that you terminate when done are the right strategy here — avoid reserved instances or always-on setups until you have a specific recurring workload that justifies the commitment.</p>

<hr />

<h2 id="gotcha-1-the-kubeconfig-belongs-to-root">Gotcha 1: The Kubeconfig Belongs to Root</h2>

<p>First thing you do when you SSH into a new Kubernetes node: check the cluster is running.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl get namespaces
</code></pre></div></div>

<p>What you get instead:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>WARN[0000] Unable to read /etc/rancher/k3s/k3s.yaml, please start server
with --write-kubeconfig-mode or --write-kubeconfig-group to modify kube
config permissions
error: error loading config file "/etc/rancher/k3s/k3s.yaml":
open /etc/rancher/k3s/k3s.yaml: permission denied
</code></pre></div></div>

<p>K3s installs its kubeconfig at <code class="language-plaintext highlighter-rouge">/etc/rancher/k3s/k3s.yaml</code> and owns it as root. Your Ubuntu user is not root. Nobody tells you this in the getting-started guide because it seems like a detail, and it is — right until it silently breaks every single <code class="language-plaintext highlighter-rouge">kubectl</code> and <code class="language-plaintext highlighter-rouge">helm</code> command you run for the next two hours.</p>

<p><strong>The fix:</strong> copy the config to your home directory and set the environment variable.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir</span> <span class="nt">-p</span> <span class="nv">$HOME</span>/.kube
<span class="nb">sudo cp</span> /etc/rancher/k3s/k3s.yaml <span class="nv">$HOME</span>/.kube/config
<span class="nb">sudo chown </span>ubuntu:ubuntu <span class="nv">$HOME</span>/.kube/config
<span class="nb">export </span><span class="nv">KUBECONFIG</span><span class="o">=</span><span class="nv">$HOME</span>/.kube/config

<span class="c"># Make it permanent — critical for survival across sessions</span>
<span class="nb">echo</span> <span class="s1">'export KUBECONFIG=$HOME/.kube/config'</span> <span class="o">&gt;&gt;</span> ~/.bashrc
<span class="nb">source</span> ~/.bashrc
</code></pre></div></div>

<p>Now <code class="language-plaintext highlighter-rouge">kubectl get nodes</code> works. You feel a small surge of optimism. Cherish it.</p>

<hr />

<h2 id="gotcha-2-lambda-ships-snap-helm-and-it-is-not-your-friend">Gotcha 2: Lambda Ships Snap Helm and It Is Not Your Friend</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>which helm
<span class="c"># /snap/bin/helm</span>

<span class="nb">ls</span> <span class="nt">-lla</span> /snap/bin/helm
<span class="c"># lrwxrwxrwx 1 root root 13 → /usr/bin/snap</span>
</code></pre></div></div>

<p>Lambda Stack pre-installs Helm via snap. Snap packages run in a sandbox with PATH and permission constraints that quietly break plugin installations and kubeconfig resolution. The llm-d deployment requires specific Helm plugins (<code class="language-plaintext highlighter-rouge">helm-diff</code>) and specific versions. The snap Helm and plugin system do not get along cleanly on ARM64.</p>

<p><strong>The fix:</strong> remove snap Helm and install the official arm64 binary.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Remove snap helm first</span>
<span class="nb">sudo </span>snap remove helm

<span class="c"># Install official arm64 binary</span>
wget https://get.helm.sh/helm-v3.19.0-linux-arm64.tar.gz
<span class="nb">tar </span>xzf helm-v3.19.0-linux-arm64.tar.gz
<span class="nb">sudo mv </span>linux-arm64/helm /usr/local/bin/helm
<span class="nb">rm</span> <span class="nt">-rf</span> linux-arm64 helm-v3.19.0-linux-arm64.tar.gz

<span class="c"># Verify</span>
helm version
<span class="c"># version.BuildInfo{Version:"v3.19.0"...}</span>

<span class="c"># Export and persist</span>
<span class="nb">export </span><span class="nv">HELM_BIN</span><span class="o">=</span>/usr/local/bin/helm
<span class="nb">echo</span> <span class="s1">'export HELM_BIN=/usr/local/bin/helm'</span> <span class="o">&gt;&gt;</span> ~/.bashrc
<span class="nb">source</span> ~/.bashrc
</code></pre></div></div>

<p>Notice I added <code class="language-plaintext highlighter-rouge">HELM_BIN</code> to <code class="language-plaintext highlighter-rouge">.bashrc</code>. This is foreshadowing.</p>

<hr />

<h2 id="gotcha-3-environment-variables-die-when-you-close-the-terminal">Gotcha 3: Environment Variables Die When You Close the Terminal</h2>

<p>This one is deceptively simple and caused a disproportionate amount of grief.</p>

<p>llm-d’s <code class="language-plaintext highlighter-rouge">helmfile</code> commands use <code class="language-plaintext highlighter-rouge">HELM_BIN</code> to find the Helm binary, and take the namespace via <code class="language-plaintext highlighter-rouge">-n $NAMESPACE</code>. Both are environment variables. Both need to exist in every session.</p>

<p>SSH sessions do not carry your previous session’s exported variables. Every time you reconnect, those variables are gone — and the error messages when they’re missing are spectacularly unhelpful:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># What happens when NAMESPACE is empty:</span>
helmfile apply <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>
<span class="c"># flag needs an argument: 'n' in -n</span>
</code></pre></div></div>

<p>That message doesn’t say “your environment variable is empty.” It sends you hunting through Helm documentation for a flag you’ve never seen, before you eventually notice the problem.</p>

<p><strong>The fix:</strong> put everything in <code class="language-plaintext highlighter-rouge">~/.bashrc</code> and verify at the start of every session.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo</span> <span class="s1">'export KUBECONFIG=$HOME/.kube/config'</span> <span class="o">&gt;&gt;</span> ~/.bashrc
<span class="nb">echo</span> <span class="s1">'export HELM_BIN=/usr/local/bin/helm'</span> <span class="o">&gt;&gt;</span> ~/.bashrc
<span class="nb">echo</span> <span class="s1">'export NAMESPACE=llm-d'</span> <span class="o">&gt;&gt;</span> ~/.bashrc
<span class="nb">source</span> ~/.bashrc

<span class="c"># First command of every session:</span>
<span class="nb">echo</span> <span class="nv">$KUBECONFIG</span> <span class="nv">$HELM_BIN</span> <span class="nv">$NAMESPACE</span>
<span class="c"># /home/ubuntu/.kube/config /usr/local/bin/helm llm-d</span>
</code></pre></div></div>

<p>If those three don’t print correctly, nothing downstream will work.</p>
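A small guard at the top of any deployment script turns the cryptic “flag needs an argument” failure into an immediate, named error, using bash’s `${VAR:?message}` expansion:

```shell
# ${VAR:?message} aborts with a clear error when VAR is unset or empty --
# far better than letting an empty $NAMESPACE reach helmfile.
check_env() {
  : "${KUBECONFIG:?export KUBECONFIG=\$HOME/.kube/config first}"
  : "${HELM_BIN:?export HELM_BIN=/usr/local/bin/helm first}"
  : "${NAMESPACE:?export NAMESPACE=llm-d first}"
}
# Illustrative values; on the instance these come from ~/.bashrc
export KUBECONFIG="$HOME/.kube/config" HELM_BIN=/usr/local/bin/helm NAMESPACE=llm-d
check_env && echo "environment OK"
```

Call `check_env` before any `helmfile` or `kubectl` command and the empty-variable class of failure disappears.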

<hr />

<h2 id="gotcha-4-the-default-valuesyaml-will-crash-your-gpu">Gotcha 4: The Default values.yaml Will Crash Your GPU</h2>

<p>When you clone the llm-d repo and open the inference-scheduling guide’s <code class="language-plaintext highlighter-rouge">values.yaml</code>, the default looks like this:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">modelArtifacts</span><span class="pi">:</span>
  <span class="na">uri</span><span class="pi">:</span> <span class="s2">"</span><span class="s">hf://Qwen/Qwen3-32B/tensor"</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s2">"</span><span class="s">Qwen/Qwen3-32B"</span>
  <span class="na">size</span><span class="pi">:</span> <span class="s">80Gi</span>
</code></pre></div></div>

<p>Qwen3-32B. Eighty gigabytes. On a single GPU.</p>

<p>Here is why this matters. GPU memory is not a bottomless pool — it gets divided between everything vLLM needs to run:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────────────────────────────────────────────────────┐
│              GH200 — 480GB Unified Memory               │
├──────────────────────┬──────────────────────────────────┤
│  Model Weights       │  Qwen3-32B FP16 ≈ 65GB           │
│  (static, loaded     │  Qwen3-0.6B 4-bit ≈ 0.4GB        │
│   once at startup)   │                                  │
├──────────────────────┼──────────────────────────────────┤
│  KV Cache            │  Grows per token, per request    │
│  (dynamic, grows     │  Fills whatever is left over     │
│   with context)      │                                  │
├──────────────────────┼──────────────────────────────────┤
│  CUDA Graphs,        │  vLLM pre-allocates ~10–15%      │
│  Activations,        │  for warmup and graph capture    │
│  Overhead            │                                  │
└──────────────────────┴──────────────────────────────────┘
</code></pre></div></div>

<figure style="max-width:800px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/qwen-inside-GH200.png" alt="GPU memory utilization breakdown on a 480GB GH200 node: Qwen3-32B weights at 65GB versus Qwen3-0.6B at 0.4GB, with the remaining headroom available for KV cache" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    GPU memory utilization breakdown on a 480GB node: Qwen3-32B vs. Qwen3-0.6B. This comparison highlights the memory pressure caused by large model weights (65GB) versus the massive KV cache headroom unlocked by using a highly optimized, smaller model.
  </figcaption>
</figure>

<p>With Qwen3-32B, model weights alone consume 65GB. vLLM then pre-allocates KV cache for the maximum sequence length on top. Under concurrent load you are pushing the GPU extremely hard before a single user request arrives — and you’ve also got 20+ minutes of model download and warmup before you can even test anything.</p>

<p>With Qwen3-0.6B (~400MB), the model loads in under 2 minutes and 99% of the GPU is available for KV cache and experiments. That’s the version to start with.</p>
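The per-token KV cache cost is easy to estimate with back-of-envelope arithmetic. The layer, head, and dimension numbers below are illustrative stand-ins, not exact Qwen3 configs (the real values are in each model’s config.json):

```shell
# Back-of-envelope KV cache cost per token. Model-shape numbers are
# illustrative stand-ins; fp16 = 2 bytes per element.
layers=28
kv_heads=8
head_dim=128
bytes_per_param=2
# 2 tensors (K and V) per layer, per token:
per_token=$(( 2 * layers * kv_heads * head_dim * bytes_per_param ))
echo "KV cache per token: $per_token bytes (~$(( per_token / 1024 )) KiB)"
echo "one 4k-token context: $(( per_token * 4096 / 1024 / 1024 )) MiB"
```

Multiply that per-context figure by the concurrency you expect and you see why the weights are only the down payment: the KV cache is the recurring cost.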

<p><strong>The fix:</strong> replace the values file entirely. The full working <code class="language-plaintext highlighter-rouge">values.yaml</code> is below — note two critical changes from the default: model is Qwen3-0.6B, and the image is changed (explained in the next gotcha).</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># ms-inference-scheduling/values.yaml — working version</span>
<span class="na">multinode</span><span class="pi">:</span> <span class="no">false</span>

<span class="na">modelArtifacts</span><span class="pi">:</span>
  <span class="na">uri</span><span class="pi">:</span> <span class="s2">"</span><span class="s">hf://Qwen/Qwen3-0.6B"</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s2">"</span><span class="s">Qwen/Qwen3-0.6B"</span>
  <span class="na">size</span><span class="pi">:</span> <span class="s">2Gi</span>
  <span class="na">authSecretName</span><span class="pi">:</span> <span class="s2">"</span><span class="s">llm-d-hf-token"</span>
  <span class="na">labels</span><span class="pi">:</span>
    <span class="na">llm-d.ai/inference-serving</span><span class="pi">:</span> <span class="s2">"</span><span class="s">true"</span>
    <span class="na">llm-d.ai/guide</span><span class="pi">:</span> <span class="s2">"</span><span class="s">inference-scheduling"</span>
    <span class="na">llm-d.ai/accelerator-variant</span><span class="pi">:</span> <span class="s2">"</span><span class="s">gpu"</span>
    <span class="na">llm-d.ai/accelerator-vendor</span><span class="pi">:</span> <span class="s2">"</span><span class="s">nvidia"</span>
    <span class="na">llm-d.ai/model</span><span class="pi">:</span> <span class="s2">"</span><span class="s">Qwen3-0.6B"</span>

<span class="na">routing</span><span class="pi">:</span>
  <span class="na">proxy</span><span class="pi">:</span>
    <span class="na">enabled</span><span class="pi">:</span> <span class="no">false</span>
    <span class="na">targetPort</span><span class="pi">:</span> <span class="m">8000</span>

<span class="na">accelerator</span><span class="pi">:</span>
  <span class="na">type</span><span class="pi">:</span> <span class="s">nvidia</span>

<span class="na">decode</span><span class="pi">:</span>
  <span class="na">create</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">parallelism</span><span class="pi">:</span>
    <span class="na">tensor</span><span class="pi">:</span> <span class="m">1</span>
    <span class="na">data</span><span class="pi">:</span> <span class="m">1</span>
  <span class="na">replicas</span><span class="pi">:</span> <span class="m">1</span>
  <span class="na">monitoring</span><span class="pi">:</span>
    <span class="na">podmonitor</span><span class="pi">:</span>
      <span class="na">enabled</span><span class="pi">:</span> <span class="no">true</span>
      <span class="na">portName</span><span class="pi">:</span> <span class="s2">"</span><span class="s">vllm"</span>
      <span class="na">path</span><span class="pi">:</span> <span class="s2">"</span><span class="s">/metrics"</span>
      <span class="na">interval</span><span class="pi">:</span> <span class="s2">"</span><span class="s">30s"</span>
  <span class="na">containers</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s2">"</span><span class="s">vllm"</span>
      <span class="na">image</span><span class="pi">:</span> <span class="s">vllm/vllm-openai:latest</span>   <span class="c1"># ← critical change, see next gotcha</span>
      <span class="na">modelCommand</span><span class="pi">:</span> <span class="s">vllmServe</span>
      <span class="na">args</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="s2">"</span><span class="s">--disable-uvicorn-access-log"</span>
        <span class="pi">-</span> <span class="s2">"</span><span class="s">--gpu-memory-utilization=0.90"</span>
      <span class="na">ports</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="na">containerPort</span><span class="pi">:</span> <span class="m">8000</span>
          <span class="na">name</span><span class="pi">:</span> <span class="s">vllm</span>
          <span class="na">protocol</span><span class="pi">:</span> <span class="s">TCP</span>
      <span class="na">resources</span><span class="pi">:</span>
        <span class="na">limits</span><span class="pi">:</span>
          <span class="na">cpu</span><span class="pi">:</span> <span class="s1">'</span><span class="s">16'</span>
          <span class="na">memory</span><span class="pi">:</span> <span class="s">60Gi</span>
        <span class="na">requests</span><span class="pi">:</span>
          <span class="na">cpu</span><span class="pi">:</span> <span class="s1">'</span><span class="s">8'</span>
          <span class="na">memory</span><span class="pi">:</span> <span class="s">30Gi</span>
      <span class="na">mountModelVolume</span><span class="pi">:</span> <span class="no">true</span>
      <span class="na">volumeMounts</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">metrics-volume</span>
          <span class="na">mountPath</span><span class="pi">:</span> <span class="s">/.config</span>
        <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">shm</span>
          <span class="na">mountPath</span><span class="pi">:</span> <span class="s">/dev/shm</span>
        <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">torch-compile-cache</span>
          <span class="na">mountPath</span><span class="pi">:</span> <span class="s">/.cache</span>
      <span class="na">startupProbe</span><span class="pi">:</span>
        <span class="na">httpGet</span><span class="pi">:</span>
          <span class="na">path</span><span class="pi">:</span> <span class="s">/v1/models</span>
          <span class="na">port</span><span class="pi">:</span> <span class="s">vllm</span>
        <span class="na">initialDelaySeconds</span><span class="pi">:</span> <span class="m">15</span>
        <span class="na">periodSeconds</span><span class="pi">:</span> <span class="m">30</span>
        <span class="na">timeoutSeconds</span><span class="pi">:</span> <span class="m">5</span>
        <span class="na">failureThreshold</span><span class="pi">:</span> <span class="m">120</span>
      <span class="na">livenessProbe</span><span class="pi">:</span>
        <span class="na">httpGet</span><span class="pi">:</span>
          <span class="na">path</span><span class="pi">:</span> <span class="s">/health</span>
          <span class="na">port</span><span class="pi">:</span> <span class="s">vllm</span>
        <span class="na">periodSeconds</span><span class="pi">:</span> <span class="m">10</span>
        <span class="na">timeoutSeconds</span><span class="pi">:</span> <span class="m">5</span>
        <span class="na">failureThreshold</span><span class="pi">:</span> <span class="m">3</span>
      <span class="na">readinessProbe</span><span class="pi">:</span>
        <span class="na">httpGet</span><span class="pi">:</span>
          <span class="na">path</span><span class="pi">:</span> <span class="s">/v1/models</span>
          <span class="na">port</span><span class="pi">:</span> <span class="s">vllm</span>
        <span class="na">periodSeconds</span><span class="pi">:</span> <span class="m">5</span>
        <span class="na">timeoutSeconds</span><span class="pi">:</span> <span class="m">2</span>
        <span class="na">failureThreshold</span><span class="pi">:</span> <span class="m">3</span>
  <span class="na">volumes</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">metrics-volume</span>
      <span class="na">emptyDir</span><span class="pi">:</span> <span class="pi">{}</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">torch-compile-cache</span>
      <span class="na">emptyDir</span><span class="pi">:</span> <span class="pi">{}</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">shm</span>
      <span class="na">emptyDir</span><span class="pi">:</span>
        <span class="na">medium</span><span class="pi">:</span> <span class="s">Memory</span>
        <span class="na">sizeLimit</span><span class="pi">:</span> <span class="s">20Gi</span>

<span class="na">prefill</span><span class="pi">:</span>
  <span class="na">create</span><span class="pi">:</span> <span class="no">false</span>
</code></pre></div></div>

<p>Start small. Prove the stack works. Scale the model up later.</p>

<hr />

<h2 id="gotcha-5-llm-d-cuda-is-x86-only">Gotcha 5: <code class="language-plaintext highlighter-rouge">llm-d-cuda</code> Is x86-Only</h2>

<p>The default values.yaml uses <code class="language-plaintext highlighter-rouge">ghcr.io/llm-d/llm-d-cuda:v0.6.0</code> as the vLLM container image. This image was built for x86_64 (amd64). The GH200 is ARM64 (aarch64).</p>

<p>When Kubernetes tries to run an x86 image on an ARM64 node, it doesn’t fail with “wrong architecture.” It fails with a Triton compiler error deep inside the container startup, several minutes after the pod appears to be Running:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Triton compilation failed: RuntimeError: ...
CUDA error: device-side assert triggered
</code></pre></div></div>

<p>The pod crashes. CrashLoopBackOff. <code class="language-plaintext highlighter-rouge">kubectl describe pod</code> shows the image pulled successfully. The logs look like GPU memory issues. You spend time adjusting <code class="language-plaintext highlighter-rouge">--gpu-memory-utilization</code> and redeploying — none of which helps, because the problem is binary architecture, not memory configuration.</p>

<p><strong>The fix:</strong> <code class="language-plaintext highlighter-rouge">vllm/vllm-openai:latest</code> is multi-arch and includes a proper ARM64 build.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Change this:</span>
<span class="na">image</span><span class="pi">:</span> <span class="s">ghcr.io/llm-d/llm-d-cuda:v0.6.0</span>

<span class="c1"># To this:</span>
<span class="na">image</span><span class="pi">:</span> <span class="s">vllm/vllm-openai:latest</span>
</code></pre></div></div>

<p>One line. Saves two hours.</p>
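<p>You can catch this class of mismatch up front by comparing the node architecture with what the image manifest advertises. One wrinkle: <code class="language-plaintext highlighter-rouge">uname -m</code> says <code class="language-plaintext highlighter-rouge">aarch64</code>, while OCI manifests say <code class="language-plaintext highlighter-rouge">arm64</code>. A small helper (mine, not part of llm-d) bridges the naming:</p>

```shell
# Translate kernel architecture names (from `uname -m`) into the OCI
# labels used in image manifests, so node and image can be compared.
to_oci_arch() {
  case "$1" in
    x86_64)  echo amd64 ;;
    aarch64) echo arm64 ;;
    *)       echo "$1"  ;;
  esac
}

to_oci_arch "$(uname -m)"
# Compare against the image's advertised platforms, e.g.:
#   docker manifest inspect vllm/vllm-openai:latest | grep '"architecture"'
# If your node's value is missing from that list, the image cannot run here.
```

<p>If the architectures the manifest reports don’t include the value this prints on your node, the pod will crash no matter how you tune memory flags.</p>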

<hr />

<h2 id="gotcha-6-gateway-api-crds-dont-come-with-the-cluster">Gotcha 6: Gateway API CRDs Don’t Come With the Cluster</h2>

<p>llm-d uses the Kubernetes Gateway API — <code class="language-plaintext highlighter-rouge">HTTPRoute</code> and <code class="language-plaintext highlighter-rouge">Gateway</code> custom resources. These are not part of standard Kubernetes and do not come pre-installed with K3s.</p>

<p>When <code class="language-plaintext highlighter-rouge">helmfile apply</code> runs without these CRDs, it fails with unknown resource type errors. If you’re not familiar with the Gateway API, you’ll look at your Istio installation instead of at the missing CRDs.</p>

<p><strong>The fix:</strong> install both sets of CRDs before anything else.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Gateway API CRDs</span>
kubectl apply <span class="nt">-f</span> https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.1/standard-install.yaml

<span class="c"># llm-d InferencePool CRDs</span>
kubectl apply <span class="nt">-f</span> https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v0.3.0/manifests.yaml

<span class="c"># Verify both</span>
kubectl get crd | <span class="nb">grep</span> <span class="nt">-E</span> <span class="s2">"gateway|inference"</span>
</code></pre></div></div>

<p>Ten seconds to apply. Saves a debugging session.</p>
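<p>To turn that into a fail-fast preflight, diff the CRDs you need against the cluster's list before running <code class="language-plaintext highlighter-rouge">helmfile</code>. The helper below is my own sketch; the CRD names are the ones these two manifests install:</p>

```shell
# Print every required CRD that is absent from the supplied
# `kubectl get crd -o name` output; prints nothing when all are present.
missing_crds() {
  local have="$1"; shift
  local crd
  for crd in "$@"; do
    printf '%s\n' "$have" | grep -q "/${crd}\$" || echo "$crd"
  done
}

# Usage against a live cluster:
#   missing_crds "$(kubectl get crd -o name)" \
#     gateways.gateway.networking.k8s.io \
#     httproutes.gateway.networking.k8s.io \
#     inferencepools.inference.networking.x-k8s.io
```

<p>Run it at the top of your deploy script and bail out with a readable message instead of letting helmfile die on unknown resource types.</p>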

<hr />

<h2 id="gotcha-7-use-helm-template--kubectl-apply--k8s-debugging-101">Gotcha 7: Use <code class="language-plaintext highlighter-rouge">helm template | kubectl apply</code> — K8s Debugging 101</h2>

<p><code class="language-plaintext highlighter-rouge">helmfile apply</code> is the documented deployment path. It’s also the one that, on K3s ARM64, would silently deploy only part of the release — some resources applied, others skipped, exit code 0.</p>

<p>This is actually a universal Helm debugging pattern worth internalising regardless of llm-d: <strong>render the chart to YAML first, then apply</strong>. Helmfile adds an abstraction layer that can obscure what’s actually being sent to the API server. <code class="language-plaintext highlighter-rouge">helm template</code> removes that layer and gives you full visibility.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Step 1: Render to YAML and inspect — no cluster changes</span>
<span class="nv">$HELM_BIN</span> template ms-inference-scheduling <span class="se">\</span>
  llm-d-modelservice/llm-d-modelservice <span class="se">\</span>
  <span class="nt">--namespace</span> llm-d <span class="se">\</span>
  <span class="nt">--values</span> ms-inference-scheduling/values.yaml <span class="se">\</span>
  | less

<span class="c"># Step 2: Client-side dry-run — catches malformed YAML and missing CRDs, no cluster changes</span>
<span class="nv">$HELM_BIN</span> template ms-inference-scheduling <span class="se">\</span>
  llm-d-modelservice/llm-d-modelservice <span class="se">\</span>
  <span class="nt">--namespace</span> llm-d <span class="se">\</span>
  <span class="nt">--values</span> ms-inference-scheduling/values.yaml <span class="se">\</span>
  | kubectl apply <span class="nt">-n</span> llm-d <span class="nt">-f</span> - <span class="nt">--dry-run</span><span class="o">=</span>client

<span class="c"># Step 3: Apply for real</span>
<span class="nv">$HELM_BIN</span> template ms-inference-scheduling <span class="se">\</span>
  llm-d-modelservice/llm-d-modelservice <span class="se">\</span>
  <span class="nt">--namespace</span> llm-d <span class="se">\</span>
  <span class="nt">--values</span> ms-inference-scheduling/values.yaml <span class="se">\</span>
  | kubectl apply <span class="nt">-n</span> llm-d <span class="nt">-f</span> -
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">--dry-run=client</code> step is the one most people skip and most people regret skipping. It checks your manifests locally and resolves every resource type against the cluster’s discovery API, so missing CRDs surface immediately as “no matches for kind” errors, and it shows exactly what would be created or updated before anything changes in the cluster. If you also want the server’s full schema validation and admission webhooks, go one step further with <code class="language-plaintext highlighter-rouge">--dry-run=server</code>. Whenever a Helm deployment behaves unexpectedly, this three-step render-validate-apply pattern is where to start debugging.</p>

<hr />

<h2 id="gotcha-8-the-httproute-is-not-applied-by-helmfile">Gotcha 8: The HTTPRoute Is Not Applied by helmfile</h2>

<p>After all pods are running, you port-forward the gateway and send a test request. You get nothing back.</p>

<p>The reason: the HTTPRoute — the rule that tells the gateway to forward traffic to the EPP — is not deployed by helmfile. It lives in a separate YAML file and must be applied manually.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Check first:</span>
kubectl get httproute <span class="nt">-n</span> llm-d
<span class="c"># No resources found in llm-d namespace.</span>

<span class="c"># Apply:</span>
kubectl apply <span class="nt">-f</span> ~/llm-d/llm-d/guides/inference-scheduling/httproute.yaml <span class="nt">-n</span> llm-d

<span class="c"># Verify:</span>
kubectl get httproute <span class="nt">-n</span> llm-d
<span class="c"># NAME                         AGE</span>
<span class="c"># llm-d-inference-scheduling   6s</span>
</code></pre></div></div>

<p>Without this, the gateway has no routing rules. Every request returns empty or times out. No errors appear in pod logs because — from the gateway’s perspective — nothing is wrong. There is just no route configured.</p>
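<p>For orientation, the route has roughly this shape: a <code class="language-plaintext highlighter-rouge">parentRef</code> pointing at the gateway and a <code class="language-plaintext highlighter-rouge">backendRef</code> pointing at the EPP’s InferencePool. The names below are illustrative; apply the guide’s <code class="language-plaintext highlighter-rouge">httproute.yaml</code> as-is rather than hand-writing this:</p>

```yaml
# Illustrative sketch only; apply the httproute.yaml shipped with the guide.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-d-inference-scheduling
spec:
  parentRefs:
    - name: infra-inference-scheduling-inference-gateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: gaie-inference-scheduling
```

<p>The unusual part compared to a vanilla HTTPRoute is the backend: it targets an <code class="language-plaintext highlighter-rouge">InferencePool</code> rather than a Service, which is what hands routing decisions to the EPP.</p>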

<hr />

<h2 id="gotcha-9-immutable-selector-labels-mean-you-cant-upgrade-in-place">Gotcha 9: Immutable Selector Labels Mean You Can’t Upgrade In-Place</h2>

<p>If you change the model name in <code class="language-plaintext highlighter-rouge">values.yaml</code> after deploying, <code class="language-plaintext highlighter-rouge">helm upgrade</code> will fail:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Error: UPGRADE FAILED: cannot patch "ms-inference-scheduling-llm-d-modelservice-decode"
with kind Deployment: Deployment.apps is invalid:
spec.selector: Invalid value: ... field is immutable
</code></pre></div></div>

<p>Kubernetes Deployment selector labels are immutable. Changing the model name — which is part of the label selector — requires deleting the Deployment first.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl delete deployment <span class="nt">-n</span> llm-d <span class="se">\</span>
  ms-inference-scheduling-llm-d-modelservice-decode 2&gt;/dev/null <span class="o">||</span> <span class="nb">true</span>

<span class="nv">$HELM_BIN</span> template ms-inference-scheduling <span class="se">\</span>
  llm-d-modelservice/llm-d-modelservice <span class="se">\</span>
  <span class="nt">--namespace</span> llm-d <span class="se">\</span>
  <span class="nt">--values</span> ms-inference-scheduling/values.yaml <span class="se">\</span>
  | kubectl apply <span class="nt">-n</span> llm-d <span class="nt">-f</span> -
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">|| true</code> prevents failure if the deployment doesn’t exist yet — useful for idempotent scripts.</p>

<hr />

<h2 id="gotcha-10-port-forward-processes-die-silently-and-dont-tell-you">Gotcha 10: Port-Forward Processes Die Silently and Don’t Tell You</h2>

<p>Access to the llm-d gateway from your Mac goes through an SSH tunnel plus <code class="language-plaintext highlighter-rouge">kubectl port-forward</code>. When the port-forward dies — session timeout, network hiccup — the TCP port on your Mac stays bound. Next tunnel attempt:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Unable to listen on port 8080:
[unable to create listener: Error listen tcp4 127.0.0.1:8080: bind: address already in use]
</code></pre></div></div>

<p><strong>The fix:</strong> kill the zombie first, then verify the fresh tunnel is alive before benchmarking.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Kill whatever holds port 8080</span>
lsof <span class="nt">-ti</span> :8080 | xargs <span class="nb">kill</span> <span class="nt">-9</span> 2&gt;/dev/null <span class="o">||</span> <span class="nb">true</span>

<span class="c"># Fresh tunnel</span>
ssh <span class="nt">-L</span> 8080:localhost:8080 ubuntu@&lt;LAMBDA_IP&gt; <span class="se">\</span>
  <span class="s2">"KUBECONFIG=/home/ubuntu/.kube/config kubectl port-forward </span><span class="se">\</span><span class="s2">
   -n llm-d svc/infra-inference-scheduling-inference-gateway-istio 8080:80"</span>

<span class="c"># Verify before doing anything else</span>
curl <span class="nt">-s</span> http://localhost:8080/v1/models | jq .data[0].id
<span class="c"># "Qwen/Qwen3-0.6B"  ← tunnel alive</span>
<span class="c"># (nothing / timeout) ← restart the tunnel</span>
</code></pre></div></div>

<p>Running Locust against a dead port-forward gives you 0 completed requests and makes you think the model is broken. Always verify first.</p>
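<p>To make that verification unskippable, I poll the endpoint with a timeout before every benchmark run. The helper is my own, not part of llm-d:</p>

```shell
# Poll a URL until it answers or the attempt budget runs out.
# Returns 0 as soon as the endpoint responds, 1 on timeout.
wait_for_url() {
  local url="$1" tries="${2:-30}" i
  for i in $(seq "$tries"); do
    curl -sf --max-time 2 "$url" >/dev/null && return 0
    sleep 1
  done
  return 1
}

wait_for_url http://localhost:8080/v1/models 3 \
  || echo "tunnel dead; restart the port-forward before benchmarking"
```

<p>Gate the Locust invocation on this returning 0 and the zero-requests failure mode disappears.</p>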

<hr />

<h2 id="what-running-looks-like">What Running Looks Like</h2>

<p>After all of the above — correct kubeconfig, real Helm binary, environment variables in <code class="language-plaintext highlighter-rouge">.bashrc</code>, small model, right image, Gateway API CRDs installed, HTTPRoute applied, Deployment deleted and recreated when needed — this is what a healthy deployment looks like:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl get pods <span class="nt">-n</span> llm-d

NAME                                                              READY   STATUS    RESTARTS   AGE
gaie-inference-scheduling-epp-584f797cc8-4gvw8                    1/1     Running   0          68m
infra-inference-scheduling-inference-gateway-istio-7c5546dr7kd2   1/1     Running   0          68m
ms-inference-scheduling-llm-d-modelservice-decode-57678587gqzlt   1/1     Running   0          10m
</code></pre></div></div>

<p>Three pods, all <code class="language-plaintext highlighter-rouge">1/1 Running</code>. The decode pod may restart once while the model downloads — that’s normal. Zero restarts on EPP and gateway.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Gateway is programmed and routing</span>
kubectl get gateway <span class="nt">-n</span> llm-d
<span class="c"># infra-inference-scheduling-inference-gateway   istio   True   68m</span>

<span class="c"># A request through the gateway returns a response</span>
curl <span class="nt">-s</span> http://localhost:8080/v1/chat/completions <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"Content-Type: application/json"</span> <span class="se">\</span>
  <span class="nt">-d</span> <span class="s1">'{"model":"Qwen/Qwen3-0.6B",
       "messages":[{"role":"user","content":"What is KV cache?"}],
       "max_tokens":50}'</span> | jq .choices[0].message.content
<span class="c"># "&lt;think&gt;\nOkay, so I need to figure out what KV cache is..."</span>
</code></pre></div></div>

<p>That response, after everything above, feels like a small engineering miracle.</p>

<hr />

<h2 id="the-observability-stack">The Observability Stack</h2>

<p>The kube-prometheus-stack installs Grafana with 33 dashboards. After a fresh deploy, confirm everything is healthy before running experiments:</p>

<figure style="max-width:900px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/grafana-overview-llmd.png" alt="Grafana Overview — 0 alerts firing, 33 dashboards loaded, Grafana v12.4.2 healthy" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    Grafana Overview after successful llm-d deployment: 0 alerts firing, 33 dashboards loaded, API server responding.
    The two dashboards that matter for experiments: <strong>llm-d Performance Dashboard</strong> (TTFT, ITL, KV cache hit rate)
    and <strong>llm-d vLLM Overview</strong> (E2E latency, scheduler state, prefill vs decode time split).
  </figcaption>
</figure>

<p>If the llm-d dashboards show “No data”, check that the PodMonitor was created:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl get podmonitor <span class="nt">-n</span> llm-d
<span class="c"># If missing, ensure values.yaml has monitoring.podmonitor.enabled: true</span>
</code></pre></div></div>

<hr />

<h2 id="the-survival-checklist">The Survival Checklist</h2>

<p>First commands of every session — before touching anything else:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># 1. Environment variables present</span>
<span class="nb">echo</span> <span class="nv">$KUBECONFIG</span> <span class="nv">$HELM_BIN</span> <span class="nv">$NAMESPACE</span>
<span class="c"># /home/ubuntu/.kube/config /usr/local/bin/helm llm-d</span>

<span class="c"># 2. All pods running</span>
kubectl get pods <span class="nt">-n</span> llm-d
<span class="c"># All 3 pods: 1/1 Running</span>

<span class="c"># 3. HTTPRoute exists</span>
kubectl get httproute <span class="nt">-n</span> llm-d

<span class="c"># 4. Port-forward alive</span>
curl <span class="nt">-s</span> http://localhost:8080/v1/models | jq .data[0].id
<span class="c"># "Qwen/Qwen3-0.6B"</span>

<span class="c"># 5. Grafana accessible</span>
curl <span class="nt">-s</span> http://localhost:3000/api/health | jq .database
<span class="c"># "ok"</span>
</code></pre></div></div>

<p>If any of these fail, fix it before proceeding.</p>
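<p>Step 1 is also easy to script. A small guard (a hypothetical helper, not part of llm-d) that refuses to continue while any required variable is unset:</p>

```shell
# Report every variable name that is unset or empty; exit status 1 if any.
require_env() {
  local status=0 v val
  for v in "$@"; do
    eval "val=\${$v:-}"
    if [ -z "$val" ]; then
      echo "unset: $v" >&2
      status=1
    fi
  done
  return $status
}

require_env KUBECONFIG HELM_BIN NAMESPACE \
  || echo "fix the environment before touching the cluster"
```

<p>Drop it at the top of every session script; it turns the silent wrong-kubeconfig failure mode into a loud one.</p>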

<hr />

<h2 id="the-full-deployment-sequence">The Full Deployment Sequence</h2>

<p>For anyone who wants the complete working sequence without the narrative:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># ── Fix environment ─────────────────────────────────────────────────</span>
<span class="nb">mkdir</span> <span class="nt">-p</span> <span class="nv">$HOME</span>/.kube
<span class="nb">sudo cp</span> /etc/rancher/k3s/k3s.yaml <span class="nv">$HOME</span>/.kube/config
<span class="nb">sudo chown </span>ubuntu:ubuntu <span class="nv">$HOME</span>/.kube/config
<span class="nb">echo</span> <span class="s1">'export KUBECONFIG=$HOME/.kube/config'</span> <span class="o">&gt;&gt;</span> ~/.bashrc
<span class="nb">echo</span> <span class="s1">'export HELM_BIN=/usr/local/bin/helm'</span> <span class="o">&gt;&gt;</span> ~/.bashrc
<span class="nb">echo</span> <span class="s1">'export NAMESPACE=llm-d'</span> <span class="o">&gt;&gt;</span> ~/.bashrc
<span class="nb">source</span> ~/.bashrc

<span class="c"># ── Install real Helm (ARM64) ───────────────────────────────────────</span>
<span class="nb">sudo </span>snap remove helm
wget https://get.helm.sh/helm-v3.19.0-linux-arm64.tar.gz
<span class="nb">tar </span>xzf helm-v3.19.0-linux-arm64.tar.gz
<span class="nb">sudo mv </span>linux-arm64/helm /usr/local/bin/helm

<span class="c"># ── Install CRDs ────────────────────────────────────────────────────</span>
kubectl apply <span class="nt">-f</span> https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.1/standard-install.yaml
kubectl apply <span class="nt">-f</span> https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v0.3.0/manifests.yaml

<span class="c"># ── Repos and HuggingFace secret ────────────────────────────────────</span>
git clone https://github.com/llm-d/llm-d.git
<span class="nv">$HELM_BIN</span> repo add llm-d-modelservice <span class="se">\</span>
  https://llm-d-incubation.github.io/llm-d-modelservice/
<span class="nv">$HELM_BIN</span> repo update

kubectl create namespace <span class="nv">$NAMESPACE</span>
kubectl create secret generic llm-d-hf-token <span class="se">\</span>
  <span class="nt">--from-literal</span><span class="o">=</span><span class="s2">"HF_TOKEN=</span><span class="k">${</span><span class="nv">HF_TOKEN</span><span class="k">}</span><span class="s2">"</span> <span class="se">\</span>
  <span class="nt">--namespace</span> <span class="s2">"</span><span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span><span class="s2">"</span> <span class="se">\</span>
  <span class="nt">--dry-run</span><span class="o">=</span>client <span class="nt">-o</span> yaml | kubectl apply <span class="nt">-f</span> -

<span class="c"># ── Deploy infra (EPP + gateway) ────────────────────────────────────</span>
<span class="nb">cd</span> ~/llm-d/llm-d/guides/inference-scheduling
<span class="nv">HELM_BIN</span><span class="o">=</span><span class="nv">$HELM_BIN</span> helmfile apply <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>

<span class="c"># ── Deploy modelservice (reliable pattern) ──────────────────────────</span>
<span class="nv">$HELM_BIN</span> template ms-inference-scheduling <span class="se">\</span>
  llm-d-modelservice/llm-d-modelservice <span class="se">\</span>
  <span class="nt">--namespace</span> llm-d <span class="se">\</span>
  <span class="nt">--values</span> ms-inference-scheduling/values.yaml <span class="se">\</span>
  | kubectl apply <span class="nt">-n</span> llm-d <span class="nt">-f</span> -

<span class="c"># ── HTTPRoute (always manual) ───────────────────────────────────────</span>
kubectl apply <span class="nt">-f</span> httproute.yaml <span class="nt">-n</span> llm-d

<span class="c"># ── Verify ──────────────────────────────────────────────────────────</span>
kubectl get pods <span class="nt">-n</span> llm-d
curl <span class="nt">-s</span> http://localhost:8080/v1/models | jq <span class="nb">.</span>
</code></pre></div></div>

<p>Every step is here because I needed it.</p>

<hr />

<h2 id="what-this-unlocks">What This Unlocks</h2>

<p>With the stack running, the EPP is making routing decisions on every request — consulting prefix cache, queue depth, and KV cache utilization scorers on each incoming call. You just can’t see it yet with a single decode pod and no load.</p>

<p>The next post in this series covers what happens when you actually put the system under load — EPP prefix cache routing in action, KV cache hit rate climbing to <strong>81.1%</strong> in Grafana, TTFT stabilising at <strong>15ms p50</strong> under sustained concurrent traffic, and the Locust results from a system that’s routing intelligently rather than guessing.</p>

<p>The deployment pain was worth it. The numbers make that clear.</p>

<hr />

<p><em>Deployed on Lambda Labs GH200 480GB, K3s, llm-d v0.4.0, Qwen3-0.6B, vllm/vllm-openai:latest. Scripts will be made available soon via a GitHub repository. Platform engineer with 11+ years in distributed systems going deep on LLM serving infrastructure.</em></p>

<p><em><a href="https://github.com/kraghavan">GitHub</a> · <a href="https://linkedin.com/in/karthikaraghavan">LinkedIn</a></em></p>]]></content><author><name>Karthika Raghavan</name></author><category term="llm-infrastructure" /><category term="inference" /><category term="llm-d" /><category term="kubernetes" /><category term="k3s" /><category term="helm" /><category term="vllm" /><category term="gpu" /><category term="lambda-labs" /><category term="gh200" /><category term="arm64" /><category term="deployment" /><category term="sre" /><summary type="html"><![CDATA[I deployed llm-d on a Lambda Labs GH200. Nothing worked first try. Here is the honest account of what broke, why, and how to fix it — so you don't spend your GPU budget finding out the hard way.]]></summary></entry><entry><title type="html">Treating the M4 Mac Mini Like a Production Inference Server (It Tried)</title><link href="https://kraghavan.github.io/llm-infrastructure/inference/2026/04/16/vllm-ollama-apple-silicon-experiment2.html" rel="alternate" type="text/html" title="Treating the M4 Mac Mini Like a Production Inference Server (It Tried)" /><published>2026-04-16T00:00:00+00:00</published><updated>2026-04-16T00:00:00+00:00</updated><id>https://kraghavan.github.io/llm-infrastructure/inference/2026/04/16/vllm-ollama-apple-silicon-experiment2</id><content type="html" xml:base="https://kraghavan.github.io/llm-infrastructure/inference/2026/04/16/vllm-ollama-apple-silicon-experiment2.html"><![CDATA[<p>I spent a week running inference experiments on my M4 Mac Mini — not to build a product, but to understand what actually happens inside an LLM serving stack. TTFT. TPOT. KV cache. Continuous batching. These are words I had read in papers and blog posts. This week I measured them. Here is what I found.</p>

<p>This post is the hands-on companion to <a href="/llm-infrastructure/inference/2026/04/14/re-introduction-to-inference.html">Post 1</a>. Same mental model — but now with real hardware, real Grafana dashboards, and numbers you can reproduce yourself.</p>

<p><strong>Hardware:</strong> M4 Mac Mini, 16GB unified memory<br />
<strong>Model:</strong> <code class="language-plaintext highlighter-rouge">mlx-community/Qwen3-0.6B-4bit</code> (~400MB, 4-bit quantized)<br />
<strong>Inference engine:</strong> vllm-metal 0.13.0 (official vllm-project Apple Silicon plugin)<br />
<strong>Observability:</strong> Prometheus + Grafana via Docker Compose<br />
<strong>Load testing:</strong> vegeta + Locust<br />
<strong>Gateway experiment:</strong> kind cluster + nginx reverse proxy</p>

<figure style="max-width:800px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/setup-architecture.png" alt="Local inference setup — vllm-metal serving Qwen3-0.6B-4bit, Prometheus and Grafana via Docker Compose, Locust and vegeta for load testing, nginx on kind as a K8s gateway experiment" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    The full local setup: vllm-metal serving Qwen3-0.6B-4bit on M4 Mac Mini unified memory, 
    Prometheus scraping <code>/metrics</code> every 5s, Grafana dashboards rendering live, 
    Locust and vegeta generating load. The kind cluster with nginx is the K8s gateway 
    experiment — simulating the Envoy position in a production llm-d deployment.
  </figcaption>
</figure>

<hr />

<h2 id="why-apple-silicon-for-this-experiment">Why Apple Silicon for This Experiment?</h2>

<p>The M4 Mac Mini has 16GB of unified memory — shared between CPU and GPU. This is both a strength and a constraint for inference.</p>

<p><strong>Strength:</strong> Unified memory means the model weights aren’t copied between CPU RAM and a discrete GPU’s VRAM. Zero-copy tensor operations. For a small quantized model, this is genuinely fast.</p>

<p><strong>Constraint:</strong> 16GB is 16GB. The model, the KV cache, the OS, and every other process share the same pool. Under concurrent load, you will find the ceiling.</p>

<p>The more interesting reason: I wanted to validate that the mental model from Post 1 — prefill is compute-bound, decode is memory-bound, KV cache is the critical resource — holds on real hardware, not just in theory. It does. With caveats.</p>

<hr />

<h2 id="the-setup">The Setup</h2>

<h3 id="why-vllm-metal-not-ollama">Why vllm-metal, Not Ollama?</h3>

<p>Before any benchmarks: I used both, and the choice matters more than I expected.</p>

<p><strong>vllm-metal</strong> is the official vllm-project Apple Silicon plugin. It runs Metal GPU kernels via MLX, exposes a full OpenAI-compatible API, and critically — emits Prometheus metrics out of the box. Same codebase as cloud vLLM, same API surface, same metric names. What you learn here transfers directly to a GPU cluster.</p>

<p><strong>Ollama</strong> is simpler to install and great for single-user local use. But it doesn’t expose Prometheus metrics natively, and — as the benchmark section will show — it doesn’t implement continuous batching. Under concurrent load, that difference is not subtle.</p>

<p>One gotcha worth flagging: <code class="language-plaintext highlighter-rouge">vllm-mlx</code> (different from vllm-metal) is a third-party wrapper that was broken with <code class="language-plaintext highlighter-rouge">mlx-lm&gt;=0.31.0</code> as of April 2026. Use <code class="language-plaintext highlighter-rouge">vllm-metal</code> — the official plugin — and avoid that detour.</p>

<p><strong>A note on scope:</strong> this post covers vLLM and Ollama only. TGI, TensorRT-LLM, ExLlamaV2, and SGLang are all on the list — but running meaningful benchmarks against each requires dedicated GPU time, and GPU time costs money I’m not spending during a job search. I’ll get to them when the experiments justify the cost. For now, the vLLM vs Ollama comparison is grounded in real numbers on real hardware, and that’s the comparison worth making.</p>

<h3 id="observability-stack">Observability Stack</h3>

<p>Prometheus and Grafana running via Docker Compose, with <code class="language-plaintext highlighter-rouge">host.docker.internal</code> resolving to the Mac’s loopback so the containers can scrape vLLM’s metrics endpoint:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># docker-compose.yml</span>
<span class="na">services</span><span class="pi">:</span>
  <span class="na">prometheus</span><span class="pi">:</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">prom/prometheus:latest</span>
    <span class="na">extra_hosts</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s2">"</span><span class="s">host.docker.internal:host-gateway"</span>
    <span class="na">ports</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s2">"</span><span class="s">9090:9090"</span>
    <span class="na">volumes</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">./prometheus.yml:/etc/prometheus/prometheus.yml</span>

  <span class="na">grafana</span><span class="pi">:</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">grafana/grafana:latest</span>
    <span class="na">ports</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s2">"</span><span class="s">3000:3000"</span>
    <span class="na">environment</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">GF_AUTH_ANONYMOUS_ENABLED=true</span>
      <span class="pi">-</span> <span class="s">GF_AUTH_ANONYMOUS_ORG_ROLE=Admin</span>
</code></pre></div></div>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># prometheus.yml</span>
<span class="na">global</span><span class="pi">:</span>
  <span class="na">scrape_interval</span><span class="pi">:</span> <span class="s">5s</span>

<span class="na">scrape_configs</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">job_name</span><span class="pi">:</span> <span class="s">vllm</span>
    <span class="na">static_configs</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">targets</span><span class="pi">:</span>
          <span class="pi">-</span> <span class="s1">'</span><span class="s">host.docker.internal:8000'</span>
</code></pre></div></div>

<p><strong>Metric name gotcha:</strong> the official vLLM Grafana dashboard references <code class="language-plaintext highlighter-rouge">gpu_cache_usage_perc</code> but vllm-metal exposes <code class="language-plaintext highlighter-rouge">vllm:kv_cache_usage_perc</code>. If your KV cache panel shows “No data”, confirm the correct metric name directly:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl localhost:8000/metrics | <span class="nb">grep </span>cache
</code></pre></div></div>
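<p>A quick way to make that check scriptable: scan the raw <code>/metrics</code> text for whichever of the two names this build exposes. A minimal sketch (the candidate list is just the two names above, and <code>find_cache_metric</code> is a hypothetical helper, not vLLM API):</p>

```python
# Candidate KV-cache gauge names: vllm-metal's actual name first,
# then the stale name the official dashboard references.
CANDIDATES = ("vllm:kv_cache_usage_perc", "gpu_cache_usage_perc")

def find_cache_metric(metrics_text):
    """Return the first candidate present in a Prometheus /metrics dump."""
    for name in CANDIDATES:
        for line in metrics_text.splitlines():
            # startswith also matches labeled series like name{engine="0"}
            if line.startswith(name):
                return name
    return None
```

<p>Point it at the output of <code>curl localhost:8000/metrics</code> and wire the returned name into the Grafana panel query.</p>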

<h3 id="the-kind-cluster-and-nginx-gateway">The kind Cluster and nginx Gateway</h3>

<p><strong>Why this matters:</strong> In production with llm-d, requests flow through an Envoy gateway pod before reaching vLLM pods. I wanted to replicate that topology locally — understand the gateway layer before Week 2 introduced it with real routing logic.</p>

<p>On Apple Silicon, Metal GPU cannot be passed into Docker containers. So vllm-metal has to run natively on the Mac host. The resulting topology is deliberately artificial:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>── LOCAL SETUP (Apple Silicon constraint) ───────────────────────────

curl localhost:9000
     │
     ▼  host port 9000 → kind NodePort 30000
nginx pod :80        ← simulates Envoy gateway position in llm-d
     │
     ▼  proxy_pass http://172.19.0.1:8000
vllm-metal           ← running natively on Mac host (Metal GPU)

── PRODUCTION (Week 2 — real GPU) ──────────────────────────────────

curl gateway:80
     │
     ▼
Envoy gateway pod    ← EPP does KV-cache-aware routing here
     │
     ▼
vLLM pod             ← FastAPI server + GPU inside the pod
</code></pre></div></div>

<p><strong>Apple Silicon gotcha:</strong> <code class="language-plaintext highlighter-rouge">host.docker.internal</code> inside kind on M4 Mac resolves to IPv6, but vllm-metal only binds to IPv4. The nginx <code class="language-plaintext highlighter-rouge">proxy_pass</code> fails silently — you’ll see <code class="language-plaintext highlighter-rouge">connect() failed (101: Network unreachable)</code> in the logs. Fix: get the actual bridge gateway IP:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">HOST_IP</span><span class="o">=</span><span class="si">$(</span>docker inspect vllm-lab-control-plane <span class="se">\</span>
  <span class="nt">--format</span> <span class="s1">'{{range .NetworkSettings.Networks}}{{.Gateway}}{{end}}'</span><span class="si">)</span>
<span class="nb">echo</span> <span class="nv">$HOST_IP</span>
<span class="c"># 172.19.0.1  ← use this in proxy_pass, not host.docker.internal</span>
</code></pre></div></div>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># kind-config.yaml</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">Cluster</span>
<span class="na">apiVersion</span><span class="pi">:</span> <span class="s">kind.x-k8s.io/v1alpha4</span>
<span class="na">nodes</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">role</span><span class="pi">:</span> <span class="s">control-plane</span>
    <span class="na">extraPortMappings</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">containerPort</span><span class="pi">:</span> <span class="m">30000</span>
        <span class="na">hostPort</span><span class="pi">:</span> <span class="m">9000</span>
</code></pre></div></div>

<figure style="max-width:720px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/LLM-Inference-Network-Topology.png" alt="Local vs production inference topology — nginx on kind vs Envoy in llm-d" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">Left: local Apple Silicon setup — nginx in kind proxying to vllm-metal on the Mac host. Right: production llm-d topology — Envoy gateway routing to vLLM pods with direct GPU access. The gateway position is identical; only the backend location differs.</figcaption>
</figure>

<hr />

<h2 id="experiment-1-baseline--ttft-and-tpot-directly-measured">Experiment 1: Baseline — TTFT and TPOT Directly Measured</h2>

<p>First question: what does a single uncontested request actually cost on this hardware?</p>

<p>I wrote a streaming latency script that timestamps the first chunk arrival (TTFT) then tracks per-token intervals (TPOT):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">measure_streaming</span><span class="p">(</span><span class="n">prompt</span><span class="p">,</span> <span class="n">max_tokens</span><span class="o">=</span><span class="mi">100</span><span class="p">):</span>
    <span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">perf_counter</span><span class="p">()</span>
    <span class="n">first_token_time</span> <span class="o">=</span> <span class="bp">None</span>
    <span class="n">token_times</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="k">with</span> <span class="n">client</span><span class="p">.</span><span class="n">stream</span><span class="p">(</span><span class="s">"POST"</span><span class="p">,</span> <span class="s">"/v1/chat/completions"</span><span class="p">,</span> <span class="n">json</span><span class="o">=</span><span class="n">payload</span><span class="p">)</span> <span class="k">as</span> <span class="n">resp</span><span class="p">:</span>
        <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">resp</span><span class="p">.</span><span class="n">iter_lines</span><span class="p">():</span>
            <span class="k">if</span> <span class="n">content</span> <span class="p">:</span><span class="o">=</span> <span class="n">parse_token</span><span class="p">(</span><span class="n">line</span><span class="p">):</span>
                <span class="n">now</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">perf_counter</span><span class="p">()</span>
                <span class="k">if</span> <span class="n">first_token_time</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
                    <span class="n">first_token_time</span> <span class="o">=</span> <span class="n">now</span>  <span class="c1"># ← TTFT captured here
</span>                <span class="n">token_times</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">now</span><span class="p">)</span>

    <span class="c1"># TPOT = average gap between consecutive tokens
</span>    <span class="n">intervals</span> <span class="o">=</span> <span class="p">[</span><span class="n">token_times</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="n">token_times</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
                 <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">token_times</span><span class="p">))]</span>
    <span class="n">avg_tpot</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">intervals</span><span class="p">)</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">intervals</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>Results:</strong></p>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Short prompt (~20 tokens)</th>
      <th>Long prompt (~500 tokens)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>TTFT</strong></td>
      <td><strong>341ms</strong></td>
      <td><strong>487ms</strong> (+42%)</td>
    </tr>
    <tr>
      <td><strong>TPOT</strong></td>
      <td><strong>3.4ms</strong></td>
      <td><strong>4.2ms</strong> (barely changed)</td>
    </tr>
    <tr>
      <td>Throughput</td>
      <td>145.8 tok/s</td>
      <td>110.2 tok/s</td>
    </tr>
  </tbody>
</table>

<p>The asymmetry is the point. TTFT increased 42% when the prompt got 25× longer — more tokens to process in parallel during prefill. But TPOT barely moved (3.4ms → 4.2ms) because decode generates one token at a time regardless of prompt length. Once the model starts generating, the input is already in the KV cache and irrelevant to decode speed.</p>

<blockquote>
  <p>TPOT is determined by model size and hardware bandwidth. TTFT is determined by how long your user made their prompt.</p>
</blockquote>
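<p>The asymmetry composes into a simple end-to-end estimate: total latency is roughly TTFT plus one TPOT gap for each output token after the first. A back-of-envelope sketch plugging in the measured numbers:</p>

```python
def total_latency_ms(ttft_ms, tpot_ms, n_output_tokens):
    # First token lands at TTFT; each later token adds one TPOT interval.
    return ttft_ms + (n_output_tokens - 1) * tpot_ms

short = total_latency_ms(341, 3.4, 100)  # short prompt: ~678 ms total
long_ = total_latency_ms(487, 4.2, 100)  # long prompt:  ~903 ms total
```

<p>For 100 output tokens, the 25× longer prompt costs only ~225ms more end to end, almost all of it in TTFT, which matches the throughput column in the table.</p>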

<hr />

<h2 id="experiment-2-prefix-cache--51-ttft-reduction-for-free">Experiment 2: Prefix Cache — 51% TTFT Reduction for Free</h2>

<p>KV cache stores computed key-value tensors for each token so they don’t need recomputing on subsequent steps. Prefix caching extends this across requests: if two requests share the same system prompt, the second reuses the KV cache blocks from the first.</p>
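<p>One mechanical detail worth knowing: vLLM manages the KV cache in fixed-size blocks, so only whole shared blocks are reused and a partial block at the boundary is recomputed. A sketch of that accounting (the 16-token block size is an assumption here; check your engine's configuration):</p>

```python
BLOCK_SIZE = 16  # assumed block size; configurable in vLLM

def reusable_blocks(cached_ids, new_ids):
    """Whole KV blocks a new request can reuse from a cached sequence."""
    shared = 0
    for a, b in zip(cached_ids, new_ids):
        if a != b:
            break
        shared += 1
    return shared // BLOCK_SIZE  # the trailing partial block is recomputed

# 200 identical prefix tokens: 12 full blocks skipped during prefill
```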

<p>I tested this with a long shared system prompt (~50 repetitions, creating ~200 tokens of identical prefix):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">SYSTEM_PROMPT</span> <span class="o">=</span> <span class="s">"You are a helpful assistant. "</span> <span class="o">*</span> <span class="mi">50</span>  <span class="c1"># long shared prefix
</span>
<span class="n">chat</span><span class="p">(</span><span class="s">"What is 2+2?"</span><span class="p">,</span>   <span class="s">"First request (cold cache)"</span><span class="p">)</span>
<span class="n">chat</span><span class="p">(</span><span class="s">"What is 3+3?"</span><span class="p">,</span>   <span class="s">"Second request (warm cache)"</span><span class="p">)</span>
<span class="n">chat</span><span class="p">(</span><span class="s">"What is 4+4?"</span><span class="p">,</span>   <span class="s">"Third request (warmer cache)"</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>Results:</strong></p>

<table>
  <thead>
    <tr>
      <th>Request</th>
      <th>TTFT</th>
      <th>Notes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>First (cold cache)</td>
      <td><strong>753ms</strong></td>
      <td>Full prefill — all tokens computed</td>
    </tr>
    <tr>
      <td>Second (warm cache)</td>
      <td><strong>367ms</strong></td>
      <td>51% reduction — prefix blocks reused</td>
    </tr>
    <tr>
      <td>Third (warmer cache)</td>
      <td><strong>364ms</strong></td>
      <td>Marginal further improvement</td>
    </tr>
    <tr>
      <td>Session cache hit ratio</td>
      <td><strong>17.7%</strong></td>
      <td>Across the full test session</td>
    </tr>
  </tbody>
</table>

<p>51% TTFT reduction. Zero model changes. No infrastructure changes. Just sending the same system prompt byte-for-byte across requests.</p>

<p>The operational implication: your production system prompt should be at the front of every request, identical every time. Anything that mutates it per-request — timestamp injection, per-user personalization in the system prompt, A/B testing different prompts — kills the cache hit rate and silently taxes every user’s TTFT.</p>
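<p>The failure mode is easy to demonstrate: the reusable prefix is measured from token zero, so a per-request value prepended to the system prompt destroys it, while the same value appended after it costs nothing. A small illustration (the prompt strings are made up for the example):</p>

```python
def shared_prefix_len(a, b):
    """Length of the identical leading run of two prompt strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

SYSTEM = "You are a helpful assistant. " * 50

# Variable content appended after the system prompt keeps it cacheable...
good = shared_prefix_len(SYSTEM + "[ts=1001] What is 2+2?",
                         SYSTEM + "[ts=1002] What is 3+3?")
# ...while a timestamp prepended to it zeroes the reusable prefix.
bad = shared_prefix_len("[ts=1001] " + SYSTEM,
                        "[ts=1002] " + SYSTEM)
```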

<p>This is also the per-request proof of what llm-d’s EPP prefix-cache scorer does at cluster scale: route requests to the decode pod that already holds relevant KV cache blocks. What I measured locally as a 51% reduction is what the EPP maximises across dozens of pods.</p>

<hr />

<h2 id="experiment-3-kv-cache-pressure-under-concurrent-load">Experiment 3: KV Cache Pressure Under Concurrent Load</h2>

<p>I fired 4 long concurrent requests simultaneously — each asking for 500 output tokens — to observe how continuous batching handled them:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># kv_pressure.py
</span><span class="kn">import</span> <span class="nn">concurrent.futures</span>
<span class="kn">import</span> <span class="nn">time</span>

<span class="kn">import</span> <span class="nn">httpx</span>

<span class="n">MODEL</span> <span class="o">=</span> <span class="s">"mlx-community/Qwen3-0.6B-4bit"</span>

<span class="k">def</span> <span class="nf">send_long_request</span><span class="p">(</span><span class="n">i</span><span class="p">):</span>
    <span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">perf_counter</span><span class="p">()</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">httpx</span><span class="p">.</span><span class="n">post</span><span class="p">(</span><span class="s">"http://localhost:8000/v1/chat/completions"</span><span class="p">,</span>
        <span class="n">json</span><span class="o">=</span><span class="p">{</span><span class="s">"model"</span><span class="p">:</span> <span class="n">MODEL</span><span class="p">,</span>
              <span class="s">"messages"</span><span class="p">:</span> <span class="p">[{</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"user"</span><span class="p">,</span>
                           <span class="s">"content"</span><span class="p">:</span> <span class="sa">f</span><span class="s">"Write a very long essay about distributed systems topic </span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">. Be extremely verbose and detailed."</span><span class="p">}],</span>
              <span class="s">"max_tokens"</span><span class="p">:</span> <span class="mi">500</span><span class="p">},</span>
        <span class="n">timeout</span><span class="o">=</span><span class="mi">120</span><span class="p">)</span>
    <span class="n">elapsed</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">perf_counter</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span>
    <span class="n">tokens</span> <span class="o">=</span> <span class="n">r</span><span class="p">.</span><span class="n">json</span><span class="p">()[</span><span class="s">"usage"</span><span class="p">][</span><span class="s">"completion_tokens"</span><span class="p">]</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Request </span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">: </span><span class="si">{</span><span class="n">tokens</span><span class="si">}</span><span class="s"> tokens in </span><span class="si">{</span><span class="n">elapsed</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="si">}</span><span class="s">s (</span><span class="si">{</span><span class="n">tokens</span><span class="o">/</span><span class="n">elapsed</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="si">}</span><span class="s"> tok/s)"</span><span class="p">)</span>

<span class="k">with</span> <span class="n">concurrent</span><span class="p">.</span><span class="n">futures</span><span class="p">.</span><span class="n">ThreadPoolExecutor</span><span class="p">(</span><span class="n">max_workers</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span> <span class="k">as</span> <span class="n">ex</span><span class="p">:</span>
    <span class="n">futures</span> <span class="o">=</span> <span class="p">[</span><span class="n">ex</span><span class="p">.</span><span class="n">submit</span><span class="p">(</span><span class="n">send_long_request</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">4</span><span class="p">)]</span>
</code></pre></div></div>

<p><strong>Results:</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Request 0: 500 tokens in 24.9s (20.1 tok/s)
Request 1: 500 tokens in 24.9s (20.1 tok/s)
Request 2: 500 tokens in 24.9s (20.1 tok/s)
Request 3: 500 tokens in 24.9s (20.1 tok/s)
</code></pre></div></div>

<p>All four completed simultaneously — continuous batching working correctly. In a naive sequential server, 4 requests of this size would take roughly 4× the single-request time. Instead, vLLM batched decode steps across all four active requests in a single GPU pass per iteration.</p>
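<p>The arithmetic behind that "roughly 4×": a sequential server's wall time scales with request count, while continuous batching pays one decode step per token position for the whole batch. A sketch (it ignores the mild growth of step time with batch size):</p>

```python
def sequential_wall_s(n_requests, tokens_each, step_s):
    # Requests run back to back; each holds the GPU exclusively.
    return n_requests * tokens_each * step_s

def batched_wall_s(n_requests, tokens_each, step_s):
    # Every decode step advances all n_requests sequences at once.
    return tokens_each * step_s

# From the measured batch (500 tokens in 24.9 s), the effective step
# time is 24.9/500 s; four sequential requests would need ~99.6 s.
step = 24.9 / 500
```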

<p>During this test, <code class="language-plaintext highlighter-rouge">vllm:kv_cache_usage_perc</code> climbed to 4% in Grafana and returned to baseline when all requests completed. On a 0.6B model with these short prompts, plenty of headroom. The same pattern on a larger model pushes KV cache toward the 85% danger threshold where evictions begin and latency spikes.</p>
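<p>To see why a larger model changes the picture, estimate the KV cache cost per token: two tensors (K and V) per layer, sized by KV heads × head dimension × dtype width. A sketch with approximate Qwen3-0.6B geometry (28 layers, 8 KV heads, head_dim 128 are assumptions here; read the real values from the model's <code>config.json</code>):</p>

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    # K and V tensors, one pair per layer, per token (fp16 by default).
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Assumed Qwen3-0.6B geometry: 28 layers, 8 KV heads (GQA), head_dim 128.
per_token = kv_bytes_per_token(28, 8, 128)   # 114,688 bytes, ~112 KiB/token
tokens_per_gib = (1024 ** 3) // per_token    # ~9,300 tokens per GiB of cache
```

<p>Scale the same formula to a 70B-class model and the per-token cost grows by an order of magnitude, which is why cache utilisation that idles at 4% here races toward the eviction threshold there.</p>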

<hr />

<h2 id="experiment-4-locust-mixed-traffic--where-the-m4-hits-its-ceiling">Experiment 4: Locust Mixed Traffic — Where the M4 Hits Its Ceiling</h2>

<p>Real traffic isn’t uniform. I configured Locust to simulate a 3:1 mix of short chatbot-style prompts and long document-summarization prompts:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">InferenceUser</span><span class="p">(</span><span class="n">HttpUser</span><span class="p">):</span>
    <span class="n">wait_time</span> <span class="o">=</span> <span class="n">between</span><span class="p">(</span><span class="mf">0.5</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>

    <span class="o">@</span><span class="n">task</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">short_request</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>   <span class="c1"># 75% of traffic — "What is 2+2?" etc</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">client</span><span class="p">.</span><span class="n">post</span><span class="p">(</span><span class="s">"/v1/chat/completions"</span><span class="p">,</span> <span class="n">json</span><span class="o">=</span><span class="p">{</span>
            <span class="s">"model"</span><span class="p">:</span> <span class="n">MODEL</span><span class="p">,</span>
            <span class="s">"messages"</span><span class="p">:</span> <span class="p">[{</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"user"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="s">"What is 2+2?"</span><span class="p">}],</span>
            <span class="s">"max_tokens"</span><span class="p">:</span> <span class="mi">50</span><span class="p">})</span>

    <span class="o">@</span><span class="n">task</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">long_request</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>    <span class="c1"># 25% of traffic — "Explain transformer attention..."</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">client</span><span class="p">.</span><span class="n">post</span><span class="p">(</span><span class="s">"/v1/chat/completions"</span><span class="p">,</span> <span class="n">json</span><span class="o">=</span><span class="p">{</span>
            <span class="s">"model"</span><span class="p">:</span> <span class="n">MODEL</span><span class="p">,</span>
            <span class="s">"messages"</span><span class="p">:</span> <span class="p">[{</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"user"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="s">"Explain transformer attention in depth."</span><span class="p">}],</span>
            <span class="s">"max_tokens"</span><span class="p">:</span> <span class="mi">200</span><span class="p">})</span>
</code></pre></div></div>

<p><strong>Results at 5 concurrent users:</strong></p>

<table>
  <thead>
    <tr>
      <th>Task</th>
      <th>avg</th>
      <th>min</th>
      <th>p50</th>
      <th>max</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>short_prompt</td>
      <td>9,471ms</td>
      <td>710ms</td>
      <td><strong>1,900ms</strong></td>
      <td>23,219ms</td>
    </tr>
    <tr>
      <td>long_prompt</td>
      <td>16,830ms</td>
      <td>3,152ms</td>
      <td><strong>31,000ms</strong></td>
      <td>30,508ms</td>
    </tr>
  </tbody>
</table>

<figure style="max-width:900px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/locust-test-2.png" alt="Grafana during Locust mixed traffic — E2E latency p50/p95/p99, queue time, inter-token latency, token generation rate" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    Top-left: Token Generation — spike when the Locust test starts, trailing off as the M4 saturates under mixed load.
    Top-right: Request Generation Length heatmap — two clear clusters (short bottom, long top) confirming the 3:1 traffic mix is exercised.
    Middle-left: E2E Request Latency p50/p95/p99 — P99 reaches ~1 minute for long prompts. P50 looks acceptable but hides the long-tail pain.
    Middle-right: Queue Time — spikes to 0.25s during the burst, showing requests queuing before prefill even starts.
    Bottom-left: Inter Token Latency — stays flat at 5–10ms throughout. Decode is not the bottleneck. It is prefill queuing that is killing TTFT.
    Bottom-right: Max Generation Token in Sequence Group — peaks at ~100 tokens, showing the batch composition during the test.
  </figcaption>
</figure>

<p>The long prompt p50 of 31 seconds is not a bug — it’s the fundamental prefill/decode competition at scale. Long prompts trigger expensive prefill operations that block decode steps for all concurrent requests. Short-prompt users feel it as TTFT spikes. Long-prompt users wait in a growing queue.</p>

<p>Notice that Inter Token Latency stays flat throughout. <strong>Once a request gets GPU time for decode, it’s fine. The problem is always getting to the front of the queue.</strong> This is exactly the problem P/D disaggregation solves — dedicate separate GPUs to prefill so long prompts never preempt decode.</p>
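<p>A toy model makes the queue effect concrete: on a shared engine, every prefill queued ahead of you lands in your TTFT in full, before your own prefill even starts (the millisecond figures here are illustrative, not measured):</p>

```python
def ttft_ms(queued_prefill_ms, own_prefill_ms):
    # On one shared engine you absorb every prefill queued ahead of you
    # before paying your own; decode speed never enters the picture.
    return sum(queued_prefill_ms) + own_prefill_ms

stuck = ttft_ms([600, 600], 20)  # short prompt behind two long prefills: 1220 ms
alone = ttft_ms([], 20)          # prefill handled by a dedicated pool: 20 ms
```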

<hr />

<h2 id="experiment-5-vllm-vs-ollama--the-head-to-head">Experiment 5: vLLM vs Ollama — The Head-to-Head</h2>

<p>I benchmarked vllm-metal against Ollama using vegeta at a sustained 3 req/s over 30 seconds. The primary comparison is Ollama qwen2.5:0.5b vs vllm-metal Qwen3-0.6B-4bit — same model family, same approximate parameter count, different serving engines. Mistral 7B is included as a reference only.</p>

<figure style="max-width:720px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/LLM-Latency-Comparison.png" alt="Bar chart — Ollama vs vLLM-metal latency at p50, p90, p95. vLLM 2.15x faster at p50." style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">vLLM-metal vs Ollama at sustained 3 req/s. Same hardware, comparable model sizes. The gap is continuous batching, not raw compute.</figcaption>
</figure>

<table>
  <thead>
    <tr>
      <th>Engine</th>
      <th>Model</th>
      <th>p50</th>
      <th>p95</th>
      <th>Min</th>
      <th>Success</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Ollama</td>
      <td>mistral 7B <em>(reference — 10× larger)</em></td>
      <td>15,979ms</td>
      <td>26,896ms</td>
      <td>5,157ms</td>
      <td>100%</td>
    </tr>
    <tr>
      <td>Ollama</td>
      <td>qwen2.5:0.5b <em>(primary comparison)</em></td>
      <td>14,062ms</td>
      <td>20,012ms</td>
      <td>2,569ms</td>
      <td>100%</td>
    </tr>
    <tr>
      <td>Ollama</td>
      <td>qwen3.5 (large)</td>
      <td>timeout</td>
      <td>timeout</td>
      <td>13,039ms</td>
      <td><strong>2.2%</strong></td>
    </tr>
    <tr>
      <td><strong>vllm-metal</strong></td>
      <td><strong>Qwen3-0.6B-4bit</strong> <em>(primary comparison)</em></td>
      <td><strong>6,543ms</strong></td>
      <td><strong>10,952ms</strong></td>
      <td><strong>974ms</strong></td>
      <td><strong>100%</strong></td>
    </tr>
  </tbody>
</table>

<p><strong>vLLM at p50: 2.15× faster. At p95: 1.83× faster.</strong></p>

<p>The minimum latency of 974ms — sub-second — is the clearest signal: when the server isn’t saturated, continuous batching delivers a first response before Ollama has even started processing. Ollama’s 2,569ms minimum reflects its sequential model — each request waits for the previous one to complete before the GPU is available.</p>

<p>The qwen3.5 failure (2.2% success) is instructive. A larger model at 3 req/s causes Ollama’s queue to back up until clients time out. vllm-metal handles the same rate at 100% success because it batches multiple requests into each GPU forward pass rather than serialising them.</p>
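<p>The collapse is predictable from queueing arithmetic: a sequential server that needs <em>s</em> seconds per request drains 1/s req/s, and any higher arrival rate grows the backlog without bound. A sketch using Ollama's ~14s p50 from the table as the service time:</p>

```python
def backlog(arrival_rate_rps, service_s, duration_s):
    # A sequential server completes duration/service requests; every
    # other arrival piles up in the queue (and eventually times out).
    completed = duration_s / service_s
    return arrival_rate_rps * duration_s - completed

# 3 req/s for 30 s at ~14 s/request: 90 arrivals, ~2 completions.
```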

<figure style="max-width:900px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/locust-test-1.png" alt="Grafana — scheduler state (num running vs num waiting), token throughput, TTFT over 2 minutes, cache utilisation" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    Top-left: <code>gpu_cache_usage_perc</code> shows "No data" — this is the wrong metric name for vllm-metal. Use <code>vllm:kv_cache_usage_perc</code> instead.
    Top-right: Token Throughput — spikes to ~80 tok/s during the load test, confirming concurrent batching is active.
    Middle-left: Requests Waiting vs Running — spikes at 14:45 are the concurrent requests being admitted. Queue depth briefly reaches 5 before draining.
    Middle-right: Scheduler State — Num Running peaks at 4–5 simultaneously, all requests making progress at once.
    Bottom-left: TTFT over 2m — climbs during load, recovers as the queue clears. The correlation between queue depth and TTFT is direct.
    Bottom-right: Cache Utilisation — peaks at ~4% during load, returns to baseline after. Plenty of KV cache headroom on a 0.6B model.
  </figcaption>
</figure>

<hr />

<h2 id="experiment-6-the-nginx-k8s-gateway--what-the-proxy-layer-actually-costs">Experiment 6: The nginx K8s Gateway — What the Proxy Layer Actually Costs</h2>

<p>With the kind cluster running and nginx proxying to vllm-metal, I ran inference requests through the full gateway path. I used <code class="language-plaintext highlighter-rouge">max_tokens=20</code> — tiny inference — so the measured latency is mostly proxy traversal overhead:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vegeta attack <span class="nt">-rate</span><span class="o">=</span>3/s <span class="nt">-duration</span><span class="o">=</span>15s <span class="se">\</span>
  <span class="nt">-targets</span><span class="o">=</span>&lt;<span class="o">(</span><span class="nb">echo</span> <span class="s2">"POST http://localhost:9000/v1/chat/completions
Content-Type: application/json
@/tmp/vllm_gateway.json"</span><span class="o">)</span> | vegeta report
</code></pre></div></div>

<p><strong>Results (45 requests, 3 req/s, max_tokens=20):</strong></p>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
      <th>Notes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>p50 latency</td>
      <td><strong>232ms</strong></td>
      <td>localhost:9000 → kind → nginx → vllm-metal</td>
    </tr>
    <tr>
      <td>p95 latency</td>
      <td><strong>277ms</strong></td>
      <td> </td>
    </tr>
    <tr>
      <td>Min latency</td>
      <td><strong>163ms</strong></td>
      <td>Mostly proxy traversal + tiny inference</td>
    </tr>
    <tr>
      <td>Success rate</td>
      <td><strong>100%</strong></td>
      <td>45/45 — gateway adds no failures</td>
    </tr>
  </tbody>
</table>

<p>The 163ms minimum reflects proxy traversal cost, not inference time — <code class="language-plaintext highlighter-rouge">max_tokens=20</code> on a 0.6B model generates tokens in tens of milliseconds. The meaningful result is 100% success at sustained rate. The gateway adds overhead but is not a bottleneck and does not drop requests.</p>

<p>In production, Envoy adds single-digit milliseconds — it’s purpose-built for high-throughput proxying. The nginx simulation here adds more, but the structural lesson holds: a gateway layer in front of vLLM does not meaningfully affect inference latency. What matters in llm-d is the EPP’s routing intelligence — prefix cache scoring, queue depth scoring, KV utilisation scoring — not the proxy overhead itself.</p>

<hr />

<h2 id="what-grafana-showed--the-full-picture">What Grafana Showed — The Full Picture</h2>

<p>The dashboard below captures the most informative view — TTFT latency percentiles, the prefill vs decode time split, finish reasons, and prompt length distribution:</p>

<figure style="max-width:900px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/locust-test-3.png" alt="Grafana full overview — TTFT latency p50/p95/p99, Prefill and Decode Time separated, Finish Reason, Request Prompt Length" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    Top-left: Time To First Token Latency — P99 spikes to ~7s under load while P50 stays under 2s. The gap between P50 and P99 is the prefill queue effect.
    Top-right: Request Prompt Length heatmap — two clusters confirm short vs long prompt traffic mix is being exercised correctly.
    Bottom-left: Finish Reason — "length" dominates, meaning requests hit max_tokens as expected. No unexpected aborts or errors.
    Bottom-right (highlighted): Requests Prefill and Decode Time — green is prefill, yellow is decode. Prefill varies with prompt length; decode stays flat. This is the visual proof of the two-phase separation. In a P/D disaggregated deployment, these two lines come from separate pod pools.
  </figcaption>
</figure>

<p>Three takeaways from Grafana that weren’t obvious before running the experiments:</p>

<p><strong>TPOT is a red herring under load.</strong> Inter Token Latency stayed flat throughout every experiment — even when E2E latency climbed to 31 seconds for long prompts. The per-token decode speed is stable. The problem is always pre-decode queuing.</p>

<p><strong>The prefill/decode time split is visible and separable.</strong> Prefill varies with prompt length, decode stays constant. In a disaggregated setup each line would come from a different pool of pods, independently scalable.</p>

<p><strong>KV cache utilisation is your early warning system.</strong> Peak of 4% during these experiments — plenty of headroom for a 0.6B model. On a larger model or busier system, the moment this crosses 85% is your fire alarm.</p>

<hr />

<h2 id="the-connection-to-week-2">The Connection to Week 2</h2>

<p>Everything measured this week points to the same structural problem: in aggregated serving, prefill and decode compete for the same resources. Long prompts delay short ones. High-concurrency workloads cause cascading TTFT degradation that no amount of hardware scaling can fully fix — because the problem is architectural, not computational.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>── AGGREGATED SERVING (what Week 1 showed) ──────────────────────────

Request A (long prompt)  → [  prefill 487ms  ][  decode decode decode ...  ]
Request B (short prompt) → [WAITING...       ][  prefill 341ms  ][ decode  ]
Request C (short prompt) → [WAITING.....................][prefill ][ decode  ]

── P/D DISAGGREGATION (what Week 2 will fix) ────────────────────────

Prefill pool: [ A-prefill ][ B-prefill ][ C-prefill ]  ← compute-bound
Decode pool:  [ A-decode  ][ B-decode  ][ C-decode  ]  ← memory-bandwidth-bound

Decode pool never blocks on prefill. TTFT stays consistent under load.
</code></pre></div></div>

<p>The numbers from this week — TTFT 341ms vs 487ms, prefix cache 51% reduction, 31-second long-prompt p50 under load — are the baseline. Week 2 is the comparison.</p>

<hr />

<h2 id="the-complete-benchmark-reference">The Complete Benchmark Reference</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Hardware: M4 Mac Mini, 16GB unified memory
Model:    mlx-community/Qwen3-0.6B-4bit
Engine:   vllm-metal 0.13.0

─── Single request baseline ──────────────────────────────
  TTFT (short prompt, ~20 tokens):  341ms
  TTFT (long prompt, ~500 tokens):  487ms  (+42%)
  TPOT (short):                     3.4ms  (≈ 294 tok/s)
  TPOT (long):                      4.2ms  (barely changed)
  Throughput (single request):      145.8 tok/s

─── Prefix cache ─────────────────────────────────────────
  Cold TTFT (first request):        753ms
  Warm TTFT (same prefix):          367ms  (51% reduction)
  Session cache hit ratio:          17.7%

─── KV pressure (4 concurrent × 500 tokens) ─────────────
  Wall time (all 4 concurrent):     24.9s
  Per-request throughput:           20.1 tok/s
  Result: continuous batching confirmed

─── Mixed load (5 users, 3:1 short/long via Locust) ─────
  short_prompt: avg=9,471ms  min=710ms  p50=1,900ms  max=23,219ms
  long_prompt:  avg=16,830ms min=3,152ms p50=31,000ms max=30,508ms

─── Ollama vs vLLM (3 req/s, 30s, vegeta) ───────────────
  Ollama / mistral 7B p50:         15,979ms  (reference — 10× larger)
  Ollama / qwen2.5:0.5b p50:       14,062ms  (primary comparison)
  Ollama / qwen3.5 (large):        timeout   (2.2% success)
  vLLM  / Qwen3-0.6B-4bit p50:     6,543ms
  vLLM advantage at p50:            2.15× faster
  vLLM advantage at p95:            1.83× faster

─── nginx K8s gateway (max_tokens=20, 45 requests) ──────
  p50:  232ms  |  p95: 277ms  |  min: 163ms  |  Success: 100%
  Note: mostly proxy overhead — max_tokens=20 is tiny inference
</code></pre></div></div>

<hr />

<h2 id="what-this-points-to">What This Points To</h2>

<p>The Mac Mini experiments answered the questions they were designed for. TTFT/TPOT/KV cache behave exactly as the theory predicts. Continuous batching is real and measurable. The gateway layer adds overhead but doesn’t drop requests. And Ollama’s lack of continuous batching is not a footnote — it’s the difference between a useful serving system and one that falls over at 3 requests per second with a moderately sized model.</p>

<p>What the Mac Mini can’t answer: what happens when you separate prefill and decode onto dedicated hardware? What does EPP prefix-cache-aware routing look like in practice? And eventually — how do TGI, TensorRT-LLM, and SGLang compare under the same load test conditions? Those experiments need cloud GPU budget earmarked for specific Week 2 and Week 3 labs. When they happen, they’ll get their own posts with real numbers.</p>

<hr />

<p><strong>Next up:</strong> Post 3 covers llm-d deployment on a Lambda Labs GH200 — the ten things nobody tells you before you try to run a Helm-based inference scheduler on K3s, including the NIXL/RDMA failure that explains why single-GPU P/D disaggregation doesn’t work the way you’d hope.</p>

<hr />

<p><em>All experiments run on M4 Mac Mini 16GB, vllm-metal 0.13.0, Qwen3-0.6B-4bit. Scripts will be made available soon via a GitHub repository. I’m a platform engineer with 11+ years in distributed systems currently going deep on LLM serving. I write what I actually measured, including the parts that hit walls.</em></p>

<p><em><a href="https://github.com/kraghavan">GitHub</a> · <a href="https://linkedin.com/in/karthikaraghavan">LinkedIn</a></em></p>]]></content><author><name>Karthika Raghavan</name></author><category term="llm-infrastructure" /><category term="inference" /><category term="vllm" /><category term="ollama" /><category term="apple-silicon" /><category term="m4" /><category term="prometheus" /><category term="grafana" /><category term="nginx" /><category term="kubernetes" /><category term="benchmarks" /><category term="sre" /><summary type="html"><![CDATA[I treated an M4 Mac Mini as a production-like inference environment — wired up Prometheus, Grafana, a kind cluster with nginx, and ran real load tests. Here's what the numbers actually showed.]]></summary></entry><entry><title type="html">What Is LLM Inference, Really? A Deep Technical Walkthrough</title><link href="https://kraghavan.github.io/llm-infrastructure/inference/2026/04/14/re-introduction-to-inference.html" rel="alternate" type="text/html" title="What Is LLM Inference, Really? A Deep Technical Walkthrough" /><published>2026-04-14T00:00:00+00:00</published><updated>2026-04-14T00:00:00+00:00</updated><id>https://kraghavan.github.io/llm-infrastructure/inference/2026/04/14/re-introduction-to-inference</id><content type="html" xml:base="https://kraghavan.github.io/llm-infrastructure/inference/2026/04/14/re-introduction-to-inference.html"><![CDATA[<p>Let me be honest with you. When I started working on LLM infrastructure, I had eleven years of distributed systems experience. I knew Kafka, Kubernetes, Prometheus. I could debug a partition rebalance in my sleep.</p>

<p>And yet the first time someone asked me <em>what actually happens during inference</em>, I said something like “the model reads the prompt and generates tokens.” Which is technically true the same way “a database reads your query and returns rows” is technically true — accurate, useless, and deeply embarrassing for someone drawing a principal engineer’s salary.</p>

<p>This post is what I wish I’d had on day one. We’re going to walk through the entire inference pipeline — from the moment your request arrives to the moment you see text on screen — with real examples, honest explanations of where the performance goes, and enough detail that you can actually reason about production problems.</p>

<p>No “and then the transformer does its thing.” No skipped steps. Strap in.</p>

<hr />

<figure style="max-width:480px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/gemini-generated-llm-inference-pipeline.png" alt="LLM Inference Pipeline" style="width:100%;" />
</figure>

<hr />

<h2 id="1-what-is-inference">1. What Is Inference?</h2>

<p><strong>Training</strong> is where you take a massive dataset, run it through a model millions of times, and slowly adjust billions of numerical weights until the model gets good at predicting the next word. Training is done once (or occasionally). It costs millions of dollars in GPU-hours and requires a team of researchers.</p>

<p><strong>Inference</strong> is what happens afterward, every time someone uses the model. It’s the model <em>using</em> those learned weights to respond to new input. No weights change. No learning happens. It’s pure forward-pass computation.</p>

<p>Think of it like this: training is baking the bread. Inference is slicing it and serving it to customers. The bread (weights) is done. The kitchen (inference engine) just has to plate it fast enough that the queue doesn’t back up to the street.</p>

<p>The inference engine is the runtime that takes the frozen model weights and executes them against an input. The same weights can run on Ollama, vLLM, TensorRT-LLM, or TGI — and get meaningfully different performance from each. The weights don’t change. The execution strategy does.</p>

<p>This distinction matters operationally: <strong>inference is not a solved problem</strong>. Serving a model efficiently at scale is a full engineering discipline.</p>

<hr />

<h2 id="2-the-artifact-whats-actually-in-that-10gb-download">2. The Artifact: What’s Actually in That 10GB Download?</h2>

<p>When you run <code class="language-plaintext highlighter-rouge">ollama pull mistral</code> or grab a model from HuggingFace, you aren’t just downloading a “program.” You’re downloading a massive, frozen brain in a box. If you’ve ever wondered why a model that “just chats” takes up 10GB of your SSD, it’s because it is packed with billions of tiny numerical “preferences” the model learned during its training phase.</p>

<p>Think of the <strong>GGUF</strong> (or <strong>Safetensors</strong>) file as a giant Ikea flat-pack box. To build the working model, you need two things: the <strong>Instruction Manual</strong> and the <strong>Hardware</strong>.</p>

<p>What’s inside a 7B parameter model file:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GGUF file structure (simplified):
├── Header
│   ├── Model architecture (LlamaForCausalLM)
│   ├── Vocabulary (32000 tokens + their embeddings)
│   ├── Context length (4096, 8192, etc.)
│   └── Hyperparameters (n_layers, n_heads, etc.)
│
└── Weight tensors:
    ├── token_embeddings        [32000 × 4096]   ← the embedding matrix
    ├── layer.0.attention.q     [4096 × 4096]    ← Query projection weights
    ├── layer.0.attention.k     [4096 × 4096]    ← Key projection weights
    ├── layer.0.attention.v     [4096 × 4096]    ← Value projection weights
    ├── layer.0.attention.out   [4096 × 4096]    ← Output projection
    ├── layer.0.ffn.up          [4096 × 11008]   ← Feed-forward up
    ├── layer.0.ffn.down        [11008 × 4096]   ← Feed-forward down
    ├── ... × 32 layers
    └── output_norm + lm_head   [32000 × 4096]   ← Final projection to logits
</code></pre></div></div>
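<p>To make the “header as manual” idea concrete, here’s a minimal sketch of parsing just the fixed-size fields at the front of a GGUF file. The layout (magic, version, tensor count, metadata count) follows the GGUF spec; <code>parse_gguf_header</code> is a name I made up, and a real parser would go on to read the metadata key/value section after these fields:</p>

```python
import struct

def parse_gguf_header(data: bytes) -> dict:
    # Fixed-size prefix of a GGUF file, little-endian:
    # 4-byte magic "GGUF", uint32 version, uint64 tensor count,
    # uint64 metadata key/value count. Real parsers keep reading the
    # metadata section (architecture, vocab, hyperparameters) after this.
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}

# Synthetic header so the sketch runs without a 4GB download:
header = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(parse_gguf_header(header))
```

<p>Everything after those 24 bytes is metadata and then the weight tensors themselves, which is why the header is kilobytes and the file is gigabytes.</p>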

<h3 id="the-manual-the-header">The “Manual” (The Header)</h3>
<p>This is the first few kilobytes of the file. It tells the inference engine (like Ollama or vLLM) how to put the brain together. It includes:</p>
<ul>
  <li><strong>The Architecture</strong>: Identifies the model type (e.g., <code class="language-plaintext highlighter-rouge">LlamaForCausalLM</code>) so the engine knows which math rules to apply.</li>
  <li><strong>The Vocabulary</strong>: A dictionary of roughly 32,000 to 128,000 “tokens” (the syllables the model speaks).</li>
  <li><strong>The Hyperparameters</strong>: Crucial settings like the number of layers (32 or 80) and the context length (how much it can remember).</li>
</ul>

<h3 id="the-hardware-the-tensors">The “Hardware” (The Tensors)</h3>
<p>The rest of that file is just rows and rows of numbers called <strong>Weights</strong>. Every inference request is essentially looking up values from these matrices and multiplying them together—32 times over.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">The “Part”</th>
      <th style="text-align: left">What it actually does in plain English</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><strong><code class="language-plaintext highlighter-rouge">token_embeddings</code></strong></td>
      <td style="text-align: left"><strong>The Translator.</strong> Turns human text into the model’s internal number-language.</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong><code class="language-plaintext highlighter-rouge">attention.q, k, v</code></strong></td>
      <td style="text-align: left"><strong>The Highlighters.</strong> Helps the model decide which part of your sentence is important.</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong><code class="language-plaintext highlighter-rouge">ffn.up</code> &amp; <code class="language-plaintext highlighter-rouge">ffn.down</code></strong></td>
      <td style="text-align: left"><strong>The Reasoning Muscles.</strong> Does the heavy lifting of processing and transforming information.</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong><code class="language-plaintext highlighter-rouge">lm_head</code></strong></td>
      <td style="text-align: left"><strong>The Microphone.</strong> Turns the final internal math back into a word you can read.</td>
    </tr>
  </tbody>
</table>

<h3 id="quantization-shrinking-the-brain">Quantization: Shrinking the Brain</h3>
<p>You might notice some files are 15GB while others are 4GB for the same model. This is <strong>Quantization</strong>—the art of compression. We turn high-precision 16-bit floats into lower-precision integers (like 4-bit).</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Precision</th>
      <th style="text-align: left">Bits per weight</th>
      <th style="text-align: left">7B Model Size</th>
      <th style="text-align: left">The SRE Reality</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><strong>FP16</strong></td>
      <td style="text-align: left">16</td>
      <td style="text-align: left">~14GB</td>
      <td style="text-align: left">Requires an A100. Pristine quality.</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>INT8</strong></td>
      <td style="text-align: left">8</td>
      <td style="text-align: left">~7GB</td>
      <td style="text-align: left">Fits on a high-end gaming GPU. Minimal loss.</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>INT4 (Q4_K_M)</strong></td>
      <td style="text-align: left">4</td>
      <td style="text-align: left">~4GB</td>
      <td style="text-align: left"><strong>The Sweet Spot.</strong> Fits on a MacBook. Faster throughput.</td>
    </tr>
  </tbody>
</table>
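<p>The “7B Model Size” column is just arithmetic: parameter count × bits per weight, ignoring the small overhead of headers and norm layers. A quick back-of-envelope check:</p>

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    # Approximate size of the weight tensors alone, in decimal gigabytes
    return n_params * bits_per_weight / 8 / 1e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {model_size_gb(7e9, bits):.1f} GB")
# FP16: 14.0 GB
# INT8: 7.0 GB
# INT4: 3.5 GB
```

<p>Real Q4_K_M files land a bit above the pure 4-bit number because some tensors are kept at higher precision to protect quality.</p>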

<p><strong>Why SREs love INT4:</strong> Lower precision = smaller tensors = faster memory transfers. Because decoding is memory-bound, an INT4 model often delivers <strong>20-40% better TPOT</strong> (time per output token, which translates to more tokens per second) than the “full” version: the memory bus isn’t screaming as loud.</p>

<p><strong>The takeaway:</strong> You aren’t executing code; you are loading a massive, math-heavy lookup table. GGUF is your single-file “box,” and quantization is how you fit that box into a smaller truck (your GPU).</p>

<hr />

<h2 id="3-the-three-phases-a-map-before-the-territory">3. The Three Phases: A Map Before the Territory</h2>

<p>Every inference request goes through three broad phases. They are not equally expensive, not equally parallelizable, and not equally friendly to your p99 latency.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌────────────────┬────────────────────────────────┬────────────────┐
│  TOKENIZATION  │  PREFILL (Prompt Processing)   │  DECODE LOOP   │
│     (CPU)      │        (GPU, compute)          │  (GPU, memory) │
│                │                                │                │
│  Text → IDs    │  Embed → Position → Attention  │  current →     │
│                │                                │  next token    │
└────────────────┴────────────────────────────────┴────────────────┘
     Fast          Scales with prompt length          Slow
</code></pre></div></div>

<ul>
  <li><strong>Tokenization</strong>: Split the text into token IDs the model understands. CPU-bound. Fast.</li>
  <li><strong>Prefill</strong>: Process the entire prompt through the model. GPU compute-bound. Scales with prompt length.</li>
  <li><strong>Decode</strong>: Generate output tokens one at a time. GPU memory-bound. Runs in a loop until done.</li>
</ul>

<p>Each phase has its own bottleneck. Let’s go through them one by one.</p>

<hr />

<h2 id="4-tokenization-chopping-text-into-numbers">4. Tokenization: Chopping Text Into Numbers</h2>

<p>Before a single GPU operation happens, your text has to be converted into a format the model can work with: a sequence of integers called token IDs.</p>

<p>A token is not a character, and it’s not a word. It’s a chunk of text that appears frequently enough in the training corpus to deserve its own ID. There are several ways to build this vocabulary — WordPiece (used by BERT), Unigram (used by SentencePiece), and others — but the dominant approach in modern LLMs is Byte Pair Encoding (BPE): a compression algorithm that iteratively merges the most common pairs of characters or subwords into single tokens until it reaches a target vocabulary size.</p>

<p>The result is a vocabulary of roughly 32,000–128,000 tokens, each with a corresponding integer ID. The model never sees your text — it sees a list of numbers.</p>

<figure style="max-width:720px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/encoding-bpe.png" alt="Tokenization with BPE" style="width:100%;" />
</figure>

<p>Take our example prompt: <code class="language-plaintext highlighter-rouge">"The cat sat"</code></p>

<p>After tokenization, this becomes something like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"The"  → 1026
" cat" → 5992
" sat" → 3290
</code></pre></div></div>
<p>Token IDs: <code class="language-plaintext highlighter-rouge">[1026, 5992, 3290]</code></p>

<p>Note the space before “cat” and “sat” — it’s part of the token. Tokenizers care about whitespace because it affects meaning and frequency.</p>
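<p>The merge loop at the heart of BPE is small enough to sketch in a few lines. This toy version trains on a three-word corpus; real tokenizers do the same thing over billions of words with tens of thousands of merges, and operate on bytes rather than characters:</p>

```python
from collections import Counter

def most_frequent_pair(words):
    # Count adjacent symbol pairs across the whole (weighted) corpus
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    # Replace every occurrence of `pair` with a single merged symbol
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word starting as individual characters
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
for _ in range(2):   # two merge steps
    words = merge_pair(words, most_frequent_pair(words))
print(words)
```

<p>After two merges the corpus represents “low” as a single symbol. Frequent sequences earn their own token IDs; rare ones stay split into subwords.</p>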

<h3 id="is-tokenization-cpu-bound">Is Tokenization CPU-Bound?</h3>

<p>Yes. The tokenizer is usually written in Rust (HuggingFace’s <code class="language-plaintext highlighter-rouge">tokenizers</code> crate) or C++ for exactly this reason. For most requests it’s fast enough to be invisible — microseconds for a short prompt.</p>

<p>Where it bites you: <strong>very long documents</strong> fed to batch processing jobs. A 100,000-token context requires processing 100,000 token lookups. It’s still fast relative to GPU work, but it’s the one step in the pipeline running on CPU that you can’t just throw more GPU at.</p>

<p><strong>How it’s improved:</strong> Parallelizing tokenization across CPU cores for batch workloads. Or — and this is the real fix — <strong>not re-tokenizing the same content repeatedly</strong>. If you have a shared system prompt you send to every request, tokenizing it once and caching the result is free latency.</p>
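<p>A minimal sketch of that caching fix, assuming the tokenizer call is deterministic and its result hashable. The toy <code>cached_token_ids</code> below stands in for a real <code>tokenizer.encode</code> call; the hashing trick is just a placeholder for actual vocabulary lookup:</p>

```python
from functools import lru_cache

SYSTEM_PROMPT = "You are a helpful assistant."

@lru_cache(maxsize=128)
def cached_token_ids(text: str) -> tuple[int, ...]:
    # Stand-in for the real tokenizer call (e.g. tokenizer.encode(text)).
    # Returning a tuple keeps the cached value hashable and immutable.
    return tuple(hash(word) % 32000 for word in text.split())

prefix = cached_token_ids(SYSTEM_PROMPT)          # tokenized once...
assert cached_token_ids(SYSTEM_PROMPT) is prefix  # ...then served from cache
```

<p>For a shared system prompt hit on every request, this turns repeated tokenization into a dictionary lookup.</p>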

<hr />

<h2 id="5-prefill-the-model-reads-your-prompt">5. Prefill: The Model Reads Your Prompt</h2>

<p>Now we have token IDs. The model needs to turn those IDs into something it can reason about. This is <strong>prefill</strong> — the model processing the entire prompt in one shot.</p>

<p>Prefill has two sub-steps that are easy to conflate: <strong>embedding lookup</strong> and the actual <strong>transformer forward pass</strong>. Let’s take them in order.</p>

<h3 id="the-embedding-matrix">The Embedding Matrix</h3>

<p>Every token ID maps to a high-dimensional vector of floating-point numbers called an <strong>embedding</strong>. These vectors live in the model’s embedding matrix — a giant lookup table with one row per vocabulary token and one column per embedding dimension.</p>

<p>For a model with a 32,000-token vocabulary and 4,096 embedding dimensions, this matrix has shape <code class="language-plaintext highlighter-rouge">[32000, 4096]</code>. At 16-bit float precision, that’s about 256MB just for the embedding layer.</p>

<p>Our example <code class="language-plaintext highlighter-rouge">[1026, 5992, 3290]</code> becomes:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Token ID 1026 → embedding row 1026 → [0.12, -0.43, 0.81, ..., 0.07]  (4096 values)
Token ID 5992 → embedding row 5992 → [-0.34, 0.91, 0.12, ..., -0.22] (4096 values)
Token ID 3290 → embedding row 3290 → [0.67, 0.05, -0.88, ..., 0.44]  (4096 values)
</code></pre></div></div>

<p>I’m simplifying to 8 dimensions here so this fits on a page. In reality it’s 4,096 or 8,192 dimensions depending on the model.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Simplified (3D instead of 4096D), just to show the shape:

"The"  → [0.12, -0.43,  0.81]
" cat" → [-0.34,  0.91,  0.12]
" sat" → [0.67,  0.05, -0.88]

Shape: [3 tokens × 3 dims] = a matrix of floats
</code></pre></div></div>

<p>These vectors aren’t random. They’re the result of training — the model has learned that “cat” and “dog” live close together in this space, and “cat” and “quantum mechanics” are far apart. The geometry encodes semantic meaning.</p>

<figure style="max-width:720px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/embedding-lookup-process.png" alt="Embedding Lookup Process" style="width:100%;" />
</figure>
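<p>Both ideas fit in a few lines of plain Python: the lookup is literally list indexing, and “close together in this space” is measured with cosine similarity. The matrix here is random, so don’t expect meaningful neighbours; in a trained model these rows are where the semantics live:</p>

```python
import math, random

random.seed(0)
VOCAB, DIM = 32000, 8    # 8 dims instead of 4096 so it fits on screen

# The embedding matrix: one learned row of floats per vocabulary token
embeddings = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]

def embed(token_ids):
    # Prefill's first step is literally row lookup, no computation
    return [embeddings[t] for t in token_ids]

def cosine(a, b):
    # How semantic "closeness" between two embedding vectors is measured
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

seq = embed([1026, 5992, 3290])    # "The", " cat", " sat"
print(len(seq), len(seq[0]))       # 3 8: three tokens, eight dims each
```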

<h3 id="how-do-the-model-weights-help-here">How Do the Model Weights Help Here?</h3>

<p>The embedding matrix IS part of the model weights. The 10GB (or 40GB, or 70GB) file you download — the GGUF or safetensors file — contains all the weight matrices the model learned during training. The embedding lookup is literally indexing into one of those weight matrices by row number.</p>

<p>When you run inference, you’re not computing anything creative. You’re doing matrix math against frozen numbers that were tuned over millions of training iterations.</p>

<hr />

<h2 id="6-positional-embeddings-teaching-the-model-about-order">6. Positional Embeddings: Teaching the Model About Order</h2>

<p>Here’s a problem: the embedding lookup is a table lookup. It doesn’t care that “cat” is token 2 and “sat” is token 3. Two requests with the same tokens in different orders would produce identical embeddings.</p>

<p>But order matters enormously. “The cat sat on the dog” and “The dog sat on the cat” have the same tokens and very different meanings.</p>

<p><strong>Positional embeddings</strong> solve this by adding a position-aware vector to each token’s embedding. The model learns that “token at position 1” feels different from “token at position 5,” even if the token ID is the same.</p>

<h3 id="how-is-it-calculated">How Is It Calculated?</h3>

<p>There are two main approaches:</p>

<p><strong>Sinusoidal (original Transformers paper):</strong> Compute a fixed sine/cosine wave pattern based on position and dimension index. Deterministic, no learned parameters.</p>

<p><strong>RoPE (Rotary Position Embedding):</strong> Used by Llama, Qwen, Mistral, and most modern models. Instead of adding a vector, it <em>rotates</em> the query and key vectors by an angle proportional to position. The result: the dot product between two token representations naturally captures their relative distance. Elegant, and generalizes better to longer contexts than the training data.</p>
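<p>A minimal sketch of the rotation itself, assuming the standard RoPE angle schedule where dimension pair <code>i</code> is rotated by <code>pos · base^(-2i/d)</code>. This is the idea, not a drop-in from any library. Two properties worth checking: position 0 is the identity, and rotation never changes a vector’s length, so only its direction encodes position:</p>

```python
import math

def rope(vec, pos, base=10000.0):
    # Rotate consecutive dimension pairs; pair i gets angle pos * base^(-2i/d).
    # Rotation preserves length, so only the direction encodes position.
    d = len(vec)
    out = list(vec)
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = vec[i], vec[i + 1]
        out[i] = x1 * c - x2 * s
        out[i + 1] = x1 * s + x2 * c
    return out

vec = [1.0, 0.0, 0.5, -0.5]
print(rope(vec, 0) == vec)   # True: position 0 is the identity rotation
```

<p>The payoff is the relative property: the dot product between a query rotated to position <code>m</code> and a key rotated to position <code>n</code> depends only on <code>m − n</code>, which is exactly what attention needs.</p>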

<p>Continuing our example. After adding positional information:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"The"  at position 0: [0.12, -0.43, 0.81] + pos(0) → [0.15, -0.40, 0.84]
" cat" at position 1: [-0.34, 0.91, 0.12] + pos(1) → [-0.31, 0.88, 0.09]
" sat" at position 2: [0.67, 0.05, -0.88] + pos(2) → [0.65, 0.03, -0.86]
</code></pre></div></div>

<p>The position vectors are small adjustments. Their real value is that when attention is computed later, the model can tell whether two tokens are adjacent or 200 positions apart.</p>

<figure style="max-width:720px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/positional-embeddings-diagram.png" alt="Positional Embeddings Diagram" style="width:100%;" />
</figure>

<h3 id="cpu-bottleneck-in-prefill">CPU Bottleneck in Prefill?</h3>

<p>Embedding lookup and positional encoding are fast operations. The real CPU bottleneck in prefill is less about these steps and more about <strong>data movement</strong>: loading the right weight tensors from CPU RAM to GPU VRAM before the transformer forward pass can begin.</p>

<p>For very large models that don’t fully fit in VRAM, the CPU-GPU transfer becomes the bottleneck — you’re constantly paging weight blocks in. This is why model quantization matters: a 4-bit quantized model uses less VRAM, fits entirely on GPU, and eliminates this transfer overhead. More on that in a moment.</p>

<hr />

<h2 id="7-the-transformer-layers-where-the-real-work-happens">7. The Transformer Layers: Where the Real Work Happens</h2>

<p>After embedding + positional encoding, we have a matrix of shape <code class="language-plaintext highlighter-rouge">[sequence_length × embedding_dim]</code>. This matrix now passes through N transformer layers — 32 layers for Llama-3.1-8B, 80 layers for Llama-3.1-70B.</p>

<p>Each layer applies:</p>
<ol>
  <li><strong>Self-attention</strong>: every token looks at every other token and decides what’s relevant</li>
  <li><strong>Feed-forward network (FFN)</strong>: each token’s representation is independently transformed</li>
</ol>

<p>This is where the model’s “reasoning” happens — and where most of the GPU compute goes during prefill. All tokens are processed in parallel within a layer, making prefill compute-bound. More tokens = more compute = higher TTFT.</p>

<p>We’ll cover the attention mechanism in detail in section 9. First, let’s see what comes out.</p>

<hr />

<h2 id="8-decoding-one-token-at-a-time-forever">8. Decoding: One Token at a Time, Forever</h2>

<p>After prefill, the model produces its first output token. Then it produces another. Then another. Each token depends on all previous tokens. This is the <strong>decode loop</strong>.</p>

<p>Here’s what makes decode fundamentally different from prefill: <strong>you can’t parallelize it</strong>. Token N can’t be computed until token N-1 exists. It’s inherently sequential.</p>

<p>Let’s walk through two steps with our example. Our prompt was “The cat sat” and let’s say the model is going to output “on the mat.”</p>

<h3 id="decode-step-1-predicting-on">Decode Step 1: Predicting “on”</h3>

<p>After prefill, we have KV cache entries for “The”, “ cat”, “ sat” (we’ll explain KV cache shortly). Now:</p>

<ol>
  <li>The model takes the last token’s representation (“ sat”) and runs it through the transformer layers</li>
  <li>At each layer, attention is computed between “ sat” and all previous tokens via the KV cache</li>
  <li>The final layer outputs a vector of size <code class="language-plaintext highlighter-rouge">[vocabulary_size]</code> — one score per possible next token. This is called the <strong>logits</strong> vector.</li>
  <li>The logits are converted to probabilities via softmax</li>
  <li>A token is sampled from this distribution (more below)</li>
  <li>Result: token ID for “ on” → <code class="language-plaintext highlighter-rouge">" on"</code> is emitted as the first output token</li>
</ol>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>KV cache: ["The", " cat", " sat"]
Current:  " sat" (last input token)
Attention: " sat" attends to "The", " cat", " sat"
Output probs:  [0.001, 0.003, ..., 0.45 (" on"), ..., 0.002]  ← post-softmax
Sample: " on" ✓
</code></pre></div></div>

<h3 id="decode-step-2-predicting-the">Decode Step 2: Predicting “the”</h3>

<p>Now “ on” has been generated. We add it to context:</p>

<ol>
  <li>Embed token “ on” → one new embedding vector (just <em>one</em> token, not the whole sequence)</li>
  <li>Add positional embedding for position 4</li>
  <li>Run through transformer layers, attending to KV cache for [“The”, “ cat”, “ sat”, “ on”]</li>
  <li>Output logits → sample → “ the”</li>
</ol>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>KV cache: ["The", " cat", " sat", " on"]  ← one entry added
Current:  " on" (new last token)
Attention: " on" attends to all four previous tokens
Output probs:  [..., 0.67 (" the"), ...]  ← post-softmax
Sample: " the" ✓
</code></pre></div></div>

<p>And so it continues: “ mat” → “.” → <code class="language-plaintext highlighter-rouge">&lt;end&gt;</code> token → stop.</p>
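<p>The whole loop fits in a few lines once you squint past the model internals. Here <code>next_token</code> and the <code>CONTINUATION</code> table are stand-ins for the real forward pass plus sampling. The point is the shape of the loop: one token in, one KV entry appended, one token out, until the end marker:</p>

```python
# Toy autoregressive loop. CONTINUATION stands in for the entire model:
# real engines compute logits and sample; the shape of the loop is the point.
PROMPT = ["The", " cat", " sat"]
CONTINUATION = {" sat": " on", " on": " the", " the": " mat", " mat": ".", ".": "<end>"}

def next_token(last_token, kv_cache):
    kv_cache.append(last_token)       # exactly one new KV entry per step
    return CONTINUATION[last_token]   # stand-in for forward pass + sampling

kv_cache = PROMPT[:-1]                # prefill already cached the earlier tokens
output, token = [], PROMPT[-1]
while True:
    token = next_token(token, kv_cache)
    if token == "<end>":
        break
    output.append(token)

print("".join(output))   # " on the mat."
print(len(kv_cache))     # 7: every prompt and generated token is now cached
```

<p>Note what never happens here: the prompt is never reprocessed. Only the newest token runs through the model; everything earlier is served from the cache.</p>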

<figure style="max-width:720px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/key-value-cache-growth.png" alt="Key Value Cache Growth" style="width:100%;" />
</figure>

<h3 id="the-sampling-step-where-creativity-lives">The Sampling Step (Where Creativity Lives)</h3>

<p>The logits give you a probability distribution. How you sample from it is where temperature, top-k, and top-p come in:</p>

<ul>
  <li><strong>Greedy (temperature=0)</strong>: always pick the highest probability token. Deterministic. Boring for creative tasks, good for code.</li>
  <li><strong>Temperature &gt; 1</strong>: flatten the distribution. More randomness, more surprising outputs, more hallucinations.</li>
  <li><strong>Temperature &lt; 1</strong>: sharpen the distribution. More conservative, more predictable.</li>
  <li><strong>Top-k</strong>: only sample from the top K most probable tokens. Ignores the long tail.</li>
  <li><strong>Top-p (nucleus sampling)</strong>: only sample from the smallest set of tokens whose cumulative probability exceeds p. Adaptive — sometimes that’s 2 tokens, sometimes 50.</li>
</ul>

<p>This step is trivially cheap computationally but has enormous impact on output quality. As an SRE, you don’t usually tune this — but you will get bug reports when someone’s temperature=2.0 config makes the model output Shakespeare from a JSON API endpoint.</p>
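<p>All of these strategies are a few lines of work over the same probability vector. Here’s a sketch combining temperature and top-p; the function name and toy logits are mine, not lifted from any particular engine:</p>

```python
import math, random

def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    if temperature == 0:
        # Greedy: argmax, fully deterministic
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]   # <1 sharpens, >1 flattens
    m = max(scaled)                              # subtract max for stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]            # softmax
    # Nucleus: smallest set of tokens whose cumulative probability >= top_p
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

random.seed(0)
logits = [1.0, 3.0, 0.5, 2.0]          # toy 4-token vocabulary
print(sample(logits, temperature=0))   # 1: index of the highest logit
```

<p>With a tight top-p, a dominant token gets picked every time; loosen it and the long tail starts leaking through, which is exactly the creativity/hallucination dial.</p>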

<hr />

<h2 id="9-why-memory-is-the-decode-bottleneck">9. Why Memory Is the Decode Bottleneck</h2>

<p>Here’s the thing about decode that makes it hard to optimize: on every single decode step, the model needs to run attention against <strong>all previous tokens</strong>. Not a summary of them. All of them. Via the KV cache.</p>

<p>For a 1000-token conversation, every decode step reads 1000 rows of KV tensors from GPU memory. For a 32-layer model, that’s 32 reads of a large tensor. For a 7B model, the KV entry for a single token at a single layer is tens of kilobytes.</p>
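<p>The arithmetic is worth doing once. A back-of-envelope sketch, assuming Llama-2-7B-like dimensions (32 layers, 32 KV heads, head dimension 128, fp16); these dimensions are assumptions for illustration, not read from any specific checkpoint:</p>

```python
# Back-of-envelope KV cache sizing for a 7B-class model (assumed dims).
layers, kv_heads, head_dim, bytes_per_val = 32, 32, 128, 2

kv_per_token_per_layer = 2 * kv_heads * head_dim * bytes_per_val   # K and V
kv_per_token = layers * kv_per_token_per_layer

print(kv_per_token_per_layer // 1024)   # 16  -> ~16 KiB per token per layer
print(kv_per_token // 1024)             # 512 -> ~0.5 MiB per token, all layers

# A 1000-token conversation therefore re-reads roughly half a gigabyte of
# KV tensors from HBM on *every* decode step.
print(f"{1000 * kv_per_token / 1e9:.2f} GB per decode step")
```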

<p>The GPU’s compute cores can execute these operations fast. But they’re waiting on HBM (High Bandwidth Memory) to deliver the data. The memory bus saturates before the compute does.</p>

<p><strong>The GPU is memory-bound during decode, not compute-bound.</strong></p>

<p>This is why adding more CUDA cores doesn’t help decode performance as much as adding more memory bandwidth. It’s why H100s are faster than A100s for serving despite similar compute specs — the memory bandwidth jump matters more for decode than the FLOP count.</p>

<p>A rough intuition: during prefill, GPU utilization is high and memory is barely stressed. During decode, GPU compute is mostly idle and the memory bus is screaming.</p>
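<p>You can put a rough ceiling on single-request decode speed with a roofline-style estimate: every decode step streams the full weight set (plus KV cache) through HBM, so tokens/sec cannot exceed bandwidth divided by bytes moved. The bandwidth figures below are approximate spec-sheet values used for illustration:</p>

```python
# Roofline-style decode ceiling: tokens/sec <= bandwidth / bytes_read_per_step.
# Ignores KV cache reads and kernel overheads, so real numbers are lower.
weights_gb = 14.0          # 7B params in fp16 ~ 14 GB

for name, bw_gbs in [("A100 (HBM2e, ~2.0 TB/s)", 2000),
                     ("H100 (HBM3,  ~3.3 TB/s)", 3350)]:
    ceiling = bw_gbs / weights_gb
    print(f"{name}: <= {ceiling:.0f} tokens/s per request (ideal)")
```

The FLOP budget at those speeds is nowhere near exhausted, which is the "compute mostly idle, memory bus screaming" picture in numbers.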

<hr />

<h2 id="10-kv-cache-the-most-important-data-structure-in-inference">10. KV Cache: The Most Important Data Structure in Inference</h2>

<p>We keep mentioning the KV cache. Let’s make it concrete.</p>

<p>During the attention step in each transformer layer, every token produces two vectors: a <strong>Key</strong> (K) and a <strong>Value</strong> (V). These are used in attention computation: other tokens use your Key to decide how much to attend to you, and then use your Value to extract information from you.</p>

<p>During decode, token N needs to compute attention against tokens 1 through N-1. If we recomputed K and V for all previous tokens on every step, we’d be doing O(N²) work per decode step. That’s catastrophic.</p>

<p>The KV cache solves this: after computing K and V for a token, we <strong>store</strong> them. On the next decode step, we only compute K and V for the new token and look up the rest from cache.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>After prefill of "The cat sat":
KV cache = {
  layer_0: { K: [k_The, k_cat, k_sat], V: [v_The, v_cat, v_sat] }
  layer_1: { K: [...], V: [...] }
  ...  (32 layers total)
}

After generating " on":
KV cache = {
  layer_0: { K: [k_The, k_cat, k_sat, k_on], V: [v_The, v_cat, v_sat, v_on] }
  ...
}
← one new entry appended per layer per decode step
</code></pre></div></div>

<p>The cache grows with every token generated. When the KV cache fills GPU memory:</p>
<ul>
  <li>New requests queue (they have nowhere to store their KV tensors)</li>
  <li>Long requests get partially evicted and have to recompute (latency spike)</li>
  <li>In the worst case: OOM crash</li>
</ul>

<p>KV cache occupancy is the single most important resource to monitor in a serving system. It determines how many concurrent requests you can serve and how long those requests can be. When <code class="language-plaintext highlighter-rouge">vllm:gpu_cache_usage_perc</code> starts approaching 0.9, you’re about to have a bad time.</p>
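<p>The capacity math is simple enough to do on a napkin. A sketch with illustrative, assumed numbers (7B fp16 model on an 80 GB GPU, ~0.5 MiB of KV per token as estimated earlier for a 32-layer model):</p>

```python
# How many tokens of KV cache fit in the memory left after weights + overhead?
# All inputs are illustrative assumptions, not measurements.
gpu_gb, weights_gb, overhead_gb = 80, 14, 6
kv_per_token_mib = 0.5

free_gb = gpu_gb - weights_gb - overhead_gb
max_tokens = int(free_gb * 1024 / kv_per_token_mib)
print(max_tokens)                      # ~120k tokens of KV capacity

# At ~2k tokens per conversation, that bounds concurrency at roughly:
print(max_tokens // 2000)
```

This is the number that silently caps your concurrency, and why "add more replicas" is often the answer long before the GPU's compute is busy.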

<h3 id="prefix-caching-free-speedups">Prefix Caching: Free Speedups</h3>

<p>If two requests share the same system prompt, their KV cache entries for that prefix are identical. Prefix caching stores those entries once and reuses them.</p>

<p>In practice on my M4 Mac Mini:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Without prefix cache (cold):  TTFT 753ms
With prefix cache (warm):     TTFT 367ms
Savings:                       51% reduction in TTFT
Zero model changes required.
</code></pre></div></div>

<p>This is why your production system prompt should sit at the beginning of every request, and why it should be identical byte-for-byte across requests. Drift in the system prompt = cache misses = higher TTFT = sad users.</p>

<hr />

<h2 id="11-attention-the-mechanism-that-makes-it-work">11. Attention: The Mechanism That Makes It Work</h2>

<p>Let’s go one level deeper into what happens at each transformer layer. Attention is the core operation. Everything else is bookkeeping.</p>

<h3 id="self-attention-in-plain-english">Self-Attention in Plain English</h3>

<p>For each token, attention computes: <em>“which other tokens should I be paying attention to, and by how much?”</em></p>

<p>It does this via three learned projections of each token’s embedding:</p>
<ul>
  <li><strong>Query (Q)</strong>: “what am I looking for?”</li>
  <li><strong>Key (K)</strong>: “what do I offer?”</li>
  <li><strong>Value (V)</strong>: “what information do I contain?”</li>
</ul>

<p>For each token, you compute the dot product of its Query with every other token’s Key. High dot product = high attention score = attend more to that token. The scores are normalized via softmax, then used to compute a weighted sum of all the Value vectors.</p>

<p>Example with “The cat sat”:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Token: " sat"
Q_sat · K_The  = 0.8  → attend heavily to "The"
Q_sat · K_cat  = 0.9  → attend most to " cat" (makes sense)
Q_sat · K_sat  = 0.3  → attend a little to itself

Attention weights after softmax: [0.35, 0.55, 0.10]

Output = 0.35 × V_The + 0.55 × V_cat + 0.10 × V_sat
</code></pre></div></div>

<p>The output for “ sat” is now a blend of information from all tokens, weighted by relevance. After 32 such layers, the model has a rich, contextualized representation of every token in the sequence.</p>
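<p>The whole worked example above is just softmax(Q K^T / sqrt(d)) V. A minimal single-head numpy sketch with random toy vectors (no learned projections, no causal mask):</p>

```python
import numpy as np

def attention(Q, K, V):
    """Single-head scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)          # one score per (query, key) pair
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)       # softmax -> attention weights
    return w @ V                           # weighted sum of Value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # 3 tokens ("The", " cat", " sat"), dim 8
K = rng.normal(size=(3, 8))
V = rng.normal(size=(3, 8))
out = attention(Q, K, V)
print(out.shape)              # (3, 8): one contextualized vector per token
```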

<figure style="max-width:720px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/attention-mechanism.png" alt="Attention Mechanism" style="width:100%;" />
</figure>

<h3 id="paged-attention-virtual-memory-for-kv-cache">Paged Attention: Virtual Memory for KV Cache</h3>

<p>Standard attention assumes the KV cache is one contiguous block of memory per request. This is wasteful: you have to pre-allocate the maximum possible sequence length upfront, and if generation stops short of that maximum, the unused slots sit reserved, wasted, until the request completes.</p>

<p><strong>PagedAttention</strong> (the key innovation in vLLM) borrows from OS virtual memory. Instead of one contiguous block, KV cache is stored in <strong>fixed-size pages</strong> (blocks) that can be non-contiguous in physical GPU memory. A page table maps logical token positions to physical blocks.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Without PagedAttention:
  Request A: [KKKKKKKKKK........] ← pre-allocated 20 slots, using 10, wasting 10
  Request B: [KKKKKK............] ← pre-allocated 20 slots, using 6, wasting 14

With PagedAttention:
  Block pool: [B1][B2][B3][B4][B5][B6][B7][B8]...
  Request A:  blocks B1, B3, B7 (non-contiguous, allocated on demand)
  Request B:  blocks B2, B4    (uses only what it needs)
</code></pre></div></div>

<p>The result: <strong>much higher memory utilization</strong>, more concurrent requests, less waste. Prefix cache blocks can be shared between requests with identical prefixes — only one copy of the system prompt’s KV entries needed, regardless of how many requests use it.</p>

<p>PagedAttention is why vLLM typically serves 2-4x more concurrent requests than a naive implementation on the same hardware.</p>
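<p>The page-table idea fits in a few lines. A toy allocator sketch (simplified; not vLLM's actual block manager, and a tiny block size of 4 just to keep the trace readable):</p>

```python
# Toy PagedAttention-style block table: logical token positions map to
# fixed-size physical blocks allocated on demand from a shared pool.
BLOCK = 4

class BlockTable:
    def __init__(self, pool):
        self.pool = pool          # free physical block ids, shared across requests
        self.blocks = []          # logical block index -> physical block id

    def append_token(self, pos):
        if pos // BLOCK >= len(self.blocks):   # crossed into a new block?
            self.blocks.append(self.pool.pop(0))

    def physical(self, pos):
        return (self.blocks[pos // BLOCK], pos % BLOCK)

pool = list(range(8))             # 8 physical blocks: B0..B7
a, b = BlockTable(pool), BlockTable(pool)
a.append_token(0)                 # A starts: claims physical block 0
b.append_token(0)                 # B starts: claims physical block 1
for pos in range(1, 6):
    a.append_token(pos)           # A grows past block 0: claims block 2
print(a.blocks, b.blocks)         # [0, 2] [1]: A's blocks are non-contiguous
print(a.physical(5))              # (2, 1): token 5 lives in block 2, slot 1
```

Nothing is pre-allocated: each request holds exactly the blocks it has filled, and a freed block goes straight back into the shared pool.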

<hr />

<h2 id="12-continuous-batching-the-throughput-unlock">12. Continuous Batching: The Throughput Unlock</h2>

<p>Early inference servers were naive: accept a batch of requests, run them all through the model together, return all responses. Simple.</p>

<p>The problem: requests finish at different times. Short requests had to wait for long requests to complete before the next batch could start. GPU utilization looked like a sawtooth wave.</p>

<p><strong>Continuous batching</strong> (also called iteration-level scheduling) fixes this. Instead of batching at the request level, the inference engine batches at the <strong>decode step</strong> level. Every iteration, it assembles the currently active tokens — some mid-generation, some just starting — into a single GPU operation.</p>

<p>When a request finishes, its slot is immediately available for a new request. When a new request arrives, it joins the active batch at the next iteration rather than waiting for the next batch boundary.</p>

<p>The result: GPU utilization stays high, latency for new requests is low, and throughput scales with the number of concurrent requests the KV cache can support — not with some fixed batch size parameter you tuned last Tuesday.</p>

<p>vLLM, TGI, and TensorRT-LLM all implement continuous batching. Ollama does not (as of early 2025). This is one of the primary reasons vLLM serves at 2x the throughput of Ollama under concurrent load.</p>
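<p>The scheduling loop itself is the key idea, and it is small. A toy iteration-level scheduler (a sketch of the concept, not any engine's real code; request ids and lengths are made up):</p>

```python
from collections import deque

# Every iteration the active set is rebuilt: finished requests leave
# immediately and queued ones join, instead of waiting for a batch to drain.
MAX_ACTIVE = 2
waiting = deque([("A", 3), ("B", 1), ("C", 2)])   # (request id, tokens to generate)
active, done, step = {}, [], 0

while waiting or active:
    while waiting and len(active) < MAX_ACTIVE:   # admit new work each iteration
        rid, need = waiting.popleft()
        active[rid] = need
    for rid in list(active):                      # one decode step for everyone
        active[rid] -= 1
        if active[rid] == 0:
            del active[rid]                       # slot frees *immediately*
            done.append((rid, step))
    step += 1

print(done)   # [('B', 0), ('A', 2), ('C', 2)]: C started the moment B finished
```

With request-level batching, C would have waited until both A and B were done; here it joins at step 1, one iteration after B's slot opens.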

<hr />

<h2 id="13-the-metrics-that-matter">13. The Metrics That Matter</h2>

<p>Now that you know what’s happening, the metrics become obvious rather than mysterious.</p>

<figure style="max-width:900px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/grafana-metrics.png" alt="Grafana Metrics Dashboard" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">Real metrics from a running llm-d deployment: TTFT p50 at 15ms, ITL p50 at 5ms, KV cache prefix hit rate at 80.6% — exactly the numbers you should have on your wall during an incident.</figcaption>
</figure>

<h3 id="ttft--time-to-first-token">TTFT — Time to First Token</h3>

<p><strong>Definition:</strong> Wall clock time from request submission to first output token.</p>

<p><strong>What it captures:</strong> Prefill time + queuing time. If your GPU is busy, TTFT absorbs the wait.</p>

<p><strong>PromQL:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>histogram_quantile(0.95,
  sum(rate(vllm:time_to_first_token_seconds_bucket[2m])) by (le)
)
</code></pre></div></div>

<p><strong>SLO guidance:</strong> &lt; 500ms for interactive chat. &lt; 2s is tolerable. &gt; 5s and users think it’s broken.</p>

<p><strong>Root causes when high:</strong></p>
<ul>
  <li>Long prompts (expected — prefill scales with length)</li>
  <li>GPU under heavy load, requests queuing</li>
  <li>Insufficient prefill capacity (in P/D disaggregated setups)</li>
</ul>

<hr />

<h3 id="itl--inter-token-latency-aka-tpot">ITL — Inter-Token Latency (aka TPOT)</h3>

<p><strong>Definition:</strong> Average time between consecutive output tokens during decode. Inverse of token generation speed.</p>

<p><strong>What it captures:</strong> Decode throughput per active request. Memory bandwidth is the primary lever.</p>

<p><strong>PromQL:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>histogram_quantile(0.95,
  sum(rate(vllm:time_per_output_token_seconds_bucket[2m])) by (le)
)
</code></pre></div></div>

<p><strong>SLO guidance:</strong> &lt; 30ms is fast, streaming feels smooth. &gt; 100ms and you notice the typewriter effect.</p>

<p><strong>Root causes when high:</strong></p>
<ul>
  <li>Too many concurrent requests (KV cache reads competing)</li>
  <li>Large model + small GPU memory bandwidth</li>
  <li>KV cache approaching capacity</li>
</ul>
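<p>Both TTFT and ITL fall out directly from per-token arrival timestamps on a streamed response; the server-side <code class="language-plaintext highlighter-rouge">vllm:*</code> histograms aggregate the same quantities. A sketch with hypothetical timestamps:</p>

```python
# Deriving TTFT and ITL from one streamed response's token arrival times.
def ttft_and_itl(request_sent_s, token_arrival_s):
    ttft = token_arrival_s[0] - request_sent_s          # queue wait + prefill
    gaps = [b - a for a, b in zip(token_arrival_s, token_arrival_s[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0        # mean decode-step gap
    return ttft, itl

# Hypothetical timestamps (seconds): first token at +0.45s, then ~30ms apart.
sent = 10.00
arrivals = [10.45, 10.48, 10.51, 10.54, 10.57]
ttft, itl = ttft_and_itl(sent, arrivals)
print(f"TTFT {ttft*1000:.0f} ms, ITL {itl*1000:.0f} ms")
```

Measuring this client-side is a useful cross-check: if client TTFT is much worse than the server histogram, the gap lives in your network path or gateway, not in the engine.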

<hr />

<h3 id="kv-cache-hit-ratio">KV Cache Hit Ratio</h3>

<p><strong>Definition:</strong> Fraction of prompt tokens whose KV vectors were served from prefix cache vs recomputed.</p>

<p><strong>What it captures:</strong> Effectiveness of prefix caching. High hit ratio = lower TTFT for repeated system prompts.</p>

<p><strong>PromQL:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vllm:gpu_prefix_cache_hit_rate
</code></pre></div></div>

<p><strong>Target:</strong> &gt; 0.5 for most chat workloads with consistent system prompts. Near 0 means your prompts are fully unique (batch processing, no shared prefix).</p>

<hr />

<h3 id="kv-cache-usage">KV Cache Usage</h3>

<p><strong>Definition:</strong> Fraction of total KV cache capacity currently occupied.</p>

<p><strong>PromQL:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vllm:gpu_cache_usage_perc
</code></pre></div></div>

<p><strong>Alert threshold:</strong> &gt; 0.85. At 0.9+, vLLM starts evicting in-progress requests, causing recomputation and latency spikes. At 1.0, new requests queue entirely.</p>

<hr />

<h3 id="scaling-strategy-by-traffic-shape">Scaling Strategy by Traffic Shape</h3>

<p>This is where the SRE work actually lives:</p>

<table>
  <thead>
    <tr>
      <th>Traffic Pattern</th>
      <th>Symptom</th>
      <th>Action</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Many short prompts, many users</td>
      <td>TTFT fine, ITL rising, KV cache full</td>
      <td>Scale out (more replicas), or reduce <code class="language-plaintext highlighter-rouge">max_model_len</code></td>
    </tr>
    <tr>
      <td>Few long prompts, long outputs</td>
      <td>TTFT high, ITL high, KV cache full fast</td>
      <td>Larger GPU memory, or P/D disaggregation</td>
    </tr>
    <tr>
      <td>Bursty traffic, idle baseline</td>
      <td>P99 TTFT spikes, P50 fine</td>
      <td>Horizontal scaling + request queuing</td>
    </tr>
    <tr>
      <td>Consistent system prompt across requests</td>
      <td>High TTFT on cold start only</td>
      <td>Enable prefix caching (on by default in recent vLLM releases; opt-in on older ones)</td>
    </tr>
    <tr>
      <td>Mixed short and long context</td>
      <td>Unpredictable KV usage</td>
      <td>Set per-request <code class="language-plaintext highlighter-rouge">max_tokens</code> limits strictly</td>
    </tr>
  </tbody>
</table>

<p><strong>The strategic insight:</strong> short prompt, many concurrent users → <strong>decode is the bottleneck</strong>, optimize for memory bandwidth and parallelism across requests. Long context, few users → <strong>prefill is the bottleneck</strong> and KV cache pressure is the constraint; P/D disaggregation helps by giving prefill its own GPU.</p>

<hr />

<h2 id="14-where-does-the-time-actually-go">14. Where Does the Time Actually Go?</h2>

<p>After running these experiments on real hardware, here’s the honest answer:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Typical inference request (short prompt, moderate load):

Tokenization:          ~0.5ms   (CPU, negligible)
Embedding lookup:      ~1ms     (GPU memory read)
Prefill (32 layers):   ~40ms    (GPU compute, scales with prompt length)
First token decode:    ~3-5ms   (GPU memory read, KV cache write)
...each subsequent token: ~3-5ms

Total TTFT:            ~45ms under no load
Total TTFT at p95:     300-500ms under load (queuing dominates)
</code></pre></div></div>

<p><strong>Where load makes it worse:</strong></p>

<ol>
  <li><strong>Queuing before prefill starts</strong>: your request sits behind other long prefills. TTFT absorbs the entire queue wait.</li>
  <li><strong>KV cache contention during decode</strong>: more concurrent requests = more KV cache reads per step = higher ITL for everyone.</li>
  <li><strong>Memory fragmentation</strong>: without PagedAttention, wasted KV cache blocks reduce effective concurrency.</li>
</ol>

<p><strong>What’s predicted to improve this:</strong></p>

<ul>
  <li><strong>Speculative decoding</strong>: a small “draft” model generates 4-5 tokens speculatively; the large model verifies them in one forward pass. If accepted, 4 tokens for the price of ~1.5 forward passes. Reduces ITL dramatically under low concurrency, hurts at high concurrency (wasted draft compute).</li>
  <li><strong>P/D disaggregation</strong>: dedicated prefill GPUs handle prompt processing, dedicated decode GPUs handle generation. Eliminates the resource contention between phases. Requires fast interconnect (NVLink or RDMA) for KV transfer.</li>
  <li><strong>Flash Attention 3</strong>: kernel-level optimization that keeps attention computation in SRAM longer, reducing HBM reads. Already default in vLLM on H100.</li>
</ul>
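<p>The speculative decoding trade-off can be quantified. Under the common simplifying assumption that each draft token is accepted independently with probability alpha, the expected number of tokens produced per target-model verification pass with draft length k is (1 − alpha^(k+1)) / (1 − alpha); the geometric series reflects stopping at the first rejection, plus one token from the target model itself:</p>

```python
# Expected tokens per target-model forward pass in speculative decoding,
# under the i.i.d. acceptance assumption (a simplification; real acceptance
# rates vary by position and prompt).
def expected_tokens(alpha, k):
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.6, 0.8, 0.9):
    print(f"alpha={alpha}: {expected_tokens(alpha, k=4):.2f} tokens/verify pass")
```

At alpha = 0.8 and k = 4 you get about 3.4 tokens per verification pass, which is where the "4 tokens for ~1.5 forward passes" intuition comes from; at low acceptance rates the draft compute is mostly wasted.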

<hr />

<h2 id="15-the-inference-engines-worth-knowing">15. The Inference Engines Worth Knowing</h2>

<p>Not all inference engines are created equal, and the right tool depends on your constraints.</p>

<figure style="max-width:800px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/llm-inference-engine-comparison.png" alt="LLM Inference Engines Comparison" style="width:100%;" />
</figure>

<h3 id="ollama">Ollama</h3>
<ul>
  <li><strong>Origin:</strong> Open source, community</li>
  <li><strong>Strengths:</strong> Dead simple to run, supports Apple Silicon natively, GGUF format</li>
  <li><strong>Weaknesses:</strong> No continuous batching (as of early 2025), worse throughput under concurrent load</li>
  <li><strong>When to use:</strong> Local development, single-user experiments, quick model testing</li>
  <li><strong>When not to use:</strong> Serving more than one user concurrently</li>
  <li><strong>GitHub:</strong> <a href="https://github.com/ollama/ollama">github.com/ollama/ollama</a></li>
</ul>

<h3 id="vllm">vLLM</h3>
<ul>
  <li><strong>Origin:</strong> UC Berkeley, now a major open-source project with significant industry contributors</li>
  <li><strong>Strengths:</strong> PagedAttention, continuous batching, Prometheus metrics out of the box, P/D disaggregation via llm-d</li>
  <li><strong>Weaknesses:</strong> More complex setup, CUDA-first (Apple Silicon support via vllm-metal is experimental)</li>
  <li><strong>When to use:</strong> Production serving, multi-user, research with real load</li>
  <li><strong>When not to use:</strong> You just want to run one model locally and don’t want to think about it</li>
  <li><strong>GitHub:</strong> <a href="https://github.com/vllm-project/vllm">github.com/vllm-project/vllm</a></li>
</ul>

<h3 id="tgi-text-generation-inference">TGI (Text Generation Inference)</h3>
<ul>
  <li><strong>Origin:</strong> HuggingFace</li>
  <li><strong>Strengths:</strong> First-class support for new HuggingFace models, FlashAttention, tensor parallelism</li>
  <li><strong>Weaknesses:</strong> Slower to adopt innovations than vLLM, somewhat opinionated config</li>
  <li><strong>When to use:</strong> You’re already in the HuggingFace ecosystem and want good defaults</li>
  <li><strong>GitHub:</strong> <a href="https://github.com/huggingface/text-generation-inference">github.com/huggingface/text-generation-inference</a></li>
</ul>

<h3 id="tensorrt-llm">TensorRT-LLM</h3>
<ul>
  <li><strong>Origin:</strong> NVIDIA</li>
  <li><strong>Strengths:</strong> Best possible performance on NVIDIA hardware, optimized kernels, inference graph compilation</li>
  <li><strong>Weaknesses:</strong> NVIDIA-only, complex setup, compiled engines are model+hardware-specific (can’t move them)</li>
  <li><strong>When to use:</strong> You have a fixed model, fixed NVIDIA hardware, and need maximum performance</li>
  <li><strong>When not to use:</strong> You want flexibility, you’re running experiments, or you don’t own your hardware</li>
  <li><strong>GitHub:</strong> <a href="https://github.com/NVIDIA/TensorRT-LLM">github.com/NVIDIA/TensorRT-LLM</a></li>
</ul>

<h3 id="the-meta--research-options">The META / Research Options</h3>
<ul>
  <li><strong>llama.cpp</strong> — <a href="https://github.com/ggerganov/llama.cpp">github.com/ggerganov/llama.cpp</a>: The CPU-first runtime. Runs quantized models on CPU, reasonably fast, the ancestor of Ollama.</li>
  <li><strong>ExLlamaV2</strong> — <a href="https://github.com/turboderp-org/exllamav2">github.com/turboderp-org/exllamav2</a>: Highly optimized for RTX GPUs specifically. EXL2 quantization format is more sophisticated than GPTQ or AWQ — per-layer bit allocation instead of uniform quantization.</li>
  <li><strong>MLC-LLM</strong> — <a href="https://github.com/mlc-ai/mlc-llm">github.com/mlc-ai/mlc-llm</a>: Cross-platform (compiles to CUDA, Metal, Vulkan). Good for deploying to diverse hardware.</li>
</ul>

<hr />

<h2 id="16-the-languages-behind-it-all">16. The Languages Behind It All</h2>

<p>If you open the source code of a modern inference engine, here’s what you’ll find:</p>

<p><strong>Python</strong>: The top layer. API server, request handling, scheduling logic, metric collection. vLLM’s scheduler and OpenAI-compatible API are Python. This is also where most bugs live.</p>

<p><strong>CUDA (C++)</strong>: The performance layer. Attention kernels, memory management, GPU operations. Flash Attention is CUDA. PagedAttention’s physical block management is CUDA. If Python is the restaurant, CUDA is the kitchen.</p>

<p><strong>Rust</strong>: The fast utilities layer. HuggingFace’s <code class="language-plaintext highlighter-rouge">tokenizers</code> library is Rust — because tokenizing millions of requests fast matters. NIXL (the KV cache transfer layer in llm-d) has Rust/C++ components. Growing presence in inference tooling.</p>

<p><strong>Go</strong>: The orchestration layer. Kubernetes operators, control plane tooling, health checks, routing logic. If you’re writing infrastructure <em>around</em> inference — operators, routers, schedulers — Go is where the work happens.</p>

<p><strong>C++ (non-CUDA)</strong>: llama.cpp is pure C++ with optional CUDA/Metal backends. TensorRT-LLM has heavy C++ in the engine.</p>

<p><strong>For SREs/DevOps:</strong> You live in Python (scripts, load tests), Go (operators, K8s controllers), and YAML (unfortunately). CUDA is worth being able to <em>read</em> — not write, just understand why a kernel fusion matters and what a grid/block size means. Rust fluency is a genuine differentiator if you want to contribute upstream.</p>

<hr />

<h2 id="17-what-cpu-vs-memory-intensive-means-summarized">17. What CPU vs. Memory Intensive Means, Summarized</h2>

<p>After all of the above, here’s the clean summary of which hardware resource each step stresses:</p>

<table>
  <thead>
    <tr>
      <th>Step</th>
      <th>Hardware</th>
      <th>Why</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Tokenization</td>
      <td>CPU</td>
      <td>Text processing, hash lookups, BPE merges</td>
    </tr>
    <tr>
      <td>Embedding lookup</td>
      <td>GPU memory</td>
      <td>Row lookup in large weight matrix</td>
    </tr>
    <tr>
      <td>Positional encoding</td>
      <td>GPU compute</td>
      <td>Fast arithmetic, small matrix</td>
    </tr>
    <tr>
      <td>Prefill (attention + FFN)</td>
      <td>GPU compute</td>
      <td>Matrix multiplications, all tokens in parallel</td>
    </tr>
    <tr>
      <td>Decode attention</td>
      <td>GPU memory bandwidth</td>
      <td>KV cache read per step, memory-bound</td>
    </tr>
    <tr>
      <td>Decode FFN</td>
      <td>GPU compute</td>
      <td>Weight matrix multiply per step</td>
    </tr>
    <tr>
      <td>KV cache management</td>
      <td>GPU memory</td>
      <td>Allocation, paging, eviction</td>
    </tr>
    <tr>
      <td>Sampling (logit → token)</td>
      <td>GPU compute</td>
      <td>Softmax + sample, fast</td>
    </tr>
  </tbody>
</table>

<p>The pattern: <strong>prefill is compute-bound, decode is memory-bound</strong>. This is the fundamental tension that all inference optimization — P/D disaggregation, speculative decoding, PagedAttention, quantization — is ultimately trying to resolve.</p>

<hr />

<h2 id="18-what-an-sre-actually-needs-to-know">18. What an SRE Actually Needs to Know</h2>

<p>You don’t need to write CUDA. You don’t need to derive the attention formula. But you do need a mental model that lets you answer these questions in production:</p>

<ul>
  <li><strong>Why is TTFT high?</strong> → Prefill bottleneck or queuing. Long prompts? GPU saturated?</li>
  <li><strong>Why is ITL degrading?</strong> → KV cache pressure. Too many concurrent requests. Memory bandwidth saturating.</li>
  <li><strong>Why did the GPU OOM?</strong> → KV cache exhausted. Too many long requests, no eviction headroom.</li>
  <li><strong>Why is throughput low?</strong> → No continuous batching. Poor concurrency config. Batch size too small.</li>
  <li><strong>Why does prefix caching not help?</strong> → System prompt is changing per-request. Fix the app layer.</li>
  <li><strong>Which GPU should I buy?</strong> → For inference: memory bandwidth matters more than FLOP count. H100 &gt; A100 for serving not because of compute but because of HBM3 bandwidth.</li>
  <li><strong>Why is quantization worth it?</strong> → A 4-bit model serves faster (decode is memory-bound, smaller tensors = faster reads) and fits on cheaper hardware. Quality loss is usually acceptable.</li>
</ul>

<p>The mental model in one sentence: <strong>LLM inference is split between a compute-hungry prefill phase and a memory-hungry decode phase, connected by a KV cache that is your most critical resource to manage.</strong></p>

<p>Everything else — PagedAttention, continuous batching, P/D disaggregation, speculative decoding, quantization — is an optimization layered on top of that fundamental structure.</p>

<hr />

<h2 id="the-summary-a-mental-model-for-production">The Summary: A Mental Model for Production</h2>

<p>If you’ve made it this far, you’ve realized that LLM inference isn’t magic; it’s a measurable, optimizable systems engineering challenge. The execution pipeline breaks down into three distinct phases with unique bottlenecks: <strong>Tokenization</strong> (CPU-bound), <strong>Prefill</strong> (GPU compute-bound), and <strong>Decode</strong> (GPU memory-bound).</p>

<p>To help you debug that 2:00 AM latency spike, here is the final synthesis of the mechanics we’ve covered:</p>

<h3 id="the-war-room-reference-table">The “War Room” Reference Table</h3>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Phase</th>
      <th style="text-align: left">Core Mechanism</th>
      <th style="text-align: left">Primary Bottleneck</th>
      <th style="text-align: left">SRE Metric to Watch</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><strong>Ingress</strong></td>
      <td style="text-align: left"><strong>Tokenization</strong></td>
      <td style="text-align: left">CPU / Latency</td>
      <td style="text-align: left">Tokenizer Latency</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Processing</strong></td>
      <td style="text-align: left"><strong>Prefill</strong></td>
      <td style="text-align: left">GPU Compute (FLOPs)</td>
      <td style="text-align: left"><strong>TTFT</strong> (Time to First Token)</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Generation</strong></td>
      <td style="text-align: left"><strong>Decode Loop</strong></td>
      <td style="text-align: left">Memory Bandwidth</td>
      <td style="text-align: left"><strong>ITL/TPOT</strong> (Inter-token Latency)</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>State</strong></td>
      <td style="text-align: left"><strong>KV Cache</strong></td>
      <td style="text-align: left">VRAM Capacity</td>
      <td style="text-align: left">Cache Usage % &amp; Hit Rate</td>
    </tr>
  </tbody>
</table>

<h3 id="the-final-principles-your-tldr">The Final Principles (Your TL;DR)</h3>

<ul>
  <li><strong>Weights are Static Data</strong>: The GGUF or Safetensors file is essentially a massive, math-heavy lookup table. You aren’t executing code; you are performing matrix math against frozen numbers.</li>
  <li><strong>Quantization is a Free Lunch</strong>: Lowering precision (e.g., to INT4) reduces tensor size, which directly loosens the memory bottleneck during decoding, often improving throughput by <strong>20-40%</strong>.</li>
  <li><strong>The KV Cache is Your Most Precious Resource</strong>: It prevents $O(N^2)$ recomputation by storing token state. Managing this via <strong>PagedAttention</strong> and <strong>Prefix Caching</strong> is what separates a toy demo from a production-grade service.</li>
  <li><strong>Attention is the Relationship Engine</strong>: It uses Queries, Keys, and Values to calculate which tokens matter to each other. It’s why the model understands context, but it’s also why memory pressure scales with your prompt length.</li>
  <li><strong>Continuous Batching is the Efficiency Unlock</strong>: By batching at the token level (each decode iteration) rather than the request level, you keep the GPU busy even when individual users have wildly different response lengths.</li>
</ul>

<p>The ultimate constraint in scaling model serving isn’t raw compute power, but memory bandwidth and the strict management of the KV Cache. By understanding the mechanical realities beneath the hood—like PagedAttention, continuous batching, and quantization—infrastructure engineers can move past guesswork and systematically optimize for the metrics that dictate user experience: <strong>TTFT</strong> and <strong>TPOT</strong>.</p>

<h2 id="conclusion-the-field-is-moving-the-fundamentals-arent">Conclusion: The Field Is Moving, The Fundamentals Aren’t</h2>

<p>The implementation details of LLM inference in 2025/2026 are changing fast enough to give you whiplash, but the underlying physics of the problem haven’t moved an inch. Prefill will always be compute-hungry. Decode will always be memory-hungry. Attention will always scale quadratically with context length unless someone breaks the math. These aren’t framework quirks; they are the mechanical realities of how transformers work.</p>

<p>Transitioning to LLMOps requires a fundamental shift in how we manage system state. We are no longer scaling stateless pods; we are actively managing distributed GPU memory. The engineering headroom to optimize this is enormous, and the landscape is shifting rapidly:</p>

<ul>
  <li>
    <p><strong>The model size curve is bending:</strong> Smaller, highly-optimized 7B models are now punching above the weight of older 70B giants, distributing the inference problem across a much wider variety of hardware.</p>
  </li>
  <li>
    <p><strong>The memory bottleneck is softening:</strong> Bleeding-edge KV cache compression techniques are reducing per-token memory footprint from kilobytes down to a handful of bytes, loosening the strict constraints of the decode phase.</p>
  </li>
  <li>
    <p><strong>The edge is becoming viable:</strong> As inference pushes to mobile NPUs and WebGPU, serving a 3B parameter model starts to look less like a Kubernetes workload and more like a firmware binary.</p>
  </li>
</ul>

<p>Yet, the operational reality remains unchanged. When a model is in production and serving real traffic, someone has to know why TTFT spiked at 2am, why the KV cache hit 95% utilization under Tuesday’s load, and why the p99 ITL is three times the p50.</p>

<p>Mastering the mechanics of inference separates a fragile AI prototype from a resilient production platform. The inference stack will keep evolving. The need for a rigorous mental model to debug it won’t.</p>

<hr />

<p><strong>Next up:</strong> In the next post, we’ll take these first principles and see how they dictate the architectural trade-offs behind the major inference engines: Ollama, vLLM, TGI, and TensorRT-LLM.</p>

<hr />

<p><em>I’m an infrastructure engineer with 11+ years in distributed systems (D-Wave, Enbala, MasterCard, Cisco), currently going deep on LLM serving and inference optimization. This series is grounded in hands-on experiments — Mac Mini to Lambda Labs GH200 to RunPod A100 clusters. I write what I actually learned, including the parts that didn’t work.</em></p>

<p><em>Find me on GitHub: <a href="https://github.com/kraghavan">kraghavan</a></em></p>

<p><em>Find me on Linkedin: <a href="https://linkedin.com/in/karthikaraghavan">Karthika Raghavan</a></em></p>]]></content><author><name>Karthika Raghavan</name></author><category term="llm-infrastructure" /><category term="inference" /><category term="vllm" /><category term="inference" /><category term="ttft" /><category term="tpot" /><category term="kv-cache" /><category term="attention" /><category term="tokenization" /><category term="sre" /><category term="transformers" /><summary type="html"><![CDATA[An Engineer's annotated tour through what actually happens when you hit send — from bytes to tokens to embeddings to attention to the word your model finally spits out. No skipped steps. No "and then magic happens."]]></summary></entry><entry><title type="html">Schema Travels Architecture</title><link href="https://kraghavan.github.io/2026/04/06/schema-travels-architecture.html" rel="alternate" type="text/html" title="Schema Travels Architecture" /><published>2026-04-06T00:00:00+00:00</published><updated>2026-04-06T00:00:00+00:00</updated><id>https://kraghavan.github.io/2026/04/06/schema-travels-architecture</id><content type="html" xml:base="https://kraghavan.github.io/2026/04/06/schema-travels-architecture.html"><![CDATA[<h1 id="translating-sql-to-nosql-architecture-deep-dive">Translating SQL to NoSQL: Architecture Deep-Dive</h1>

<p><em>Part 1 of 2: Design decisions, trade-offs, and algorithms behind schema-travels</em></p>

<hr />

<h2 id="why-i-built-this">Why I Built This</h2>

<p>Migrating from a relational SQL database to a NoSQL paradigm is notoriously difficult to automate. Every DBA has war stories: the naive migration that turned a 3-table JOIN into three round trips, or the “just flatten everything” approach that created 50GB documents.</p>

<p>The problem I chose: <strong>How do you automate a context-aware SQL-to-NoSQL schema migration without relying on raw, hallucination-prone LLM outputs?</strong></p>

<p>The key insight: algorithms are excellent at graph clustering; LLMs are not. But algorithms lack business context. So I built a dual-engine system where deterministic algorithms do the math, and an LLM acts as a reviewing Principal Architect—bounded, structured, and unable to hallucinate schemas into existence.</p>

<p>This post walks through every architectural decision in <a href="https://github.com/kraghavan/schema-travels">schema-travels</a>, including the trade-offs I considered and the bugs that taught me humility.</p>

<hr />

<h2 id="system-overview">System Overview</h2>

<h3 id="architecture-diagram-generated-by-notebooklm">Architecture Diagram (generated by NotebookLM)</h3>
<p><img src="/assets/images/schema-travels/schema-travels-architecture.png" alt="Schema Travels Architecture" />
<em>The Schema Travels architecture: multi-provider AI review with specialized MongoDB and DynamoDB migration flows</em></p>

<h3 id="components">Components</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌──────────────────────────────────────────────────────────────────────────────┐
│                              schema-travels                                  │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐  │
│  │                        Deterministic Engine                            │  │
│  │                                                                        │  │
│  │  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                 │  │
│  │  │  SQL Log    │    │  Access     │    │  Target     │                 │  │
│  │  │  Parser     │───▶│  Pattern    │───▶│  Schema     │                 │  │
│  │  │             │    │  Analyzer   │    │  Designer   │                 │  │
│  │  │ • PostgreSQL│    │             │    │             │                 │  │
│  │  │ • MySQL     │    │ • HotJoins  │    │ • MongoDB   │                 │  │
│  │  │ • 10K+ qps  │    │ • Mutations │    │ • DynamoDB  │                 │  │
│  │  └─────────────┘    │ • Co-access │    │ • Union-Find│                 │  │
│  │                     └─────────────┘    └──────┬──────┘                 │  │
│  └───────────────────────────────────────────────┼────────────────────────┘  │
│                                                  │                           │
│                                    Algorithmic Draft (JSON)                  │
│                                                  │                           │
│  ┌───────────────────────────────────────────────▼────────────────────────┐  │
│  │                         LLM Review Layer                               │  │
│  │                                                                        │  │
│  │  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                 │  │
│  │  │  Provider   │    │  Advisor    │    │  Pydantic   │                 │  │
│  │  │  Protocol   │───▶│  (Reviewer) │───▶│  Validator  │                 │  │
│  │  │             │    │             │    │             │                 │  │
│  │  │ • Claude    │    │ • Critique  │    │ • Bounded   │                 │  │
│  │  │ • GPT-4o    │    │ • Refine    │    │ • Typed     │                 │  │
│  │  │ • Gemini    │    │ • Explain   │    │ • No schema │                 │  │
│  │  │ • Ollama    │    │             │    │   invention │                 │  │
│  │  └─────────────┘    └─────────────┘    └──────┬──────┘                 │  │
│  └───────────────────────────────────────────────┼────────────────────────┘  │
│                                                  │                           │
│                                    Reviewed Design (JSON)                    │
│                                                  │                           │
│  ┌───────────────────────────────────────────────▼────────────────────────┐  │
│  │                         Output &amp; Caching                               │  │
│  │                                                                        │  │
│  │  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                 │  │
│  │  │  Cache      │    │  Terraform  │    │  NoSQL      │                 │  │
│  │  │  Manager    │    │  Generator  │    │  Workbench  │                 │  │
│  │  │             │    │             │    │  Export     │                 │  │
│  │  │ • Strict    │    │ • .tf files │    │             │                 │  │
│  │  │ • Relaxed   │    │ • GSI defs  │    │ • Import    │                 │  │
│  │  │ • ~3s hits  │    │ • Capacity  │    │   ready     │                 │  │
│  │  └─────────────┘    └─────────────┘    └─────────────┘                 │  │
│  └────────────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────────────┘
</code></pre></div></div>

<hr />

<h2 id="component-1-the-statistical-ground-truth">Component 1: The Statistical Ground Truth</h2>

<h3 id="the-problem">The Problem</h3>

<p>You can design a relational database in a vacuum using normal forms. You <em>cannot</em> design a NoSQL database without knowing the access patterns.</p>

<p>Feeding an LLM just the <code class="language-plaintext highlighter-rouge">CREATE TABLE</code> statements produces generic, unoptimized schemas. It’s like asking an architect to design a house without knowing if it’s for a family of four or a fraternity.</p>

<h3 id="my-solution-log-based-pattern-analysis">My Solution: Log-Based Pattern Analysis</h3>

<p>Before any AI is invoked, the pipeline ingests thousands of raw SQL queries (up to 10K in my benchmarks). Two analyzers extract the structural signals:</p>

<table>
  <thead>
    <tr>
      <th>Analyzer</th>
      <th>What It Measures</th>
      <th>Why It Matters</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>HotJoinAnalyzer</strong></td>
      <td>Co-access frequency between tables</td>
      <td>If <code class="language-plaintext highlighter-rouge">orders</code> and <code class="language-plaintext highlighter-rouge">order_items</code> are JOINed in 95% of queries, they belong together</td>
    </tr>
    <tr>
      <td><strong>MutationAnalyzer</strong></td>
      <td>Write patterns per table</td>
      <td>High-mutation tables need different caching strategies</td>
    </tr>
  </tbody>
</table>

<p>The output is a weighted co-access matrix—essentially a graph where edge weights represent how often two tables appear together in queries.</p>

<h3 id="design-decision-static-schema-vs-log-perusal">Design Decision: Static Schema vs. Log Perusal</h3>

<p><strong>Trade-off considered:</strong></p>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>Speed</th>
      <th>Accuracy</th>
      <th>Limitations</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Static schema only</td>
      <td>&lt;1s</td>
      <td>~60%</td>
      <td>Misses actual usage patterns</td>
    </tr>
    <tr>
      <td>Log perusal</td>
      <td>5-30s</td>
      <td>~90%</td>
      <td>Requires query logs</td>
    </tr>
  </tbody>
</table>

<p><strong>My choice:</strong> Log perusal is non-negotiable. A schema where <code class="language-plaintext highlighter-rouge">users</code> and <code class="language-plaintext highlighter-rouge">user_preferences</code> are separate tables tells you nothing about whether they’re accessed together. Query logs tell you everything.</p>

<p><strong>Lesson learned:</strong> Early versions didn’t weight by query frequency. A table JOINed once in a rare admin query got the same weight as one JOINed 10,000 times per hour. Adding frequency weighting dramatically improved recommendations.</p>
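<p>For illustration, here’s a minimal sketch of that frequency weighting. This is not the real analyzer: the <code class="language-plaintext highlighter-rouge">(tables, times_seen)</code> log shape and the normalization are my illustrative assumptions.</p>

```python
from collections import Counter
from itertools import combinations

def build_coaccess_matrix(query_log):
    """Weight each table pair by how often the pair appears across the log.

    query_log: iterable of (tables_in_query, times_seen) pairs -- an assumed
    shape, not the real parser's output.
    """
    weights = Counter()
    for tables, times_seen in query_log:
        for pair in combinations(sorted(set(tables)), 2):
            weights[pair] += times_seen  # frequency weighting, not mere presence
    total = sum(weights.values()) or 1
    return {pair: count / total for pair, count in weights.items()}

log = [
    (["orders", "order_items"], 10_000),  # hot JOIN, runs constantly
    (["users", "orders"], 6_000),
    (["products", "reviews"], 40),        # rare admin query
]
matrix = build_coaccess_matrix(log)
```

<p>Without the <code class="language-plaintext highlighter-rouge">times_seen</code> multiplier, the rare admin JOIN would score the same as the hot path, which is exactly the early-version bug described above.</p>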

<hr />

<h2 id="component-2-mongodb-document-designer">Component 2: MongoDB Document Designer</h2>

<h3 id="the-problem-1">The Problem</h3>

<p>MongoDB’s core design question: <strong>Embed or Reference?</strong></p>

<table>
  <thead>
    <tr>
      <th>Strategy</th>
      <th>When to Use</th>
      <th>Risk</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Embed</strong></td>
      <td>High co-access, bounded growth</td>
      <td>Document bloat (16MB limit)</td>
    </tr>
    <tr>
      <td><strong>Reference</strong></td>
      <td>Independent access, unbounded growth</td>
      <td>N+1 query patterns</td>
    </tr>
  </tbody>
</table>

<p>Getting this wrong is expensive. Embed a million reviews inside a product document and MongoDB will hate you. Reference user addresses that are always fetched with the user and you’ve recreated SQL’s JOIN problem.</p>

<h3 id="my-solution-confidence-weighted-decisions">My Solution: Confidence-Weighted Decisions</h3>

<p>The algorithm takes the co-access score from Component 1 and calculates a <strong>Confidence Score</strong> for each relationship:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Confidence = (co_access_weight × access_factor) + (bounded_growth × stability_factor)
</code></pre></div></div>

<ul>
  <li><strong>≥ 0.85</strong>: Strong embed recommendation (green in output)</li>
  <li><strong>0.70-0.85</strong>: Moderate confidence (yellow)</li>
  <li><strong>&lt; 0.70</strong>: Reference or needs human review (red)</li>
</ul>

<p>The color-coding isn’t cosmetic—it tells developers where to focus their review time.</p>
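<p>As a sketch, the formula and thresholds translate directly into code. The 0.7/0.3 factor split below is my illustrative assumption, not the tool’s actual weighting.</p>

```python
def embed_confidence(co_access_weight, bounded_growth,
                     access_factor=0.7, stability_factor=0.3):
    """Combine co-access and growth-stability signals into one embed score.

    The default factor values are illustrative assumptions.
    """
    return co_access_weight * access_factor + bounded_growth * stability_factor

def classify(score):
    if score >= 0.85:
        return "embed"      # green: strong embed recommendation
    if score >= 0.70:
        return "moderate"   # yellow: moderate confidence, review suggested
    return "reference"      # red: reference or needs human review

classify(embed_confidence(0.95, 1.0))  # high co-access, bounded growth -> embed
classify(embed_confidence(0.90, 0.0))  # unbounded growth drags it to reference
```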

<h3 id="design-decision-schema-only-vs-schema--queries">Design Decision: Schema Only vs. Schema + Queries</h3>

<p><strong>Trade-off considered:</strong></p>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>Output</th>
      <th>Developer Effort</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Schema only</td>
      <td>JSON structure</td>
      <td>Developer writes queries</td>
    </tr>
    <tr>
      <td>Schema + queries</td>
      <td>Structure + aggregation pipelines</td>
      <td>Copy-paste ready</td>
    </tr>
  </tbody>
</table>

<p><strong>My choice:</strong> Generate both. For MongoDB, <code class="language-plaintext highlighter-rouge">schema-travels</code> outputs the exact aggregation pipelines or <code class="language-plaintext highlighter-rouge">findOne</code> queries that replace the original SQL JOINs.</p>

<p><strong>Why this matters:</strong> A schema migration isn’t done when you have a new data model. It’s done when your application code works. Giving developers the target schema <em>and</em> the code to query it cuts migration time significantly.</p>
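<p>To make that concrete, here is the kind of output I mean, sketched by hand rather than copied from the tool. Collection and field names are assumptions; the query shapes follow standard MongoDB syntax.</p>

```python
def reviews_lookup_pipeline(product_id):
    """SQL 'products JOIN reviews ON reviews.product_id = products.id'
    as an aggregation pipeline, for the REFERENCE strategy."""
    return [
        {"$match": {"_id": product_id}},
        {"$lookup": {
            "from": "reviews",            # referenced collection
            "localField": "_id",
            "foreignField": "product_id",
            "as": "reviews",
        }},
    ]

def order_filter(order_id):
    """For the EMBED strategy there is no JOIN left at all: one findOne
    filter returns the order together with its embedded items array."""
    return {"_id": order_id}
```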

<hr />

<h2 id="component-3-dynamodb-paradigm-shift">Component 3: DynamoDB Paradigm Shift</h2>

<h3 id="the-problem-2">The Problem</h3>

<p>DynamoDB isn’t just “MongoDB but AWS.” It’s a fundamentally different paradigm:</p>

<table>
  <thead>
    <tr>
      <th>MongoDB</th>
      <th>DynamoDB</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Flexible queries</td>
      <td>Pre-defined access patterns</td>
    </tr>
    <tr>
      <td>Indexes added later</td>
      <td>GSIs designed upfront</td>
    </tr>
    <tr>
      <td>Nested documents</td>
      <td>Flat, wide-column design</td>
    </tr>
    <tr>
      <td>Query anything</td>
      <td>Query only what you planned for</td>
    </tr>
  </tbody>
</table>

<p>The infamous “Single-Table Design” pattern—where multiple entity types share one table with carefully crafted partition and sort keys—is powerful but alien to SQL developers.</p>

<h3 id="my-solution-union-find-access-clustering">My Solution: Union-Find Access Clustering</h3>

<p>This is where the computer science degree earns its keep. The <code class="language-plaintext highlighter-rouge">DynamoDBDesigner</code> uses a <strong>Union-Find (Disjoint Set Union)</strong> algorithm to cluster SQL tables based on their relationship weights.</p>

<p><strong>How it works:</strong></p>

<ol>
  <li>Each SQL table starts as its own cluster</li>
  <li>For each edge (co-access relationship) above a threshold weight:
    <ul>
      <li>Find the cluster roots of both tables</li>
      <li>Union them if they’re frequently accessed together</li>
    </ul>
  </li>
  <li>Result: clusters of tables that should share a DynamoDB table</li>
</ol>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Input: users, orders, order_items, products, reviews
       
Co-access weights:
  users ←→ orders:      0.92  (high)
  orders ←→ order_items: 0.95  (high)
  products ←→ reviews:   0.45  (low)
  
After Union-Find:
  Cluster 1: {users, orders, order_items}  → Single-table candidate
  Cluster 2: {products}                     → Separate table
  Cluster 3: {reviews}                      → Separate table
</code></pre></div></div>

<p>The algorithm then generates PK/SK patterns and GSI candidates for each cluster.</p>
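<p>The clustering step is small enough to sketch whole. This is a generic Union-Find with path compression run over the worked example above, assuming a 0.7 merge threshold; the real designer’s threshold and tie-breaking may differ.</p>

```python
def cluster_tables(tables, edges, threshold=0.7):
    """Merge tables whose co-access weight meets the threshold (Union-Find)."""
    parent = {t: t for t in tables}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path compression
            t = parent[t]
        return t

    for (a, b), weight in edges.items():
        if weight >= threshold:
            parent[find(a)] = find(b)  # union the two clusters

    clusters = {}
    for t in tables:
        clusters.setdefault(find(t), set()).add(t)
    return list(clusters.values())

tables = ["users", "orders", "order_items", "products", "reviews"]
edges = {("users", "orders"): 0.92,
         ("orders", "order_items"): 0.95,
         ("products", "reviews"): 0.45}
cluster_tables(tables, edges)
```

<p>With the 0.45 products–reviews edge below threshold, this reproduces the three clusters from the diagram: one single-table candidate and two standalone tables.</p>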

<h3 id="design-decision-llm-generation-vs-algorithmic-draft">Design Decision: LLM Generation vs. Algorithmic Draft</h3>

<p><strong>Trade-off considered:</strong></p>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>Consistency</th>
      <th>Quality</th>
      <th>Hallucination Risk</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>LLM generates schema</td>
      <td>Low</td>
      <td>Variable</td>
      <td>High</td>
    </tr>
    <tr>
      <td>Algorithm generates, LLM reviews</td>
      <td>High</td>
      <td>Consistent</td>
      <td>Low</td>
    </tr>
  </tbody>
</table>

<p><strong>My choice:</strong> Algorithms generate the draft; LLMs review it.</p>

<p>Union-Find is deterministic—same input always produces the same clusters. LLMs are not. Asking an LLM to “design a DynamoDB schema” is asking for creative writing. Asking it to “review this algorithmic draft and flag issues” is asking for structured critique.</p>

<p><strong>Lesson learned:</strong> Early versions let the LLM suggest entity renames during review. This broke the mapping back to the algorithmic draft, causing cryptic “Entity Not Found” errors. Now the validator explicitly forbids schema invention.</p>

<hr />

<h2 id="component-4-the-multi-provider-llm-layer">Component 4: The Multi-Provider LLM Layer</h2>

<h3 id="the-problem-3">The Problem</h3>

<p>I needed LLM review capabilities but didn’t want to be locked into one provider. Different providers have different strengths, costs, and availability.</p>

<h3 id="my-solution-protocol-oriented-provider-abstraction">My Solution: Protocol-Oriented Provider Abstraction</h3>

<p>The <code class="language-plaintext highlighter-rouge">LLMProvider</code> protocol defines what any provider must support:</p>

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th>Purpose</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">complete()</code></td>
      <td>Generate a response</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">supports_json_mode</code></td>
      <td>Whether native JSON mode is available</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">name</code>, <code class="language-plaintext highlighter-rouge">model</code></td>
      <td>Provider identification</td>
    </tr>
  </tbody>
</table>

<p>Any class implementing this protocol works with the Advisor:</p>

<table>
  <thead>
    <tr>
      <th>Provider</th>
      <th>Default Model</th>
      <th>Cost</th>
      <th>Best For</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Claude</strong></td>
      <td>claude-sonnet-4</td>
      <td>~$3/1M</td>
      <td>Complex reasoning</td>
    </tr>
    <tr>
      <td><strong>OpenAI</strong></td>
      <td>gpt-4o-mini</td>
      <td>~$0.15/1M</td>
      <td>Cost-effective structured output</td>
    </tr>
    <tr>
      <td><strong>Gemini</strong></td>
      <td>gemini-2.0-flash</td>
      <td>~$0.10/1M</td>
      <td>Speed</td>
    </tr>
    <tr>
      <td><strong>Ollama</strong></td>
      <td>llama3.1:8b</td>
      <td>Free</td>
      <td>Privacy, offline</td>
    </tr>
  </tbody>
</table>
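<p>In Python, the protocol table maps naturally onto <code class="language-plaintext highlighter-rouge">typing.Protocol</code>. This is a paraphrase of the shape with a stubbed provider, not the project’s actual classes.</p>

```python
from typing import Protocol

class LLMProvider(Protocol):
    name: str
    model: str
    supports_json_mode: bool

    def complete(self, prompt: str) -> str:
        """Generate a response for the given prompt."""
        ...

class StubOllamaProvider:
    """Toy implementation; a real one would call the local Ollama HTTP API."""
    name = "ollama"
    model = "llama3.1:8b"
    supports_json_mode = False

    def complete(self, prompt: str) -> str:
        return '{"critique": "stub response"}'

def run_review(provider: LLMProvider, draft_json: str) -> str:
    # The Advisor depends only on the protocol, never a concrete provider type.
    return provider.complete(f"Review this algorithmic draft: {draft_json}")

run_review(StubOllamaProvider(), '{"clusters": []}')
```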

<h3 id="design-decision-agentic-generation-vs-agentic-review">Design Decision: Agentic Generation vs. Agentic Review</h3>

<p><strong>Trade-off considered:</strong></p>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>LLM Role</th>
      <th>Control</th>
      <th>Consistency</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Agentic generation</td>
      <td>Creator</td>
      <td>Low</td>
      <td>Low</td>
    </tr>
    <tr>
      <td>Agentic review</td>
      <td>Critic</td>
      <td>High</td>
      <td>High</td>
    </tr>
  </tbody>
</table>

<p><strong>My choice:</strong> The LLM is a <em>reviewer</em>, not a <em>creator</em>.</p>

<p>The prompt never says “design a schema.” It says “here is the algorithmic draft—critique it.” The LLM can flag issues:</p>

<blockquote>
  <p>“The algorithm suggested Single-Table Design here, but this Partition Key has low cardinality and will cause a hot partition. Consider adding a write-sharding suffix.”</p>
</blockquote>

<p>But it cannot invent new tables or fundamentally restructure the design. Pydantic validators enforce this boundary.</p>
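<p>A minimal sketch of that boundary, assuming Pydantic v2 and a toy model shape (the project’s real models are richer):</p>

```python
from pydantic import BaseModel, field_validator

ALLOWED_ENTITIES = {"users", "orders", "order_items"}  # from the algorithmic draft

class EntityReview(BaseModel):
    entity: str
    critique: str

    @field_validator("entity")
    @classmethod
    def reject_invented_entities(cls, value: str) -> str:
        # The LLM may critique entities, but never invent or rename them.
        if value not in ALLOWED_ENTITIES:
            raise ValueError(f"unknown entity {value!r}: not in algorithmic draft")
        return value

EntityReview(entity="orders", critique="PK cardinality looks low here")
```

<p>An LLM response that renames <code class="language-plaintext highlighter-rouge">orders</code> to <code class="language-plaintext highlighter-rouge">CustomerOrders</code> now fails validation loudly instead of silently breaking the mapping back to the draft.</p>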

<hr />

<h2 id="component-5-caching--infrastructure-output">Component 5: Caching &amp; Infrastructure Output</h2>

<h3 id="the-problem-4">The Problem</h3>

<p>LLM calls are slow (2-30 seconds) and expensive. Re-analyzing the same schema during CI/CD is wasteful.</p>

<h3 id="my-solution-dual-mode-hashing">My Solution: Dual-Mode Hashing</h3>

<table>
  <thead>
    <tr>
      <th>Mode</th>
      <th>Hash Includes</th>
      <th>Use Case</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Relaxed</strong></td>
      <td>Schema structure, log patterns</td>
      <td>Development iteration</td>
    </tr>
    <tr>
      <td><strong>Strict</strong></td>
      <td>Full payload including metadata</td>
      <td>Production, testing</td>
    </tr>
  </tbody>
</table>

<p><strong>Relaxed mode</strong> hashes only the structural elements. Minor metadata changes hit the cache, dropping a 25-second pipeline to ~3 seconds.</p>

<p><strong>Strict mode</strong> requires cryptographic match of the entire input. Essential for reproducible E2E testing.</p>
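<p>The dual-mode split is easy to sketch. The payload fields and the SHA-256 choice here are illustrative assumptions, not the actual cache manager’s internals.</p>

```python
import hashlib
import json

def cache_key(schema, log_patterns, metadata, mode="relaxed"):
    """Relaxed hashes only structural inputs; strict hashes the full payload."""
    payload = {"schema": schema, "patterns": log_patterns}
    if mode == "strict":
        payload["metadata"] = metadata  # timestamps, run ids, target mode, ...
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

schema = {"users": ["id", "email"]}
patterns = {"hot_joins": [["users", "orders"]]}

a = cache_key(schema, patterns, {"run_id": 1})
b = cache_key(schema, patterns, {"run_id": 2})
# relaxed: a metadata-only change still hits the cache (a == b)
```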

<h3 id="closing-the-loop-infrastructure-as-code">Closing the Loop: Infrastructure as Code</h3>

<p>A schema design on paper is useless if it can’t be deployed. The pipeline terminates by generating actual IaC:</p>

<table>
  <thead>
    <tr>
      <th>Output</th>
      <th>Format</th>
      <th>Use</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Terraform</strong></td>
      <td><code class="language-plaintext highlighter-rouge">.tf</code></td>
      <td><code class="language-plaintext highlighter-rouge">terraform apply</code> ready</td>
    </tr>
    <tr>
      <td><strong>NoSQL Workbench</strong></td>
      <td><code class="language-plaintext highlighter-rouge">.json</code></td>
      <td>Visual modeling, data import</td>
    </tr>
  </tbody>
</table>

<p>The Terraform output includes table definitions, GSI configurations, and capacity settings. Not a template—actual runnable infrastructure.</p>

<p><strong>Lesson learned:</strong> My Terraform formatter had a subtle bug—it generated <code class="language-plaintext highlighter-rouge">$${table_name}</code> instead of <code class="language-plaintext highlighter-rouge">"${table_name}"</code> in some edge cases. The E2E test matrix caught this because it validated the <code class="language-plaintext highlighter-rouge">.tf</code> files could be parsed by Terraform’s HCL parser.</p>
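<p>A tiny check in the spirit of that E2E validation. The regex and naming are mine, and parsing the output with Terraform’s real HCL parser is strictly better when available.</p>

```python
import re

# HCL treats '$${...}' as escaped literal text, not a variable interpolation.
ESCAPED_INTERPOLATION = re.compile(r"\$\$\{[^}]*\}")

def find_escaped_interpolations(tf_text: str):
    """Flag '$${...}' sequences that would render literally instead of resolving."""
    return ESCAPED_INTERPOLATION.findall(tf_text)

good = 'resource "aws_dynamodb_table" "t" { name = "${var.table_name}" }'
bad = 'resource "aws_dynamodb_table" "t" { name = "$${table_name}" }'

find_escaped_interpolations(bad)  # catches the formatter bug described above
```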

<hr />

<h2 id="what-id-do-differently">What I’d Do Differently</h2>

<h3 id="1-dynamodb-query-translation-from-day-one">1. DynamoDB Query Translation from Day One</h3>

<p>The MongoDB module translates SQL JOINs to aggregation pipelines. The DynamoDB module only outputs schemas—no Boto3 query code. This asymmetry bothers me.</p>

<p>Generating <code class="language-plaintext highlighter-rouge">dynamodb.query(KeyConditionExpression=...)</code> calls alongside the schema would make the DynamoDB path as developer-friendly as MongoDB.</p>
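<p>Here’s roughly what that generated code could look like, using the low-level DynamoDB Query parameter shape. The PK/SK conventions and table name are my assumptions about a single-table layout, not the tool’s output.</p>

```python
def user_orders_query_params(table_name: str, user_id: str) -> dict:
    """'SELECT ... FROM orders JOIN order_items WHERE user_id = ?' becomes one
    Query over the USER#<id> item collection in a single-table design."""
    return {
        "TableName": table_name,
        "KeyConditionExpression": "PK = :pk AND begins_with(SK, :sk)",
        "ExpressionAttributeValues": {
            ":pk": {"S": f"USER#{user_id}"},
            ":sk": {"S": "ORDER#"},
        },
    }

params = user_orders_query_params("app_main", "42")
# pass as: boto3.client("dynamodb").query(**params)
```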

<h3 id="2-stricter-entity-name-validation">2. Stricter Entity Name Validation</h3>

<p>The LLM occasionally renames entities during review (“I’ll call this <code class="language-plaintext highlighter-rouge">CustomerOrders</code> instead of <code class="language-plaintext highlighter-rouge">user_orders</code>”). This breaks the mapping back to the algorithmic draft.</p>

<p>The fix was a Pydantic validator that rejects any entity name not in the original input. But I should have anticipated this—LLMs love to be “helpful” by renaming things.</p>

<h3 id="3-cache-key-design-up-front">3. Cache Key Design Up Front</h3>

<p>I added the <code class="language-plaintext highlighter-rouge">dynamodb_mode</code> parameter to cache keys late. This meant cached results from <code class="language-plaintext highlighter-rouge">auto</code> mode were incorrectly served for <code class="language-plaintext highlighter-rouge">single</code> mode requests.</p>

<p>Cache key design should happen during architecture, not debugging.</p>

<hr />

<h2 id="the-proof-what-actually-happened">The Proof: What Actually Happened</h2>

<p>Let me show you what this architecture produces in practice. I ran the E2E test matrix across 4 providers × 2 targets on a 42-table e-commerce schema with 100 synthetic queries:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────────┬──────────────┬──────────────┐
│  Provider   │   MongoDB    │   DynamoDB   │
├─────────────┼──────────────┼──────────────┤
│ claude      │ ✓ PASS       │ ✓ PASS       │
│ openai      │ ✓ PASS       │ ✓ PASS       │
│ gemini      │ ✓ PASS       │ ✓ PASS       │
│ ollama      │ ✓ PASS       │ ✓ PASS       │
└─────────────┴──────────────┴──────────────┘
Total: 8 tests | Passed: 8 | Failed: 0
</code></pre></div></div>

<p><strong>What the numbers tell us:</strong></p>

<table>
  <thead>
    <tr>
      <th>Provider</th>
      <th>DynamoDB Design</th>
      <th>Confidence</th>
      <th>GSIs Generated</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Claude</strong></td>
      <td>single_table</td>
      <td>85%</td>
      <td>4 (overloaded)</td>
    </tr>
    <tr>
      <td><strong>OpenAI</strong></td>
      <td>single_table</td>
      <td>77.5%</td>
      <td>5 (descriptive names)</td>
    </tr>
    <tr>
      <td><strong>Gemini</strong></td>
      <td>single_table</td>
      <td>75%</td>
      <td>5 (direct attributes)</td>
    </tr>
    <tr>
      <td><strong>Ollama</strong> (local)</td>
      <td>multi_table</td>
      <td>80%</td>
      <td>per-table</td>
    </tr>
  </tbody>
</table>

<p>Here’s the insight that makes this architecture work: <strong>the algorithmic clusters are identical across all providers</strong>. Products had 82 accesses. Users had 54. Orders had 32. The Union-Find algorithm doesn’t care which LLM you’re using—it produces the same deterministic foundation every time.</p>

<p>But the <em>interpretation</em> differs. Three cloud providers looked at the same clusters and said “single-table design with GSI overloading.” The local model (gemma3:4b running on my Mac Mini) said “multi-table is fine here.”</p>

<p>Just like every architecture review I’ve ever been in—except this one finished in 22 seconds and nobody rage-quit to “work from home.”</p>

<p>Who’s right? Honestly, both approaches are defensible for this workload. The point isn’t that one answer is correct—it’s that the AI is <em>reviewing</em> a solid algorithmic foundation, not hallucinating schemas from scratch.</p>

<p><strong>MongoDB showed similar patterns:</strong></p>

<p>The Claude review of MongoDB produced 12 recommendations with confidence-scored decisions:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">orders → order_items</code>: <strong>EMBED</strong> (95% confidence) — “Perfect co-access, always accessed together”</li>
  <li><code class="language-plaintext highlighter-rouge">products → reviews</code>: <strong>REFERENCE</strong> (90% confidence) — “Unbounded growth, popular products can have thousands of reviews”</li>
  <li><code class="language-plaintext highlighter-rouge">users → addresses</code>: <strong>EMBED</strong> (85% confidence) — “High co-access, addresses typically accessed with user profile”</li>
</ul>

<p>That 90% confidence REFERENCE decision for reviews? That’s exactly the kind of nuanced judgment that makes NoSQL design hard. The algorithm detected high co-access, but the AI correctly identified the unbounded growth risk that would blow past MongoDB’s 16MB document limit.</p>

<p>More than 40 tables migrated, zero DBAs traumatized. Though if your schema has more foreign keys than a hotel concierge, maybe grab coffee first.</p>

<hr />

<h2 id="the-bigger-picture-what-this-actually-solves">The Bigger Picture: What This Actually Solves</h2>

<p><strong>The Old Way:</strong> A senior engineer spends 2-3 weeks analyzing query logs, drawing ER diagrams on whiteboards, debating embed-vs-reference decisions in meetings, and manually translating that into DynamoDB access patterns. Then they write Terraform by hand and pray they didn’t miss a GSI.</p>

<p><strong>The New Way:</strong> Feed the tool your PostgreSQL logs and schema. Get a reviewed, confidence-scored design with deployable Terraform in 22 seconds. Spend those 2-3 weeks on the parts that actually require human judgment—data migration strategy, application refactoring, rollback planning.</p>

<p>This isn’t about replacing engineers. It’s about replacing the <em>tedious parts</em> of engineering so we can focus on the <em>interesting parts</em>.</p>

<h3 id="the-core-innovation-bounded-ai">The Core Innovation: Bounded AI</h3>

<p>The fundamental insight behind this architecture is that <strong>LLMs should critique, not create</strong>.</p>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>What Can Go Wrong</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>“LLM, design me a schema”</td>
      <td>Hallucinated tables, inconsistent naming, unbounded creativity</td>
    </tr>
    <tr>
      <td>“LLM, review this algorithmic draft”</td>
      <td>Bounded feedback, structured output, deterministic baseline</td>
    </tr>
  </tbody>
</table>

<p>By forcing the AI into a reviewer role with Pydantic-enforced boundaries, we get the benefits of LLM intelligence (business context, heuristic reasoning, natural language explanations) without the chaos of unbounded generation.</p>

<p>This pattern—<strong>algorithmic draft + bounded LLM review</strong>—is applicable far beyond schema migration. It’s how I’d approach any problem where you need AI assistance but can’t afford hallucinations.</p>

<hr />

<h2 id="whats-next-graphrag-for-enterprise-scale">What’s Next: GraphRAG for Enterprise Scale</h2>

<p>The current implementation handles schemas with 10-50 tables comfortably. But what happens when you’re migrating a legacy enterprise system with 500 tables? 1,000?</p>

<p>At that scale, the co-access matrix becomes unwieldy and the visualization becomes spaghetti. Union-Find itself stays near-linear, but the number of candidate table pairs grows quadratically, and so does the human review burden.</p>

<p><strong>The roadmap:</strong> GraphRAG-powered schema analysis.</p>

<p>The idea is straightforward:</p>
<ol>
  <li><strong>Store the schema graph</strong> in a proper graph database (NetworkX + SQLite for now, Neo4j for enterprise)</li>
  <li><strong>Embed table semantics</strong> using sentence transformers</li>
  <li><strong>Query with natural language</strong>: “Show me all tables related to order fulfillment” or “Which clusters would be affected if we split the users table?”</li>
</ol>

<p>This transforms schema-travels from a migration tool into a <strong>schema intelligence platform</strong>. Instead of processing everything at once, you can explore the graph, ask questions, and migrate incrementally—cluster by cluster, bounded by what your team can absorb.</p>
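<p>Under the hood, step 3’s “show me related tables” query is graph traversal. A hedged BFS sketch over a toy adjacency map (the real roadmap layers semantic embeddings on top of this):</p>

```python
from collections import deque

def related_tables(adjacency, start, max_hops=2):
    """Tables reachable from `start` within max_hops co-access edges (BFS)."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # don't expand past the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return seen - {start}

graph = {
    "users": ["orders"],
    "orders": ["users", "order_items"],
    "order_items": ["orders"],
    "products": ["reviews"],
    "reviews": ["products"],
}
related_tables(graph, "users")
```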

<hr />

<h2 id="coming-up-part-2">Coming Up: Part 2</h2>

<p>In Part 2, we’ll dive deeper into the E2E test results:</p>

<ul>
  <li><strong>Latency breakdown</strong>: 22 seconds for first run → 3 seconds cached</li>
  <li><strong>Provider comparison</strong>: Why Claude, OpenAI, and Gemini agreed on single-table while Ollama chose multi-table</li>
  <li><strong>The cache drift bug</strong>: How the test matrix caught an inconsistency in DynamoDB Terraform output</li>
  <li><strong>Cost analysis</strong>: Running the full matrix cost less than $0.50 in API calls</li>
</ul>

<hr />

<h2 id="final-thoughts">Final Thoughts</h2>

<p>I’ve been building distributed systems for over a decade. In that time, I’ve seen plenty of “AI-powered” tools that are really just prompt wrappers—impressive demos that fall apart the moment you need reproducibility.</p>

<p><code class="language-plaintext highlighter-rouge">schema-travels</code> is my answer to the question: <em>How do you build AI-assisted tooling that a principal engineer would actually sign off on?</em></p>

<p>The answer, it turns out, is the same principle that makes distributed systems reliable: <strong>don’t trust any single component</strong>. Algorithms provide the deterministic foundation. LLMs provide the intelligence. Pydantic provides the guardrails. Caching provides the economics. And an 8-test E2E matrix provides the confidence that it all actually works.</p>

<p>The schema migration problem was just the vehicle. The real artifact is an architecture pattern for building AI tools that are auditable, reproducible, and won’t make your on-call engineer cry at 3am.</p>

<hr />

<p><strong>GitHub:</strong> <a href="https://github.com/kraghavan/schema-travels">github.com/kraghavan/schema-travels</a></p>

<p><em>Questions, feedback, or war stories from your own migrations? Connect with me on <a href="https://linkedin.com/in/karthikaraghavan">LinkedIn</a>.</em></p>]]></content><author><name>Karthika Raghavan</name></author><summary type="html"><![CDATA[Translating SQL to NoSQL: Architecture Deep-Dive]]></summary></entry><entry><title type="html">Building a Privacy-Aware LLM Gateway: Benchmarking Results</title><link href="https://kraghavan.github.io/llm/infrastructure/benchmarks/2026/03/21/inference-sentinel-benchmarks2.html" rel="alternate" type="text/html" title="Building a Privacy-Aware LLM Gateway: Benchmarking Results" /><published>2026-03-21T00:00:00+00:00</published><updated>2026-03-21T00:00:00+00:00</updated><id>https://kraghavan.github.io/llm/infrastructure/benchmarks/2026/03/21/inference-sentinel-benchmarks2</id><content type="html" xml:base="https://kraghavan.github.io/llm/infrastructure/benchmarks/2026/03/21/inference-sentinel-benchmarks2.html"><![CDATA[<h1 id="building-a-privacy-aware-llm-gateway-benchmarking-results">Building a Privacy-Aware LLM Gateway: Benchmarking Results</h1>

<p><em>Part 2 of 2: Empirical evaluation of classification accuracy, routing performance, and cost attribution</em></p>

<hr />

<h2 id="abstract">Abstract</h2>

<p>In <a href="/llm/infrastructure/smart%20gateway/python/2026/03/20/inference-sentinel-architecture.html">Part 1</a>, I described the architecture of inference-sentinel, a privacy-aware LLM routing gateway. This post presents empirical results from five experiments evaluating classification accuracy, routing latency, cost efficiency, controller effectiveness, and session stickiness.</p>

<p><strong>Key findings:</strong></p>
<ul>
  <li>The hybrid classifier achieves 97.5% accuracy with 0.16ms mean latency, though a systematic failure mode in Tier 3 detection shows that lightweight NER alone cannot be trusted to catch healthcare identifiers</li>
  <li>Local inference introduces a 10× latency penalty compared to cloud backends, a fundamental trade-off for privacy preservation</li>
  <li>Routing 47.5% of traffic locally yields 68.6% cost savings, with an unexpected finding that Google’s Gemini is 44× cheaper per request than Anthropic’s Claude</li>
  <li>The closed-loop controller correctly withheld recommendations when local-cloud quality divergence exceeded thresholds</li>
</ul>

<hr />

<h2 id="1-experimental-setup">1. Experimental Setup</h2>

<h3 id="11-hardware-configuration">1.1 Hardware Configuration</h3>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>Specification</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Gateway Host</strong></td>
      <td>Docker container (inference-sentinel)</td>
    </tr>
    <tr>
      <td><strong>Local Inference</strong></td>
      <td>Apple Mac Mini M4, 16GB unified memory</td>
    </tr>
    <tr>
      <td><strong>Local Models</strong></td>
      <td>Ollama serving gemma3:4b and mistral (round-robin)</td>
    </tr>
    <tr>
      <td><strong>Cloud Backends</strong></td>
      <td>Claude Sonnet 4 (Anthropic), Gemini 2.0 Flash (Google)</td>
    </tr>
  </tbody>
</table>

<h3 id="12-dataset">1.2 Dataset</h3>

<p>I constructed a synthetic evaluation dataset of 200 prompts with known ground-truth privacy labels, balanced across four tiers:</p>

<table>
  <thead>
    <tr>
      <th>Tier</th>
      <th>Label</th>
      <th>Count</th>
      <th>Example Patterns</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0</td>
      <td>PUBLIC</td>
      <td>50</td>
      <td>General knowledge questions</td>
    </tr>
    <tr>
      <td>1</td>
      <td>INTERNAL</td>
      <td>50</td>
      <td>Project codes, internal URLs</td>
    </tr>
    <tr>
      <td>2</td>
      <td>CONFIDENTIAL</td>
      <td>50</td>
      <td>Email addresses, phone numbers</td>
    </tr>
    <tr>
      <td>3</td>
      <td>RESTRICTED</td>
      <td>50</td>
      <td>SSNs, credit cards, health records</td>
    </tr>
  </tbody>
</table>

<p>The balanced design enables per-class precision/recall analysis without class imbalance confounds.</p>

<h3 id="13-experimental-protocol">1.3 Experimental Protocol</h3>

<p>Each experiment was run independently with the gateway in a fresh state. Metrics were collected via Prometheus and exported to JSON for analysis. All experiments used the same dataset to enable cross-experiment comparison.</p>

<hr />

<h2 id="2-experiment-1-classification-accuracy">2. Experiment 1: Classification Accuracy</h2>

<p><strong>Research Question:</strong> How accurately does the hybrid classifier assign privacy tiers, and what are the failure modes?</p>

<h3 id="21-results">2.1 Results</h3>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Overall Accuracy</strong></td>
      <td>97.5% (195/200)</td>
    </tr>
    <tr>
      <td><strong>Mean Classification Time</strong></td>
      <td>0.16ms</td>
    </tr>
    <tr>
      <td><strong>Misclassifications</strong></td>
      <td>5</td>
    </tr>
  </tbody>
</table>

<p><img src="/assets/images/inference-sentinel/tier_metrics.png" alt="Classification Metrics by Tier" />
<em>Figure 1: Per-tier precision, recall, and F1 scores. The dashed line indicates the 95% threshold. Tier 3 recall (90%) falls below threshold due to undetected health insurance identifiers.</em></p>

<h3 id="22-per-tier-analysis">2.2 Per-Tier Analysis</h3>

<table>
  <thead>
    <tr>
      <th>Tier</th>
      <th>Precision</th>
      <th>Recall</th>
      <th>F1</th>
      <th>TP</th>
      <th>FP</th>
      <th>FN</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0 (PUBLIC)</td>
      <td>90.9%</td>
      <td>100%</td>
      <td>95.2%</td>
      <td>50</td>
      <td>5</td>
      <td>0</td>
    </tr>
    <tr>
      <td>1 (INTERNAL)</td>
      <td>100%</td>
      <td>100%</td>
      <td>100%</td>
      <td>50</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <td>2 (CONFIDENTIAL)</td>
      <td>100%</td>
      <td>100%</td>
      <td>100%</td>
      <td>50</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <td>3 (RESTRICTED)</td>
      <td>100%</td>
      <td>90%</td>
      <td>94.7%</td>
      <td>45</td>
      <td>0</td>
      <td>5</td>
    </tr>
  </tbody>
</table>

<p>The asymmetry between Tier 0 and Tier 3 is notable: Tier 0 has 5 false positives (items incorrectly classified as PUBLIC), while Tier 3 has 5 false negatives (restricted items missed). These are the same 5 samples — Tier 3 prompts misclassified as Tier 0.</p>

<p><img src="/assets/images/inference-sentinel/confusion_matrix.png" alt="Confusion Matrix" />
<em>Figure 2: Confusion matrix showing classification performance. Off-diagonal entries appear only in the Tier 3→Tier 0 cell, indicating a systematic failure mode rather than random errors.</em></p>

<h3 id="23-error-analysis">2.3 Error Analysis</h3>

<p>All five misclassifications share a common pattern:</p>

<table>
  <thead>
    <tr>
      <th>Prompt ID</th>
      <th>Text Preview</th>
      <th>Expected</th>
      <th>Predicted</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>tier3_277eb0aa</td>
      <td>“Patient Alyssa Nelson, health insurance ID: FRJ508021882, prescribed Lisinopril 10mg daily.”</td>
      <td>Tier 3</td>
      <td>Tier 0</td>
    </tr>
    <tr>
      <td>tier3_67f1698c</td>
      <td>“Patient Joseph Hill, health insurance ID: WRY526164289, prescribed Metformin 25mg twice daily.”</td>
      <td>Tier 3</td>
      <td>Tier 0</td>
    </tr>
    <tr>
      <td>tier3_8989c5ab</td>
      <td>“Patient Michael Weaver, health insurance ID: IIP473415078, prescribed Atorvastatin 25mg twice daily.”</td>
      <td>Tier 3</td>
      <td>Tier 0</td>
    </tr>
    <tr>
      <td>tier3_02ed73ce</td>
      <td>“Patient Levi Fowler, health insurance ID: WRV424211872, prescribed Metformin 500mg with meals.”</td>
      <td>Tier 3</td>
      <td>Tier 0</td>
    </tr>
    <tr>
      <td>tier3_bb134f02</td>
      <td>“Patient Brandon Davis, health insurance ID: SIE176051319, prescribed Metformin 10mg daily.”</td>
      <td>Tier 3</td>
      <td>Tier 0</td>
    </tr>
  </tbody>
</table>

<p><strong>Root Cause Analysis:</strong></p>

<p>NER was enabled during this benchmark, yet the <code class="language-plaintext highlighter-rouge">PERSON_NAME</code> entities were not detected. The failures stem from two factors:</p>

<ol>
  <li>
    <p><strong>Missing regex pattern:</strong> The health insurance ID format (<code class="language-plaintext highlighter-rouge">[A-Z]{3}\d{9}</code>) is not covered by existing Tier 3 patterns, which target SSNs (<code class="language-plaintext highlighter-rouge">\d{3}-\d{2}-\d{4}</code>), credit cards (Luhn-valid sequences), and MRN patterns (<code class="language-plaintext highlighter-rouge">MRN:\s*\d+</code>).</p>
  </li>
  <li>
    <p><strong>NER model limitation:</strong> The HuggingFace Transformers BERT model (<code class="language-plaintext highlighter-rouge">dslim/bert-base-NER</code>) failed to recognize the person names in medical record context. The phrase structure “Patient [Name], health insurance ID…” appears to confuse the model — likely because “Patient” is parsed as part of the name span, or the surrounding medical terminology disrupts entity boundary detection.</p>
  </li>
</ol>

<p>Examining the detection results:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"expected_entities"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"PERSON_NAME"</span><span class="p">,</span><span class="w"> </span><span class="s2">"PERSON_NAME"</span><span class="p">],</span><span class="w">
  </span><span class="nl">"detected_entities"</span><span class="p">:</span><span class="w"> </span><span class="p">[]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>The NER model returned zero entities despite clear person names being present. This is a known limitation of lightweight NER models on domain-specific text — medical, legal, and financial documents often require fine-tuned models.</p>

<p><strong>Implication:</strong> The 10% false negative rate on Tier 3 represents exactly the failure mode that matters most in a privacy system — restricted data being classified as public. This is not acceptable for production deployment without remediation.</p>

<h3 id="24-remediation">2.4 Remediation</h3>

<p>Three complementary approaches:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Fix 1: Add regex pattern for health insurance IDs
</span><span class="n">PATTERNS</span><span class="p">[</span><span class="s">"health_insurance"</span><span class="p">]</span> <span class="o">=</span> <span class="sa">r</span><span class="s">'\b(?:health\s*insurance\s*id|member\s*id)[:\s]*[A-Z]{2,4}\d{8,12}\b'</span>

<span class="c1"># Fix 2: Add "Patient [Name]" pattern for medical contexts
</span><span class="n">PATTERNS</span><span class="p">[</span><span class="s">"patient_name"</span><span class="p">]</span> <span class="o">=</span> <span class="sa">r</span><span class="s">'\bPatient\s+[A-Z][a-z]+\s+[A-Z][a-z]+\b'</span>

<span class="c1"># Fix 3: Consider larger NER model for production
# "accurate" mode uses Jean-Baptiste/roberta-large-ner-english
</span></code></pre></div></div>

<p>The regex-first approach is particularly important here: rather than relying solely on NER for entity detection, adding domain-specific patterns provides a deterministic safety net for known sensitive formats.</p>
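<p>A quick sanity check (a sketch, not the project’s test suite) confirms the two new patterns catch the prompts that slipped through. Compiling with <code>re.IGNORECASE</code> is my assumption; the pattern as written matches a lowercase “id” while the failing prompts use “ID”:</p>

```python
import re

# Hypothetical recreation of Fix 1 and Fix 2, compiled case-insensitively
# (an assumption -- the raw pattern would otherwise miss the uppercase "ID").
PATTERNS = {
    "health_insurance": r'\b(?:health\s*insurance\s*id|member\s*id)[:\s]*[A-Z]{2,4}\d{8,12}\b',
    "patient_name": r'\bPatient\s+[A-Z][a-z]+\s+[A-Z][a-z]+\b',
}

# One of the five misclassified prompts from Section 2.3
prompt = ("Patient Alyssa Nelson, health insurance ID: FRJ508021882, "
          "prescribed Lisinopril 10mg daily.")

# Both deterministic patterns now fire, independent of the NER model
hits = [name for name, pat in PATTERNS.items()
        if re.search(pat, prompt, re.IGNORECASE)]
```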

<hr />

<h2 id="3-experiment-2-routing-performance">3. Experiment 2: Routing Performance</h2>

<p><strong>Research Question:</strong> What latency overhead does the gateway introduce, and how does local inference compare to cloud?</p>

<h3 id="31-results">3.1 Results</h3>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Total Requests</strong></td>
      <td>200</td>
    </tr>
    <tr>
      <td><strong>Successful</strong></td>
      <td>183 (91.5%)</td>
    </tr>
    <tr>
      <td><strong>Failed</strong></td>
      <td>17 (8.5%)</td>
    </tr>
    <tr>
      <td><strong>Total Duration</strong></td>
      <td>1,421 seconds</td>
    </tr>
    <tr>
      <td><strong>Throughput</strong></td>
      <td>0.13 req/s</td>
    </tr>
  </tbody>
</table>

<p><img src="/assets/images/inference-sentinel/latency_distribution.png" alt="Latency Distribution" />
<em>Figure 3: End-to-end latency distribution showing heavy right tail. The p99 latency (62.9s) is 26× higher than p50 (2.4s), indicating high variance primarily from local inference.</em></p>

<h3 id="32-latency-decomposition">3.2 Latency Decomposition</h3>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>Mean Latency</th>
      <th>% of Total</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Classification</td>
      <td>1.47ms</td>
      <td>0.02%</td>
    </tr>
    <tr>
      <td>Routing Decision</td>
      <td>0.22ms</td>
      <td>0.003%</td>
    </tr>
    <tr>
      <td>Inference</td>
      <td>7,729ms</td>
      <td>99.98%</td>
    </tr>
  </tbody>
</table>

<p><strong>Key Finding:</strong> The gateway overhead (classification + routing) is <strong>1.69ms</strong> — effectively invisible relative to inference time. The privacy-aware routing layer does not meaningfully impact end-to-end latency.</p>

<h3 id="33-latency-by-route">3.3 Latency by Route</h3>

<table>
  <thead>
    <tr>
      <th>Route</th>
      <th>Tier</th>
      <th>Count</th>
      <th>Mean</th>
      <th>p50</th>
      <th>p95</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Cloud</td>
      <td>0 (PUBLIC)</td>
      <td>50</td>
      <td>1,669ms</td>
      <td>1,755ms</td>
      <td>2,737ms</td>
    </tr>
    <tr>
      <td>Cloud</td>
      <td>1 (INTERNAL)</td>
      <td>50</td>
      <td>1,571ms</td>
      <td>1,182ms</td>
      <td>3,059ms</td>
    </tr>
    <tr>
      <td>Local</td>
      <td>2 (CONFIDENTIAL)</td>
      <td>42</td>
      <td>15,523ms</td>
      <td>9,949ms</td>
      <td>60,722ms</td>
    </tr>
    <tr>
      <td>Local</td>
      <td>3 (RESTRICTED)</td>
      <td>36</td>
      <td>16,643ms</td>
      <td>5,959ms</td>
      <td>48,418ms</td>
    </tr>
  </tbody>
</table>

<p><strong>The latency trade-off is stark:</strong> Local inference is approximately <strong>10× slower</strong> than cloud. This is the fundamental cost of privacy preservation with consumer-grade hardware.</p>

<h3 id="34-error-analysis">3.4 Error Analysis</h3>

<p>All 17 failures returned <code class="language-plaintext highlighter-rouge">HTTP 503: No healthy local backends available</code>. Examining the error distribution:</p>

<table>
  <thead>
    <tr>
      <th>Tier</th>
      <th>Failures</th>
      <th>Total Requests</th>
      <th>Failure Rate</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Tier 2</td>
      <td>8</td>
      <td>50</td>
      <td>16%</td>
    </tr>
    <tr>
      <td>Tier 3</td>
      <td>9</td>
      <td>50</td>
      <td>18%</td>
    </tr>
    <tr>
      <td>Tier 0-1</td>
      <td>0</td>
      <td>100</td>
      <td>0%</td>
    </tr>
  </tbody>
</table>

<p>Failures occurred exclusively on local-routed traffic. The root cause is <strong>memory pressure</strong>: the Mac Mini M4 with 16GB unified memory struggles to serve concurrent requests across two loaded models (gemma3:4b ≈ 3GB, mistral ≈ 4GB).</p>

<p><strong>Mitigation strategies:</strong></p>
<ol>
  <li>Reduce to single local model (eliminates round-robin memory contention)</li>
  <li>Implement request queuing with backpressure</li>
  <li>Upgrade to 32GB+ RAM for concurrent model serving</li>
  <li>Increase health check timeout to tolerate transient memory pressure</li>
</ol>
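<p>Mitigation 2 can be sketched with a bounded semaphore in front of the local backends: queue briefly for a slot, then shed load instead of letting Ollama hit memory pressure. A minimal illustration, where <code>max_inflight</code> and <code>queue_timeout_s</code> are hypothetical knobs rather than inference-sentinel configuration:</p>

```python
import asyncio

class LocalBackendGate:
    """Backpressure sketch: cap concurrent local-inference requests."""

    def __init__(self, max_inflight: int = 2, queue_timeout_s: float = 30.0):
        self._sem = asyncio.Semaphore(max_inflight)
        self._timeout = queue_timeout_s

    async def run(self, infer):
        try:
            # Queue for a slot instead of piling requests onto a 16GB host
            await asyncio.wait_for(self._sem.acquire(), timeout=self._timeout)
        except asyncio.TimeoutError:
            # Shed load explicitly; the gateway can map this to HTTP 503
            raise RuntimeError("local backend saturated")
        try:
            return await infer()
        finally:
            self._sem.release()

async def _demo():
    gate = LocalBackendGate(max_inflight=1, queue_timeout_s=0.01)

    async def slow():
        await asyncio.sleep(0.05)
        return "ok"

    first = asyncio.create_task(gate.run(slow))
    await asyncio.sleep(0.005)        # let the first request take the slot
    try:
        await gate.run(slow)          # second request times out in the queue
        shed = False
    except RuntimeError:
        shed = True
    return await first, shed

result, shed = asyncio.run(_demo())
```

<p>The demo reproduces the behavior seen in Section 3.4: one request holds the only slot, and the queued request is shed with the error the gateway would surface as HTTP 503.</p>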

<hr />

<h2 id="4-experiment-3-cost-attribution">4. Experiment 3: Cost Attribution</h2>

<p><strong>Research Question:</strong> What are the realized cost savings from local routing, and how do cloud backends compare?</p>

<h3 id="41-results">4.1 Results</h3>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Actual Cost</strong></td>
      <td>$0.0844</td>
    </tr>
    <tr>
      <td><strong>Hypothetical All-Cloud</strong></td>
      <td>$0.2687</td>
    </tr>
    <tr>
      <td><strong>Savings</strong></td>
      <td>$0.1843 (68.6%)</td>
    </tr>
  </tbody>
</table>

<p><img src="/assets/images/inference-sentinel/cost_comparison.png" alt="Cost Attribution" />
<em>Figure 4: Cost comparison showing actual spend vs. hypothetical all-cloud routing. Local routing of Tier 2-3 traffic yields 68.6% cost reduction.</em></p>

<h3 id="42-routing-distribution">4.2 Routing Distribution</h3>

<p><img src="/assets/images/inference-sentinel/routing_distribution.png" alt="Routing Distribution" />
<em>Figure 5: Request distribution by route. 42.6% of successfully completed requests were served locally (privacy-sensitive), 57.4% by cloud (non-sensitive); counting the 17 failed requests, 47.5% of all traffic was routed local.</em></p>

<table>
  <thead>
    <tr>
      <th>Route</th>
      <th>Requests</th>
      <th>Percentage</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Local</td>
      <td>95</td>
      <td>47.5%</td>
    </tr>
    <tr>
      <td>Cloud</td>
      <td>105</td>
      <td>52.5%</td>
    </tr>
  </tbody>
</table>

<h3 id="43-backend-cost-analysis">4.3 Backend Cost Analysis</h3>

<table>
  <thead>
    <tr>
      <th>Backend</th>
      <th>Requests</th>
      <th>Total Cost</th>
      <th>Cost/Request</th>
      <th>Cost/1K Tokens</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Anthropic (Claude)</td>
      <td>53</td>
      <td>$0.0826</td>
      <td>$0.00156</td>
      <td>$0.0130</td>
    </tr>
    <tr>
      <td>Google (Gemini)</td>
      <td>52</td>
      <td>$0.00185</td>
      <td>$0.0000355</td>
      <td>$0.00036</td>
    </tr>
    <tr>
      <td>Local (Ollama)</td>
      <td>95</td>
      <td>$0.00</td>
      <td>$0.00</td>
      <td>$0.00</td>
    </tr>
  </tbody>
</table>

<p><img src="/assets/images/inference-sentinel/cost_by_backend.png" alt="Cost by Backend" />
<em>Figure 6: Cost distribution by backend. Anthropic accounts for 97.8% of cloud spend despite handling only 50.5% of cloud requests.</em></p>

<p><strong>Unexpected Finding:</strong> Gemini is <strong>44× cheaper per request</strong> than Claude ($0.0000355 vs $0.00156). This suggests a potential optimization: use Gemini as the primary cloud backend for cost-sensitive workloads, reserving Claude for quality-critical requests.</p>
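<p>The suggested optimization amounts to a one-line routing policy. A sketch using the per-1K-token costs from the table above (the backend names are labels of my choosing, not inference-sentinel configuration keys):</p>

```python
# Observed cost per 1K tokens (Section 4.3)
COST_PER_1K = {"gemini": 0.00036, "claude": 0.0130}

def pick_cloud_backend(quality_critical: bool) -> str:
    """Cheapest backend by default; the premium model only when flagged."""
    return "claude" if quality_critical else "gemini"

def blended_cost_per_1k(premium_share: float) -> float:
    """Expected cloud cost per 1K tokens if `premium_share` of cloud
    traffic is flagged quality-critical."""
    return (premium_share * COST_PER_1K["claude"]
            + (1 - premium_share) * COST_PER_1K["gemini"])
```

<p>Even reserving Claude for 20% of cloud traffic would cut the blended cloud rate from $0.0130 to roughly $0.0029 per 1K tokens.</p>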

<h3 id="44-cost-by-privacy-tier">4.4 Cost by Privacy Tier</h3>

<table>
  <thead>
    <tr>
      <th>Tier</th>
      <th>Requests</th>
      <th>Routed Local</th>
      <th>Cost</th>
      <th>Savings</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0 (PUBLIC)</td>
      <td>55</td>
      <td>0</td>
      <td>$0.0400</td>
      <td>$0.0420</td>
    </tr>
    <tr>
      <td>1 (INTERNAL)</td>
      <td>50</td>
      <td>0</td>
      <td>$0.0444</td>
      <td>$0.0248</td>
    </tr>
    <tr>
      <td>2 (CONFIDENTIAL)</td>
      <td>50</td>
      <td>50</td>
      <td>$0.00</td>
      <td>$0.0575</td>
    </tr>
    <tr>
      <td>3 (RESTRICTED)</td>
      <td>45</td>
      <td>45</td>
      <td>$0.00</td>
      <td>$0.0600</td>
    </tr>
  </tbody>
</table>

<p><strong>Tier 2 and Tier 3 traffic incurs zero marginal cost</strong> after hardware investment. For organizations with significant sensitive data volumes, the ROI calculation favors local inference.</p>

<h3 id="45-projected-annual-savings">4.5 Projected Annual Savings</h3>

<p>Extrapolating from observed cost ratios:</p>

<table>
  <thead>
    <tr>
      <th>Daily Volume</th>
      <th>Annual Cloud-Only</th>
      <th>Annual with Sentinel</th>
      <th>Savings</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1,000 req</td>
      <td>$490</td>
      <td>$154</td>
      <td>$336</td>
    </tr>
    <tr>
      <td>10,000 req</td>
      <td>$4,900</td>
      <td>$1,540</td>
      <td>$3,360</td>
    </tr>
    <tr>
      <td>100,000 req</td>
      <td>$49,000</td>
      <td>$15,400</td>
      <td>$33,600</td>
    </tr>
  </tbody>
</table>

<p><strong>Caveat:</strong> These projections assume similar traffic distribution (47.5% local-eligible) and do not account for hardware depreciation, electricity, or operational overhead.</p>
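<p>The projection is a straight per-request scaling of the observed totals, which the following sketch makes explicit (the constants are the Section 4.1 measurements):</p>

```python
# Observed totals for the 200-request benchmark (Section 4.1)
ACTUAL_COST = 0.0844        # with privacy-aware routing
ALL_CLOUD_COST = 0.2687     # hypothetical all-cloud routing

def annual_costs(daily_requests: int):
    """Scale per-request cost to a year, assuming the same traffic mix
    (~47.5% local-eligible) and ignoring hardware/energy/ops overhead."""
    yearly = daily_requests * 365
    cloud_only = yearly * ALL_CLOUD_COST / 200
    with_sentinel = yearly * ACTUAL_COST / 200
    return cloud_only, with_sentinel, cloud_only - with_sentinel
```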

<hr />

<h2 id="5-experiment-4-controller-effectiveness">5. Experiment 4: Controller Effectiveness</h2>

<p><strong>Research Question:</strong> Does the closed-loop controller generate actionable routing recommendations?</p>

<h3 id="51-results">5.1 Results</h3>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Controller Evaluations</strong></td>
      <td>117</td>
    </tr>
    <tr>
      <td><strong>Recommendations Generated</strong></td>
      <td>0</td>
    </tr>
    <tr>
      <td><strong>Drift Detected</strong></td>
      <td>No</td>
    </tr>
  </tbody>
</table>

<h3 id="52-shadow-mode-metrics">5.2 Shadow Mode Metrics</h3>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Shadow Runs</strong></td>
      <td>340</td>
    </tr>
    <tr>
      <td><strong>Successful Comparisons</strong></td>
      <td>275</td>
    </tr>
    <tr>
      <td><strong>Quality Match Rate</strong></td>
      <td>0%</td>
    </tr>
    <tr>
      <td><strong>Cost Savings Tracked</strong></td>
      <td>$0.15</td>
    </tr>
  </tbody>
</table>

<h3 id="53-analysis">5.3 Analysis</h3>

<p>The controller generated zero recommendations because the <strong>quality match rate was 0%</strong>. This means local model responses (gemma3:4b, mistral) were semantically dissimilar enough from cloud responses (Claude, Gemini) that they never crossed the similarity threshold (default: 85%).</p>

<p><strong>This is informative, not a failure.</strong> The controller correctly identified that:</p>
<ol>
  <li>Local models produce qualitatively different outputs than cloud models</li>
  <li>Promoting Tier 0-1 traffic from cloud to local would degrade response quality</li>
  <li>The conservative default (keep on cloud) is appropriate</li>
</ol>

<h3 id="54-interpretation">5.4 Interpretation</h3>

<p>The 0% quality match rate reflects the capability gap between 4B-parameter local models and frontier cloud models. For tasks where approximate answers suffice, lowering the similarity threshold would generate recommendations:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">controller</span><span class="pi">:</span>
  <span class="na">quality_threshold</span><span class="pi">:</span> <span class="m">0.70</span>  <span class="c1"># Down from 0.85</span>
</code></pre></div></div>

<p>However, this requires explicit acceptance of quality trade-offs — a decision the controller correctly defers to human operators.</p>
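<p>For reference, the quality-match check is a thresholded cosine similarity over response embeddings. A minimal sketch; the function names are mine, not the shadow-mode API:</p>

```python
import math

QUALITY_THRESHOLD = 0.85  # default; 0.70 is the relaxed setting above

def cosine_similarity(a, b):
    """Cosine similarity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def quality_match(local_emb, cloud_emb, threshold=QUALITY_THRESHOLD):
    """Shadow mode counts a match when the local response embedding is
    close enough to the cloud response embedding."""
    return cosine_similarity(local_emb, cloud_emb) >= threshold
```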

<hr />

<h2 id="6-experiment-5-session-stickiness">6. Experiment 5: Session Stickiness</h2>

<p><strong>Research Question:</strong> Does the one-way trapdoor correctly lock sessions after PII detection?</p>

<h3 id="61-results">6.1 Results</h3>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Sessions Tested</strong></td>
      <td>20</td>
    </tr>
    <tr>
      <td><strong>Requests per Session</strong></td>
      <td>10</td>
    </tr>
    <tr>
      <td><strong>PII Probability</strong></td>
      <td>30%</td>
    </tr>
    <tr>
      <td><strong>Sessions Locked</strong></td>
      <td>0</td>
    </tr>
    <tr>
      <td><strong>Trapdoor Violations</strong></td>
      <td>0</td>
    </tr>
  </tbody>
</table>

<h3 id="62-analysis">6.2 Analysis</h3>

<p><strong>Zero sessions were locked</strong> despite 30% of requests containing PII and session tracking being enabled. Examining the test methodology reveals the issue:</p>

<p>The benchmark harness sent requests with simulated IPs (<code class="language-plaintext highlighter-rouge">10.0.0.0</code> through <code class="language-plaintext highlighter-rouge">10.0.0.19</code>), but the session ID computation may not have properly differentiated these synthetic sources. Additionally, the PII detection failures identified in Experiment 1 (the 5 health insurance records) would have prevented those sessions from locking — if PII isn’t detected, the trapdoor isn’t triggered.</p>

<p><strong>Contributing factors:</strong></p>

<ol>
  <li>
    <p><strong>Classification dependency:</strong> Session locking requires Tier 2+ classification. The 10% Tier 3 false negative rate means some PII-containing requests were classified as Tier 0, preventing session locks.</p>
  </li>
  <li>
    <p><strong>Test methodology:</strong> Requests originated from the same physical host with simulated client IPs. The session ID hashing (<code class="language-plaintext highlighter-rouge">SHA-256(client_ip + daily_salt)</code>) should differentiate these, but the harness may need validation.</p>
  </li>
  <li>
    <p><strong>PII probability vs. detection:</strong> The 30% PII probability applies to dataset generation, but if those PII patterns aren’t detected by the classifier, sessions won’t lock.</p>
  </li>
</ol>
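<p>Validating point 2 is cheap enough to fold into the harness. A sketch of the session ID derivation as described, where the exact string encoding and salt construction are assumptions about the implementation:</p>

```python
import hashlib
from datetime import date

def session_id(client_ip: str, daily_salt: str) -> str:
    """SHA-256(client_ip + daily_salt), hex-encoded."""
    return hashlib.sha256((client_ip + daily_salt).encode()).hexdigest()

# The check the harness should run: 20 synthetic IPs -> 20 distinct sessions
salt = date(2026, 3, 21).isoformat()   # illustrative daily salt
ids = {session_id(f"10.0.0.{i}", salt) for i in range(20)}
```

<p>If the set has fewer than 20 entries, the harness, not the gateway, is collapsing sessions.</p>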

<h3 id="63-what-we-can-validate">6.3 What We Can Validate</h3>

<p>Despite no sessions locking, the core privacy invariant held:</p>

<ul>
  <li><strong>Trapdoor violations: 0</strong> — No request with <em>detected</em> PII ever routed to cloud</li>
  <li><strong>Per-request classification: Functional</strong> — Requests that were classified as sensitive routed locally</li>
</ul>

<p>The gap is between “contains PII” (ground truth) and “detected as PII” (classifier output).</p>

<h3 id="64-required-follow-up">6.4 Required Follow-Up</h3>

<p>A proper session stickiness evaluation requires:</p>

<ol>
  <li><strong>Fix classification first:</strong> Address the Tier 3 detection gaps so PII is actually detected</li>
  <li><strong>Validate session ID generation:</strong> Ensure synthetic client IPs produce distinct session IDs</li>
  <li><strong>Use diverse source IPs:</strong> Run from multiple actual hosts or containers</li>
  <li><strong>Add session state logging:</strong> Instrument the session manager to log state transitions</li>
</ol>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Re-run with verified distinct sessions</span>
python <span class="nt">-m</span> benchmarks.harness <span class="nt">--experiment</span> session <span class="nt">--sessions</span> 20 <span class="nt">--verify-session-ids</span>
</code></pre></div></div>

<p>Expected behavior after fixes:</p>
<ul>
  <li>Sessions should lock when Tier 2+ PII is detected</li>
  <li>Subsequent requests in that session should route to local regardless of content</li>
  <li>Locked session count should approximate <code class="language-plaintext highlighter-rouge">sessions × pii_probability × detection_rate</code></li>
</ul>
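<p>That approximation treats PII probability as a per-session quantity. Since the harness injects PII per request (30% across 10 requests per session), a per-request model (my refinement, not the project’s) predicts that nearly every session should lock once detection works:</p>

```python
def expected_locked_simple(sessions, pii_probability, detection_rate):
    # Coarse estimate from the text: sessions x pii_probability x detection_rate
    return sessions * pii_probability * detection_rate

def expected_locked_per_request(sessions, requests_per_session,
                                pii_probability, detection_rate):
    # Refinement (assumes independent requests): a session locks on the
    # first request that both contains PII and is detected.
    p_hit = pii_probability * detection_rate
    return sessions * (1 - (1 - p_hit) ** requests_per_session)
```

<p>With 20 sessions of 10 requests at 30% PII and, say, 90% detection, the per-request model predicts roughly 19 locked sessions versus about 5 from the coarse formula, a useful discriminator when re-running the experiment.</p>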

<hr />

<h2 id="7-discussion">7. Discussion</h2>

<h3 id="71-principal-findings">7.1 Principal Findings</h3>

<table>
  <thead>
    <tr>
      <th>Hypothesis</th>
      <th>Result</th>
      <th>Verdict</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Classification adds minimal latency</td>
      <td>1.69ms overhead</td>
      <td>✅ Confirmed</td>
    </tr>
    <tr>
      <td>Accuracy exceeds 95%</td>
      <td>97.5% overall</td>
      <td>✅ Confirmed</td>
    </tr>
    <tr>
      <td>Local inference is slower</td>
      <td>10× latency penalty</td>
      <td>⚠️ Confirmed (expected)</td>
    </tr>
    <tr>
      <td>Cost savings are significant</td>
      <td>68.6% reduction</td>
      <td>✅ Confirmed</td>
    </tr>
    <tr>
      <td>Controller generates recommendations</td>
      <td>0 recommendations</td>
      <td>⚠️ By design (quality gap)</td>
    </tr>
    <tr>
      <td>Sessions lock on PII detection</td>
      <td>0 sessions locked</td>
      <td>⚠️ Requires improved test methodology</td>
    </tr>
  </tbody>
</table>

<h3 id="72-limitations">7.2 Limitations</h3>

<p><strong>Dataset Size:</strong> 200 prompts is sufficient for detecting large effect sizes but underpowered for rare failure modes. A production evaluation should use 10,000+ samples.</p>

<p><strong>Synthetic Data:</strong> The evaluation dataset was synthetically generated with known patterns. Real-world PII distributions may differ, particularly for domain-specific identifiers.</p>

<p><strong>Single Hardware Configuration:</strong> Results reflect a specific hardware setup (M4 Mac Mini, 16GB). Performance characteristics will vary with different local inference hardware.</p>

<p><strong>NER Model Limitations:</strong> The lightweight BERT NER model (<code class="language-plaintext highlighter-rouge">dslim/bert-base-NER</code>, “fast” mode) failed to detect person names in medical record contexts, revealing domain-specific NER gaps that require either fine-tuning or larger models like RoBERTa.</p>

<p><strong>Session Test Methodology:</strong> The session stickiness experiment used simulated IPs from a single host, and classification failures prevented some PII-containing requests from triggering session locks.</p>

<h3 id="73-threats-to-validity">7.3 Threats to Validity</h3>

<p><strong>Internal Validity:</strong> The 17 routing failures (8.5%) due to memory pressure may have biased latency statistics toward successful (potentially faster) requests.</p>

<p><strong>External Validity:</strong> The balanced tier distribution (25% per tier) does not reflect production traffic, which is typically skewed toward Tier 0-1.</p>

<p><strong>Construct Validity:</strong> Semantic similarity (cosine distance on embeddings) may not capture task-specific quality dimensions relevant to specific use cases.</p>

<hr />

<h2 id="8-conclusion">8. Conclusion</h2>

<h3 id="81-summary-of-contributions">8.1 Summary of Contributions</h3>

<p>This work presents <strong>inference-sentinel</strong>, a privacy-aware LLM routing gateway that addresses a gap in the current MLOps landscape: the ability to enforce data residency policies at inference time without sacrificing developer experience.</p>

<p><strong>What we built:</strong></p>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>Contribution</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Hybrid Classifier</strong></td>
      <td>Two-stage pipeline (regex + NER) achieving 97.5% accuracy at 0.16ms latency — fast enough for real-time routing decisions</td>
    </tr>
    <tr>
      <td><strong>Session Manager</strong></td>
      <td>One-way trapdoor state machine ensuring PII-containing sessions are permanently locked to local inference, with cryptographic session ID hashing</td>
    </tr>
    <tr>
      <td><strong>Context Handoff</strong></td>
      <td>Rolling buffer mechanism preserving conversation continuity during mid-session cloud→local transitions, with optional PII scrubbing</td>
    </tr>
    <tr>
      <td><strong>Backend Manager</strong></td>
      <td>Pluggable selection strategies (priority, round-robin, latency-aware) with automatic health checking and failover</td>
    </tr>
    <tr>
      <td><strong>Shadow Mode</strong></td>
      <td>Non-blocking A/B comparison framework collecting quality metrics without impacting user-facing latency</td>
    </tr>
    <tr>
      <td><strong>Closed-Loop Controller</strong></td>
      <td>Rule-based recommendation engine that observes traffic patterns and suggests routing policy adjustments</td>
    </tr>
    <tr>
      <td><strong>Observability Stack</strong></td>
      <td>Full OpenTelemetry integration with Prometheus metrics, Grafana dashboards, and structured logging</td>
    </tr>
  </tbody>
</table>

<p><strong>What the benchmarks revealed:</strong></p>

<p>The evaluation across 200 synthetic prompts demonstrated that privacy-aware routing is feasible with sub-2ms overhead. The 68.6% cost savings from local routing validates the economic case, while the 10× latency penalty quantifies the privacy-performance trade-off. The systematic Tier 3 failures (health insurance IDs) highlight the importance of domain-specific pattern engineering and robust NER model selection — even with NER enabled, lightweight models may miss entities in specialized contexts.</p>

<p>This is not a production-ready system — it is a <strong>proof of architecture</strong> demonstrating that the building blocks exist and compose correctly.</p>

<hr />

<h3 id="82-future-work">8.2 Future Work</h3>

<h4 id="821-scaling-the-evaluation">8.2.1 Scaling the Evaluation</h4>

<p>The current benchmark uses 200 prompts — sufficient for detecting large effects but underpowered for tail behavior analysis. Future work should include:</p>

<table>
  <thead>
    <tr>
      <th>Benchmark</th>
      <th>Purpose</th>
      <th>Target</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Extended classification</strong></td>
      <td>Rare pattern detection, edge cases</td>
      <td>1,000+ prompts</td>
    </tr>
    <tr>
      <td><strong>Load testing</strong></td>
      <td>Concurrent request handling, memory pressure</td>
      <td>100 req/s sustained</td>
    </tr>
    <tr>
      <td><strong>Adversarial evaluation</strong></td>
      <td>Evasion attempts, prompt injection</td>
      <td>Red team dataset</td>
    </tr>
    <tr>
      <td><strong>Longitudinal study</strong></td>
      <td>Drift detection over weeks of traffic</td>
      <td>Production deployment</td>
    </tr>
  </tbody>
</table>

<p>Statistical power analysis suggests n=1,000+ is required to detect failure modes occurring at &lt;1% frequency with 95% confidence.</p>
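<p>The sample-size intuition can be sketched with a small calculation. Under the simplifying assumption of independent trials, the probability of observing a failure mode of frequency <em>p</em> at least once in <em>n</em> prompts is 1 − (1 − <em>p</em>)<sup><em>n</em></sup>; solving for the smallest <em>n</em> that reaches a given confidence gives:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

def min_samples(p, confidence=0.95):
    """Smallest n such that P(at least one occurrence) reaches `confidence`,
    assuming independent trials with per-trial failure probability p."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))

# 0.01 -&gt; 299, 0.005 -&gt; 598, 0.001 -&gt; 2995
for p in (0.01, 0.005, 0.001):
    print(p, min_samples(p))
</code></pre></div></div>

<p>A 1% failure mode needs roughly 300 prompts to surface at 95% confidence; rarer modes push the requirement well past 1,000 — consistent with the target above.</p>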

<h4 id="822-improving-ner-accuracy">8.2.2 Improving NER Accuracy</h4>

<p>The current implementation uses HuggingFace Transformers with <code class="language-plaintext highlighter-rouge">dslim/bert-base-NER</code> (“fast” mode, ~400MB) for named entity recognition. Several directions merit exploration:</p>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>Size</th>
      <th>Tradeoff</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dslim/bert-base-NER</code> (fast)</td>
      <td>~400MB</td>
      <td>Current default, good speed, limited domain coverage</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Jean-Baptiste/roberta-large-ner-english</code> (accurate)</td>
      <td>~1.3GB</td>
      <td>Higher accuracy, 3-5× latency</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Davlan/bert-base-multilingual-cased-ner-hrl</code> (multilingual)</td>
      <td>~700MB</td>
      <td>Multi-language support</td>
    </tr>
    <tr>
      <td><strong>Fine-tuned NER</strong></td>
      <td>Variable</td>
      <td>Domain-specific entities (PHI, financial identifiers)</td>
    </tr>
    <tr>
      <td><strong>Presidio</strong></td>
      <td>N/A</td>
      <td>Microsoft’s PII detection library, rule + ML hybrid</td>
    </tr>
    <tr>
      <td><strong>GLiNER</strong></td>
      <td>~200MB</td>
      <td>Zero-shot NER, no fine-tuning required</td>
    </tr>
  </tbody>
</table>

<p>The optimal choice depends on latency budget. For sub-50ms classification, the “fast” BERT model with expanded regex patterns may outperform larger models. For offline batch classification, RoBERTa-based NER (“accurate” mode) is viable.</p>
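<p>One way to operationalize this is a small model-selection helper. The sketch below uses the model names from the table; the base latency and the relative-latency multipliers are illustrative assumptions, not measurements:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Candidate NER models (names from the table above); the latency
# multipliers are rough assumptions for illustration only.
NER_MODELS = {
    "fast": ("dslim/bert-base-NER", 1.0),
    "accurate": ("Jean-Baptiste/roberta-large-ner-english", 4.0),
    "multilingual": ("Davlan/bert-base-multilingual-cased-ner-hrl", 2.0),
}

def choose_ner_model(latency_budget_ms, base_latency_ms=30.0, multilingual=False):
    """Pick the most accurate model whose estimated latency fits the budget."""
    if multilingual:
        return NER_MODELS["multilingual"][0]
    for mode in ("accurate", "fast"):
        name, multiplier = NER_MODELS[mode]
        if base_latency_ms * multiplier &lt;= latency_budget_ms:
            return name
    # Nothing fits the budget: fall back to regex-only classification
    return None
</code></pre></div></div>

<p>With a 50ms budget this selects the fast BERT model; only offline batch workloads with generous budgets get the RoBERTa variant.</p>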

<h4 id="823-gpu-accelerated-local-inference">8.2.3 GPU-Accelerated Local Inference</h4>

<p>The current setup runs local models on Apple Silicon (M4 Mac Mini) using Metal acceleration via Ollama. While sufficient for development and low-throughput production, this architecture has limitations:</p>

<table>
  <thead>
    <tr>
      <th>Constraint</th>
      <th>Current</th>
      <th>With GPU</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>VRAM</td>
      <td>16GB unified</td>
      <td>24-80GB dedicated</td>
    </tr>
    <tr>
      <td>Concurrent models</td>
      <td>1-2 (memory pressure)</td>
      <td>3-4+</td>
    </tr>
    <tr>
      <td>Throughput</td>
      <td>~0.1 req/s</td>
      <td>1-10 req/s</td>
    </tr>
    <tr>
      <td>Model size</td>
      <td>4-8B parameters</td>
      <td>13-70B parameters</td>
    </tr>
  </tbody>
</table>

<p><strong>GPU deployment options:</strong></p>

<ol>
  <li><strong>NVIDIA GPU server</strong> (RTX 4090, A100): Run vLLM or TGI for high-throughput local inference</li>
  <li><strong>Cloud GPU instances on a private network</strong>: AWS/GCP instances in a private VPC, so data never leaves controlled infrastructure</li>
  <li><strong>Apple M4 Max/Ultra</strong>: 128GB unified memory enables 70B models with acceptable latency</li>
</ol>

<p>The shadow mode quality metrics would likely improve significantly with larger local models, potentially enabling automatic traffic promotion.</p>

<h4 id="824-kubernetes-deployment">8.2.4 Kubernetes Deployment</h4>

<p>The current Docker Compose stack is suitable for single-node deployment. Production deployment requires:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────────────────────────────────────────────────────────┐
│                    Kubernetes Cluster                       │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │  Sentinel   │  │  Sentinel   │  │  Sentinel   │          │
│  │  Pod (HPA)  │  │  Pod (HPA)  │  │  Pod (HPA)  │          │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘          │
│         └────────────────┼────────────────┘                 │
│                          ▼                                  │
│                 ┌─────────────────┐                         │
│                 │ Redis (Session  │                         │
│                 │    State)       │                         │
│                 └─────────────────┘                         │
│                          │                                  │
│         ┌────────────────┼────────────────┐                 │
│         ▼                ▼                ▼                 │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │   Ollama    │  │   Ollama    │  │    vLLM     │          │
│  │  (gemma3)   │  │  (mistral)  │  │  (llama3)   │          │
│  │   Node 1    │  │   Node 2    │  │  GPU Node   │          │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
└─────────────────────────────────────────────────────────────┘
</code></pre></div></div>

<p><strong>K8s-specific requirements:</strong></p>
<ul>
  <li><strong>Horizontal Pod Autoscaler (HPA)</strong> for gateway pods based on request rate</li>
  <li><strong>Node affinity</strong> for GPU-accelerated inference pods</li>
  <li><strong>PodDisruptionBudget</strong> ensuring availability during rollouts</li>
  <li><strong>NetworkPolicy</strong> restricting egress to approved cloud endpoints</li>
  <li><strong>ServiceMesh</strong> (Istio/Linkerd) for mTLS between components</li>
</ul>

<h4 id="825-session-state-in-memory-vs-persistent-storage">8.2.5 Session State: In-Memory vs. Persistent Storage</h4>

<p>The current architecture stores session state in an in-memory data structure with TTL-based eviction. This is a deliberate design choice:</p>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>Pros</th>
      <th>Cons</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>In-memory (current)</strong></td>
      <td>Zero latency, no external dependencies, automatic cleanup</td>
      <td>Lost on restart, single-node only</td>
    </tr>
    <tr>
      <td><strong>Redis</strong></td>
      <td>Distributed, persistent, TTL support native</td>
      <td>Additional infra, latency (+1-2ms), security surface</td>
    </tr>
    <tr>
      <td><strong>PostgreSQL</strong></td>
      <td>ACID, queryable, audit trail</td>
      <td>Highest latency, schema management, backup complexity</td>
    </tr>
  </tbody>
</table>

<p><strong>Why in-memory is defensible:</strong></p>

<ol>
  <li><strong>Session data is ephemeral by design</strong> — 15-minute TTL means losing state on restart is acceptable for most use cases</li>
  <li><strong>Security through ephemerality</strong> — no persistent store means no data to exfiltrate, no backups to secure, no encryption-at-rest requirements</li>
  <li><strong>Operational simplicity</strong> — no Redis cluster to manage, no connection pooling, no failover logic</li>
  <li><strong>Latency</strong> — a hash-table lookup is O(1) at roughly 0.001ms; Redis adds a network round-trip per operation</li>
</ol>

<p><strong>When to move to Redis:</strong></p>

<ul>
  <li>Multi-pod deployment requiring shared session state</li>
  <li>Session TTL &gt;1 hour (memory pressure)</li>
  <li>Audit requirements mandating session history</li>
  <li>Graceful restart without session loss</li>
</ul>

<p>For the Redis migration path, the interface is already abstracted:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">SessionStore</span><span class="p">(</span><span class="n">Protocol</span><span class="p">):</span>
    <span class="k">async</span> <span class="k">def</span> <span class="nf">get</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">session_id</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Optional</span><span class="p">[</span><span class="n">Session</span><span class="p">]:</span> <span class="p">...</span>
    <span class="k">async</span> <span class="k">def</span> <span class="nf">set</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">session_id</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">session</span><span class="p">:</span> <span class="n">Session</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span> <span class="p">...</span>
    <span class="k">async</span> <span class="k">def</span> <span class="nf">delete</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">session_id</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span> <span class="p">...</span>

<span class="c1"># Swap implementations without changing business logic
</span><span class="n">store</span> <span class="o">=</span> <span class="n">RedisSessionStore</span><span class="p">(</span><span class="n">redis_url</span><span class="p">)</span> <span class="k">if</span> <span class="n">USE_REDIS</span> <span class="k">else</span> <span class="n">InMemorySessionStore</span><span class="p">()</span>
</code></pre></div></div>
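<p>A minimal in-memory implementation of that protocol might look like the sketch below, with lazy TTL eviction on read. The class and parameter names are assumptions; the real gateway's <code class="language-plaintext highlighter-rouge">Session</code> type carries more fields:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time

class InMemorySessionStore:
    """Dict-backed SessionStore with lazy TTL eviction (illustrative sketch)."""

    def __init__(self, ttl_seconds=900):  # 15-minute TTL, as described above
        self._ttl = ttl_seconds
        self._data = {}  # maps session_id to (expires_at, session)

    async def get(self, session_id):
        entry = self._data.get(session_id)
        if entry is None:
            return None
        expires_at, session = entry
        if time.monotonic() &gt;= expires_at:  # expired: evict lazily
            del self._data[session_id]
            return None
        return session

    async def set(self, session_id, session):
        self._data[session_id] = (time.monotonic() + self._ttl, session)

    async def delete(self, session_id):
        self._data.pop(session_id, None)
</code></pre></div></div>

<p>Because eviction happens on read, a long-idle store can hold expired entries; a production version would add a periodic sweep, which is exactly the kind of bookkeeping Redis's native TTL support eliminates.</p>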

<hr />

<h3 id="83-target-applications">8.3 Target Applications</h3>

<p>inference-sentinel addresses use cases across multiple organizational functions:</p>

<h4 id="healthcare--life-sciences">Healthcare &amp; Life Sciences</h4>

<table>
  <thead>
    <tr>
      <th>Use Case</th>
      <th>Privacy Concern</th>
      <th>Sentinel Solution</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Clinical decision support</td>
      <td>PHI in patient queries</td>
      <td>Tier 3 classification for health records</td>
    </tr>
    <tr>
      <td>Drug interaction lookup</td>
      <td>Patient medication history</td>
      <td>Session locking after first PHI exposure</td>
    </tr>
    <tr>
      <td>Medical transcription</td>
      <td>Dictated patient notes</td>
      <td>Local inference for all transcription</td>
    </tr>
  </tbody>
</table>

<p><strong>Regulatory context:</strong> HIPAA requires technical safeguards for PHI. A gateway that provably routes PHI to local-only inference provides auditable compliance.</p>

<h4 id="financial-services">Financial Services</h4>

<table>
  <thead>
    <tr>
      <th>Use Case</th>
      <th>Privacy Concern</th>
      <th>Sentinel Solution</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Customer service chatbots</td>
      <td>Account numbers, SSNs</td>
      <td>Tier 3 detection with immediate local routing</td>
    </tr>
    <tr>
      <td>Fraud analysis</td>
      <td>Transaction patterns</td>
      <td>Shadow mode for quality validation before local promotion</td>
    </tr>
    <tr>
      <td>Document summarization</td>
      <td>Contracts with PII</td>
      <td>Context handoff preserving conversation flow</td>
    </tr>
  </tbody>
</table>

<p><strong>Regulatory context:</strong> PCI-DSS, SOX, and GLBA impose data handling requirements that a privacy-aware gateway can enforce at the infrastructure layer.</p>

<h4 id="legal--professional-services">Legal &amp; Professional Services</h4>

<table>
  <thead>
    <tr>
      <th>Use Case</th>
      <th>Privacy Concern</th>
      <th>Sentinel Solution</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Contract review</td>
      <td>Client confidential information</td>
      <td>Tier 2+ routing for all legal documents</td>
    </tr>
    <tr>
      <td>Legal research</td>
      <td>Case details, party names</td>
      <td>Session stickiness ensuring full conversation stays local</td>
    </tr>
    <tr>
      <td>E-discovery</td>
      <td>Privileged communications</td>
      <td>Mandatory local inference for attorney-client content</td>
    </tr>
  </tbody>
</table>

<h4 id="enterprise-it--security">Enterprise IT &amp; Security</h4>

<table>
  <thead>
    <tr>
      <th>Use Case</th>
      <th>Privacy Concern</th>
      <th>Sentinel Solution</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Code review assistants</td>
      <td>Proprietary source code</td>
      <td>Internal URL and project code detection (Tier 1)</td>
    </tr>
    <tr>
      <td>Security log analysis</td>
      <td>Infrastructure details, credentials</td>
      <td>API key and credential pattern detection (Tier 3)</td>
    </tr>
    <tr>
      <td>Internal knowledge base Q&amp;A</td>
      <td>Employee PII, org structure</td>
      <td>Configurable routing based on data classification</td>
    </tr>
  </tbody>
</table>

<h4 id="human-resources">Human Resources</h4>

<table>
  <thead>
    <tr>
      <th>Use Case</th>
      <th>Privacy Concern</th>
      <th>Sentinel Solution</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Resume screening</td>
      <td>Candidate PII</td>
      <td>Local inference for all recruitment workflows</td>
    </tr>
    <tr>
      <td>Employee feedback analysis</td>
      <td>Performance data</td>
      <td>Session locking after employee identifier detection</td>
    </tr>
    <tr>
      <td>Compensation benchmarking</td>
      <td>Salary information</td>
      <td>Tier 3 classification for financial PII</td>
    </tr>
  </tbody>
</table>

<hr />

<h3 id="84-broader-impact">8.4 Broader Impact</h3>

<p>The proliferation of LLM-powered applications creates a tension between capability and privacy. Cloud-hosted models offer superior performance but require transmitting potentially sensitive data to third parties. Local models preserve privacy but sacrifice quality and increase operational burden.</p>

<p>inference-sentinel demonstrates that <strong>this is not a binary choice</strong>. By making routing decisions at inference time based on content classification, organizations can:</p>

<ol>
  <li><strong>Use cloud models for the 50%+ of traffic that contains no sensitive data</strong> — capturing the quality and cost benefits</li>
  <li><strong>Enforce local-only inference for genuinely sensitive content</strong> — preserving privacy guarantees</li>
  <li><strong>Measure the trade-off empirically</strong> — shadow mode quantifies exactly what quality you sacrifice for privacy</li>
</ol>

<p>This is a fundamentally different approach from “all cloud” or “all local” architectures. It treats privacy as a first-class routing dimension, alongside latency and cost.</p>

<hr />

<h3 id="85-closing-thoughts">8.5 Closing Thoughts</h3>

<p>Building inference-sentinel during my job search taught me more about LLM infrastructure than any tutorial could. Debugging why Ollama returns 404 (model name mismatch), why Grafana dashboards show “Value” instead of model names (missing metric labels), why round-robin wasn’t working (YAML override precedence) — these are the unglamorous details that separate working systems from prototypes.</p>

<p>The code is open source. The architecture is documented. The benchmarks are reproducible.</p>

<p>If you’re building LLM applications that handle sensitive data, I hope this work provides a useful reference — or at least saves you from repeating my mistakes.</p>

<hr />

<h2 id="appendix-a-raw-data">Appendix A: Raw Data</h2>

<h3 id="a1-classification-misclassifications">A.1 Classification Misclassifications</h3>

<p>All 5 errors follow the pattern:</p>
<ul>
  <li><strong>Input:</strong> Health record with insurance ID format <code class="language-plaintext highlighter-rouge">[A-Z]{3}\d{9}</code></li>
  <li><strong>Expected entities:</strong> <code class="language-plaintext highlighter-rouge">PERSON_NAME</code> (not detected by BERT NER in medical context)</li>
  <li><strong>Detected entities:</strong> <code class="language-plaintext highlighter-rouge">[]</code></li>
</ul>
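<p>Since the misses are systematic, the cheapest fix is a domain-specific regex fast-path rather than a heavier NER model. A hypothetical Tier 3 pattern for this format (the pattern name is mine, not from the codebase):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re

# Hypothetical Tier 3 addition: health insurance member IDs in the
# three-letters-plus-nine-digits format the NER model missed.
INSURANCE_ID = re.compile(r'\b[A-Z]{3}\d{9}\b')

def has_insurance_id(text):
    return INSURANCE_ID.search(text) is not None
</code></pre></div></div>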

<h3 id="a2-routing-errors">A.2 Routing Errors</h3>

<p>All 17 errors: <code class="language-plaintext highlighter-rouge">HTTP 503: No healthy local backends available</code></p>

<p>Distribution: 8 Tier 2, 9 Tier 3 (local-routed traffic only)</p>

<h3 id="a3-entity-detection-summary">A.3 Entity Detection Summary</h3>

<table>
  <thead>
    <tr>
      <th>Entity Type</th>
      <th>Count</th>
      <th>Tier Assignment</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>SSN</td>
      <td>24</td>
      <td>3</td>
    </tr>
    <tr>
      <td>Credit Card</td>
      <td>12</td>
      <td>3</td>
    </tr>
    <tr>
      <td>Health Record (MRN)</td>
      <td>6</td>
      <td>3</td>
    </tr>
    <tr>
      <td>Bank Account</td>
      <td>6</td>
      <td>3</td>
    </tr>
    <tr>
      <td>Email</td>
      <td>18</td>
      <td>2</td>
    </tr>
    <tr>
      <td>Phone</td>
      <td>14</td>
      <td>2</td>
    </tr>
    <tr>
      <td>Address</td>
      <td>16</td>
      <td>2</td>
    </tr>
    <tr>
      <td>Internal URL</td>
      <td>28</td>
      <td>1</td>
    </tr>
    <tr>
      <td>Project Code</td>
      <td>12</td>
      <td>1</td>
    </tr>
    <tr>
      <td>Employee ID</td>
      <td>10</td>
      <td>1</td>
    </tr>
  </tbody>
</table>

<hr />

<p><strong>GitHub:</strong> <a href="https://github.com/kraghavan/inference-sentinel">github.com/kraghavan/inference-sentinel</a></p>

<p><em>Questions, feedback, or collaboration ideas? Connect on <a href="https://linkedin.com/in/karthikaraghavan">LinkedIn</a>.</em></p>

<p><strong>Last updated:</strong> March 24, 2026</p>]]></content><author><name>Karthika Raghavan</name></author><category term="llm" /><category term="infrastructure" /><category term="benchmarks" /><category term="python" /><category term="privacy" /><category term="observability" /><category term="distributed-systems" /><category term="evaluation" /><summary type="html"><![CDATA[Part 2 of 2: Empirical evaluation of classification accuracy, routing performance, and cost attribution — with honest analysis of failure modes]]></summary></entry><entry><title type="html">Building a Privacy-Aware LLM Gateway: Architecture Deep-Dive</title><link href="https://kraghavan.github.io/llm/infrastructure/smart%20gateway/python/2026/03/20/inference-sentinel-architecture2.html" rel="alternate" type="text/html" title="Building a Privacy-Aware LLM Gateway: Architecture Deep-Dive" /><published>2026-03-20T00:00:00+00:00</published><updated>2026-03-20T00:00:00+00:00</updated><id>https://kraghavan.github.io/llm/infrastructure/smart%20gateway/python/2026/03/20/inference-sentinel-architecture2</id><content type="html" xml:base="https://kraghavan.github.io/llm/infrastructure/smart%20gateway/python/2026/03/20/inference-sentinel-architecture2.html"><![CDATA[<h1 id="building-a-privacy-aware-llm-gateway-architecture-deep-dive">Building a Privacy-Aware LLM Gateway: Architecture Deep-Dive</h1>

<p><em>Part 1 of 2: Design decisions, trade-offs, and lessons from building inference-sentinel</em></p>

<hr />

<h2 id="why-i-built-this">Why I Built This</h2>

<p>During my job search, I wanted a project that would demonstrate distributed systems thinking — not just “I can call an API,” but “I can design a system that handles real production concerns.”</p>

<p>The problem I chose: <strong>How do you route LLM prompts intelligently based on data sensitivity, while maintaining session continuity and observability?</strong></p>

<p>This post walks through every architectural decision I made building <a href="https://github.com/kraghavan/inference-sentinel">inference-sentinel</a>, including the trade-offs I considered and the mistakes I made along the way.</p>

<hr />

<h2 id="system-overview">System Overview</h2>

<h3 id="design-diagram-generated-by-notebookllm">Design Diagram Generated by NotebookLM</h3>
<p><img src="/assets/images/inference-sentinel/Inference-Sentinel.png" alt="Inference Sentinel Architecture" /> <em>The Inference Sentinel architecture: privacy-aware routing with session stickiness and closed-loop control</em></p>

<h3 id="components">Components</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────────────────────────────────────────────────────────────────────────┐
│                              Application Layer                              │
│                         (Any OpenAI-compatible client)                      │
└─────────────────────────────────────┬───────────────────────────────────────┘
                                      │ POST /v1/inference
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                            inference-sentinel                               │
│                                                                             │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                         Request Pipeline                             │   │
│  │                                                                      │   │
│  │  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐               │   │
│  │  │   Hybrid    │    │   Session   │    │   Backend   │               │   │
│  │  │ Classifier  │───▶│   Manager   │───▶│   Manager   │               │   │
│  │  │             │    │             │    │             │               │   │
│  │  │ • Regex     │    │ • Trapdoor  │    │ • Selection │               │   │
│  │  │ • NER       │    │ • Buffer    │    │ • Failover  │               │   │
│  │  │ • Tier 0-3  │    │ • Handoff   │    │ • Health    │               │   │
│  │  └─────────────┘    └─────────────┘    └─────────────┘               │   │
│  │         │                  │                  │                      │   │
│  │         ▼                  ▼                  ▼                      │   │
│  │  ┌─────────────────────────────────────────────────────────────────┐ │   │
│  │  │                    Telemetry Collector                          │ │   │
│  │  │         Metrics │ Traces │ Logs (OpenTelemetry-native)          │ │   │
│  │  └─────────────────────────────────────────────────────────────────┘ │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                      Background Services                             │   │
│  │                                                                      │   │
│  │  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐               │   │
│  │  │   Shadow    │    │   Closed    │    │   Health    │               │   │
│  │  │   Mode      │    │   Loop      │    │   Check     │               │   │
│  │  │             │    │ Controller  │    │   Loop      │               │   │
│  │  │ • A/B test  │    │             │    │             │               │   │
│  │  │ • Compare   │    │ • Evaluate  │    │ • Poll      │               │   │
│  │  │ • Metrics   │    │ • Recommend │    │ • Failover  │               │   │
│  │  └─────────────┘    └─────────────┘    └─────────────┘               │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                    ┌─────────────────┴─────────────────┐
                    ▼                                   ▼
        ┌─────────────────────┐             ┌─────────────────────┐
        │    Local Backends   │             │   Cloud Backends    │
        │                     │             │                     │
        │  ┌───────────────┐  │             │  ┌───────────────┐  │
        │  │ Ollama        │  │             │  │ Anthropic     │  │
        │  │ • gemma3:4b   │  │             │  │ • Claude      │  │
        │  │ • mistral     │  │             │  │   Sonnet 4    │  │
        │  └───────────────┘  │             │  └───────────────┘  │
        │                     │             │  ┌───────────────┐  │
        │  (Round-robin at    │             │  │ Google        │  │
        │   equal priority)   │             │  │ • Gemini      │  │
        │                     │             │  │   2.0 Flash   │  │
        └─────────────────────┘             │  └───────────────┘  │
                                            │                     │
                                            │  (Round-robin or    │
                                            │   primary/fallback) │
                                            └─────────────────────┘
</code></pre></div></div>

<hr />

<h2 id="component-1-the-hybrid-classifier">Component 1: The Hybrid Classifier</h2>

<h3 id="the-problem">The Problem</h3>

<p>I needed to classify prompts into privacy tiers quickly and accurately. The obvious approaches each have limitations:</p>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>Latency</th>
      <th>Accuracy</th>
      <th>Limitations</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Regex only</td>
      <td>&lt;1ms</td>
      <td>~85%</td>
      <td>Misses context, false negatives</td>
    </tr>
    <tr>
      <td>LLM-based</td>
      <td>200-500ms</td>
      <td>~95%</td>
      <td>Too slow for a gateway</td>
    </tr>
    <tr>
      <td>NER only</td>
      <td>15-50ms</td>
      <td>~90%</td>
      <td>Heavy for simple patterns</td>
    </tr>
  </tbody>
</table>

<h3 id="my-solution-two-stage-pipeline">My Solution: Two-Stage Pipeline</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">HybridClassifier</span><span class="p">:</span>
    <span class="s">"""
    Stage 1: Fast-path regex for obvious patterns
    Stage 2: NER for context-dependent detection (optional)
    """</span>
    
    <span class="k">async</span> <span class="k">def</span> <span class="nf">classify</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">HybridResult</span><span class="p">:</span>
        <span class="c1"># Stage 1: Regex (always runs, ~0.2ms)
</span>        <span class="n">regex_result</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_regex</span><span class="p">.</span><span class="n">classify</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
        
        <span class="c1"># Tier 3 detected by regex — skip NER, route local immediately
</span>        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">skip_ner_on_tier3</span> <span class="ow">and</span> <span class="n">regex_result</span><span class="p">.</span><span class="n">tier</span> <span class="o">&gt;=</span> <span class="mi">3</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">HybridResult</span><span class="p">(</span>
                <span class="n">tier</span><span class="o">=</span><span class="n">regex_result</span><span class="p">.</span><span class="n">tier</span><span class="p">,</span>
                <span class="n">detection_method</span><span class="o">=</span><span class="s">"regex"</span><span class="p">,</span>
            <span class="p">)</span>
        
        <span class="c1"># Check if NER should run
</span>        <span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="p">.</span><span class="n">_should_run_ner</span><span class="p">(</span><span class="n">regex_result</span><span class="p">.</span><span class="n">tier</span><span class="p">):</span>
            <span class="k">return</span> <span class="n">HybridResult</span><span class="p">(</span>
                <span class="n">tier</span><span class="o">=</span><span class="n">regex_result</span><span class="p">.</span><span class="n">tier</span><span class="p">,</span>
                <span class="n">detection_method</span><span class="o">=</span><span class="s">"regex"</span><span class="p">,</span>
            <span class="p">)</span>
        
        <span class="c1"># Stage 2: NER for additional context (~15-30ms on CPU)
</span>        <span class="n">ner_result</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="p">.</span><span class="n">_ner</span><span class="p">.</span><span class="n">classify</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
        
        <span class="c1"># Merge entities from both, take highest tier
</span>        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">_merge_results</span><span class="p">(</span><span class="n">regex_result</span><span class="p">,</span> <span class="n">ner_result</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="design-decision-why-regex-first">Design Decision: Why Regex First?</h3>

<p><strong>Trade-off considered:</strong> I could run NER on everything for maximum accuracy, but:</p>

<ol>
  <li><strong>Latency budget:</strong> A gateway adds latency to every request. I targeted &lt;5ms for classification on the fast path (regex-only).</li>
  <li><strong>Resource cost:</strong> The NER classifier uses HuggingFace Transformers with BERT-based models (~400MB for the “fast” model, ~1.3GB for “accurate”). For high-throughput scenarios, this matters.</li>
  <li><strong>Diminishing returns:</strong> Regex catches 70%+ of Tier 3 patterns (SSN, credit cards, API keys) with near-zero cost.</li>
</ol>

<p><strong>The insight:</strong> Regex handles the “obvious” cases; NER handles the “subtle” cases (person names, organizations). Running both in parallel would be faster but would waste resources when regex already found Tier 3 data.</p>
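<p>The <code class="language-plaintext highlighter-rouge">_merge_results</code> step referenced above reduces to a most-restrictive-wins rule: take the highest tier and the union of detected entities. A sketch, using assumed lightweight result types in place of the real classes:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataclasses import dataclass, field

@dataclass
class StageResult:  # assumed shape; the real result classes carry more fields
    tier: int
    entities: frozenset = field(default_factory=frozenset)

def merge_results(regex_result, ner_result):
    """Most-restrictive-wins: highest tier, union of detected entities."""
    return StageResult(
        tier=max(regex_result.tier, ner_result.tier),
        entities=regex_result.entities | ner_result.entities,
    )
</code></pre></div></div>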

<h3 id="the-4-tier-taxonomy">The 4-Tier Taxonomy</h3>

<table>
  <thead>
    <tr>
      <th>Tier</th>
      <th>Name</th>
      <th>Detection Method</th>
      <th>Examples</th>
      <th>Routing</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0</td>
      <td>PUBLIC</td>
      <td>Default (no matches)</td>
      <td>“Explain quantum computing”</td>
      <td>Cloud</td>
    </tr>
    <tr>
      <td>1</td>
      <td>INTERNAL</td>
      <td>Regex patterns</td>
      <td>Internal URLs, project codes</td>
      <td>Cloud</td>
    </tr>
    <tr>
      <td>2</td>
      <td>CONFIDENTIAL</td>
      <td>NER + Regex</td>
      <td>Person names (NER), emails, phones (regex)</td>
      <td>Local (configurable)</td>
    </tr>
    <tr>
      <td>3</td>
      <td>RESTRICTED</td>
      <td>Regex patterns</td>
      <td>SSN, credit cards, health data</td>
      <td>Local (enforced)</td>
    </tr>
  </tbody>
</table>
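<p>The taxonomy maps naturally onto an <code class="language-plaintext highlighter-rouge">IntEnum</code> plus a routing rule. A sketch of how the table's routing column might be encoded (the flag name and Tier 2 default are assumptions):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from enum import IntEnum

class PrivacyTier(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3

def route(tier, tier2_local=True):
    """Return "local" or "cloud" per the table: Tier 2 is configurable,
    Tier 3 is always enforced local."""
    if tier &gt;= PrivacyTier.RESTRICTED:
        return "local"
    if tier == PrivacyTier.CONFIDENTIAL:
        return "local" if tier2_local else "cloud"
    return "cloud"
</code></pre></div></div>

<p>Using an <code class="language-plaintext highlighter-rouge">IntEnum</code> keeps tiers ordered, so "most restrictive wins" comparisons like <code class="language-plaintext highlighter-rouge">max(a, b)</code> work directly.</p>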

<h3 id="design-decision-why-4-tiers">Design Decision: Why 4 Tiers?</h3>

<p><strong>Trade-off considered:</strong> A simpler scheme would be binary (sensitive/not-sensitive); a more granular one might use 10+ categories.</p>

<p><strong>Why 4:</strong></p>
<ol>
  <li>Maps to common enterprise data classification schemes (Public/Internal/Confidential/Restricted)</li>
  <li>Allows nuanced routing rules (Tier 2 is configurable, Tier 3 is enforced)</li>
  <li>Shadow mode can target specific tiers for A/B testing</li>
  <li>Simple enough to reason about, granular enough to be useful</li>
</ol>

<h3 id="regex-pattern-design">Regex Pattern Design</h3>

<p>Patterns are loaded from <code class="language-plaintext highlighter-rouge">config/privacy_taxonomy.yaml</code>, allowing customization without code changes:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Example Tier 3 patterns (loaded from YAML at startup)
</span><span class="n">TIER_3_PATTERNS</span> <span class="o">=</span> <span class="p">{</span>
    <span class="c1"># Social Security Numbers (various formats)
</span>    <span class="s">"ssn"</span><span class="p">:</span> <span class="sa">r</span><span class="s">'\b\d{3}-\d{2}-\d{4}\b'</span><span class="p">,</span>
    <span class="s">"ssn_spoken"</span><span class="p">:</span> <span class="sa">r</span><span class="s">'\b(?:social|ssn)[\s:]*\d{3}[\s-]?\d{2}[\s-]?\d{4}\b'</span><span class="p">,</span>
    
    <span class="c1"># Credit Cards (major issuer prefixes — regex alone can't verify the Luhn checksum)
</span>    <span class="s">"credit_card"</span><span class="p">:</span> <span class="sa">r</span><span class="s">'\b(?:4\d{3}|5[1-5]\d{2}|3[47]\d{2}|6(?:011|5\d{2}))'</span>
                   <span class="sa">r</span><span class="s">'[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b'</span><span class="p">,</span>
    
    <span class="c1"># API Keys (common prefixes)
</span>    <span class="s">"api_key"</span><span class="p">:</span> <span class="sa">r</span><span class="s">'\b(?:sk-|pk_|api[_-]?key[_-]?)[a-zA-Z0-9]{20,}\b'</span><span class="p">,</span>
    
    <span class="c1"># Health identifiers
</span>    <span class="s">"medical_record"</span><span class="p">:</span> <span class="sa">r</span><span class="s">'\b(?:MRN|patient[\s-]?id)[\s:]*[A-Z0-9]{6,}\b'</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>

<p><strong>Lesson learned:</strong> The regex <code class="language-plaintext highlighter-rouge">\b</code> word boundary is essential. Without it, <code class="language-plaintext highlighter-rouge">123-45-6789</code> matches inside <code class="language-plaintext highlighter-rouge">abc123-45-6789xyz</code>, causing false positives.</p>
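The boundary behavior is easy to verify in isolation (the "ticket ref" string is a made-up example):

```python
import re

# Word-boundary check with the SSN pattern from the taxonomy above
ssn_bounded = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
ssn_unbounded = re.compile(r"\d{3}-\d{2}-\d{4}")

text = "ticket ref abc123-45-6789xyz"   # an internal ID, not an SSN

assert ssn_unbounded.search(text) is not None   # false positive without \b
assert ssn_bounded.search(text) is None         # \b suppresses it
assert ssn_bounded.search("SSN: 123-45-6789") is not None  # real SSNs still match
```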

<hr />

<h2 id="component-2-session-manager-the-one-way-trapdoor">Component 2: Session Manager (The One-Way Trapdoor)</h2>

<h3 id="the-problem-1">The Problem</h3>

<p>Per-request classification isn’t enough. Consider this conversation:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Turn 1: "Help me draft a cover letter"        → Tier 0 → Cloud ✓
Turn 2: "Here's my resume"                     → Tier 1 → Cloud ✓
Turn 3: "My SSN is 123-45-6789"               → Tier 3 → Local ✓
Turn 4: "Actually, format that differently"   → Tier 0 → Cloud? ❌
</code></pre></div></div>

<p>Turn 4 references Turn 3’s SSN implicitly. If we route it to cloud, we’ve leaked context.</p>

<h3 id="my-solution-one-way-state-machine">My Solution: One-Way State Machine</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                    ┌─────────────────────┐
                    │                     │
        ┌──────────▶│   CLOUD_ELIGIBLE    │
        │           │                     │
        │           └──────────┬──────────┘
        │                      │
        │         Tier 2/3 detected in any turn
        │                      │
        │                      ▼
        │           ┌─────────────────────┐
   NEVER            │                     │
        └───────────┤    LOCAL_LOCKED     │
                    │                     │
                    └─────────────────────┘
</code></pre></div></div>

<p><strong>Key property:</strong> The transition is irreversible. Once a session is <code class="language-plaintext highlighter-rouge">LOCAL_LOCKED</code>, no subsequent classification — even Tier 0 — can unlock it.</p>
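Condensed to its essence, the trapdoor is a few lines of state logic. This is a sketch with metadata fields dropped, not the full implementation:

```python
from dataclasses import dataclass
from enum import Enum

class SessionState(str, Enum):
    CLOUD_ELIGIBLE = "cloud_eligible"
    LOCAL_LOCKED = "local_locked"

@dataclass
class Session:
    state: SessionState = SessionState.CLOUD_ELIGIBLE

    def observe(self, tier: int, threshold: int = 2) -> None:
        if tier >= threshold:               # lock on Tier 2+
            self.state = SessionState.LOCAL_LOCKED
        # deliberately no branch that sets state back to CLOUD_ELIGIBLE

s = Session()
s.observe(0)
assert s.state is SessionState.CLOUD_ELIGIBLE   # Tier 0: still eligible
s.observe(3)
assert s.state is SessionState.LOCAL_LOCKED     # Tier 3: trapdoor fires
s.observe(0)
assert s.state is SessionState.LOCAL_LOCKED     # later Tier 0: stays locked
```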

<h3 id="implementation">Implementation</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">SessionState</span><span class="p">(</span><span class="nb">str</span><span class="p">,</span> <span class="n">Enum</span><span class="p">):</span>
    <span class="s">"""Session routing state."""</span>
    <span class="n">CLOUD_ELIGIBLE</span> <span class="o">=</span> <span class="s">"cloud_eligible"</span>  <span class="c1"># Can route to cloud
</span>    <span class="n">LOCAL_LOCKED</span> <span class="o">=</span> <span class="s">"local_locked"</span>      <span class="c1"># Must route to local (PII detected)
</span>

<span class="o">@</span><span class="n">dataclass</span>
<span class="k">class</span> <span class="nc">SessionInfo</span><span class="p">:</span>
    <span class="s">"""Session metadata and state."""</span>
    
    <span class="n">session_id</span><span class="p">:</span> <span class="nb">str</span>  <span class="c1"># SHA-256 hash of client IP + daily salt
</span>    <span class="n">state</span><span class="p">:</span> <span class="n">SessionState</span> <span class="o">=</span> <span class="n">SessionState</span><span class="p">.</span><span class="n">CLOUD_ELIGIBLE</span>
    <span class="n">created_at</span><span class="p">:</span> <span class="n">datetime</span> <span class="o">=</span> <span class="n">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="k">lambda</span><span class="p">:</span> <span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">(</span><span class="n">timezone</span><span class="p">.</span><span class="n">utc</span><span class="p">))</span>
    <span class="n">last_activity</span><span class="p">:</span> <span class="n">datetime</span> <span class="o">=</span> <span class="n">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="k">lambda</span><span class="p">:</span> <span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">(</span><span class="n">timezone</span><span class="p">.</span><span class="n">utc</span><span class="p">))</span>
    
    <span class="c1"># Lock metadata
</span>    <span class="n">local_locked_at</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">datetime</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span>
    <span class="n">lock_trigger_tier</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span>
    <span class="n">lock_trigger_entities</span><span class="p">:</span> <span class="nb">list</span> <span class="o">=</span> <span class="n">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="nb">list</span><span class="p">)</span>
    
    <span class="c1"># Backend stickiness (maintains round-robin consistency within session)
</span>    <span class="n">cloud_backend</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span>   <span class="c1"># "anthropic" or "google"
</span>    <span class="n">local_backend</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span>   <span class="c1"># "gemma" or "mistral"
</span>    
    <span class="k">def</span> <span class="nf">lock_to_local</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">tier</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">entities</span><span class="p">:</span> <span class="nb">list</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="s">"""Lock session to local routing (one-way trapdoor)."""</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">state</span> <span class="o">==</span> <span class="n">SessionState</span><span class="p">.</span><span class="n">LOCAL_LOCKED</span><span class="p">:</span>
            <span class="k">return</span>  <span class="c1"># Already locked — no-op
</span>        
        <span class="bp">self</span><span class="p">.</span><span class="n">state</span> <span class="o">=</span> <span class="n">SessionState</span><span class="p">.</span><span class="n">LOCAL_LOCKED</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">local_locked_at</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">(</span><span class="n">timezone</span><span class="p">.</span><span class="n">utc</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">lock_trigger_tier</span> <span class="o">=</span> <span class="n">tier</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">lock_trigger_entities</span> <span class="o">=</span> <span class="n">entities</span><span class="p">.</span><span class="n">copy</span><span class="p">()</span>
    
    <span class="o">@</span><span class="n">property</span>
    <span class="k">def</span> <span class="nf">is_local_locked</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
        <span class="s">"""Whether the one-way trapdoor has already fired."""</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">state</span> <span class="o">==</span> <span class="n">SessionState</span><span class="p">.</span><span class="n">LOCAL_LOCKED</span>


<span class="k">class</span> <span class="nc">SessionManager</span><span class="p">:</span>
    <span class="s">"""Manages session state with TTL expiration."""</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">ttl_seconds</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">900</span><span class="p">,</span>           <span class="c1"># 15 minutes default
</span>        <span class="n">lock_threshold_tier</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">2</span><span class="p">,</span>      <span class="c1"># Tier 2+ triggers lock
</span>        <span class="n">buffer_max_turns</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">5</span><span class="p">,</span>
        <span class="n">buffer_max_chars</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">4000</span><span class="p">,</span>
    <span class="p">):</span>
        <span class="c1"># Sessions and buffers stored in separate TTL caches
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">_sessions</span><span class="p">:</span> <span class="n">TTLCache</span> <span class="o">=</span> <span class="n">TTLCache</span><span class="p">(</span><span class="n">maxsize</span><span class="o">=</span><span class="mi">10000</span><span class="p">,</span> <span class="n">ttl</span><span class="o">=</span><span class="n">ttl_seconds</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_buffers</span><span class="p">:</span> <span class="n">TTLCache</span> <span class="o">=</span> <span class="n">TTLCache</span><span class="p">(</span><span class="n">maxsize</span><span class="o">=</span><span class="mi">10000</span><span class="p">,</span> <span class="n">ttl</span><span class="o">=</span><span class="n">ttl_seconds</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_lock_threshold_tier</span> <span class="o">=</span> <span class="n">lock_threshold_tier</span>
    
    <span class="k">async</span> <span class="k">def</span> <span class="nf">update_session_state_async</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span> 
        <span class="n">client_ip</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> 
        <span class="n">tier</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
        <span class="n">entities</span><span class="p">:</span> <span class="nb">list</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">SessionInfo</span><span class="p">:</span>
        <span class="s">"""Update session state based on classification result.
        
        Applies the one-way trapdoor: if tier &gt;= threshold, locks to local.
        """</span>
        <span class="n">session</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="p">.</span><span class="n">get_or_create_session_async</span><span class="p">(</span><span class="n">client_ip</span><span class="p">)</span>
        
        <span class="c1"># Check if this classification triggers a lock
</span>        <span class="k">if</span> <span class="n">tier</span> <span class="o">&gt;=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_lock_threshold_tier</span> <span class="ow">and</span> <span class="ow">not</span> <span class="n">session</span><span class="p">.</span><span class="n">is_local_locked</span><span class="p">:</span>
            <span class="n">session</span><span class="p">.</span><span class="n">lock_to_local</span><span class="p">(</span><span class="n">tier</span><span class="p">,</span> <span class="n">entities</span><span class="p">)</span>
        
        <span class="k">return</span> <span class="n">session</span>
</code></pre></div></div>

<h3 id="design-decision-why-hash-session-ids-with-daily-salt">Design Decision: Why Hash Session IDs with Daily Salt?</h3>

<p><strong>Trade-off considered:</strong> Store client IPs as-is for debuggability vs. hash them for privacy.</p>

<p><strong>Why hash with rotating salt:</strong></p>
<ol>
  <li>Client IPs are PII — storing them raw creates compliance risk</li>
  <li>Daily salt rotation prevents cross-day user tracking</li>
  <li>If the session store is compromised, hashed IDs reveal nothing</li>
  <li>Lookup is O(1) either way with a good hash</li>
</ol>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">DailySalt</span><span class="p">:</span>
    <span class="s">"""Manages daily-rotating salt for session ID generation."""</span>
    
    <span class="k">def</span> <span class="nf">hash_with_salt</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">value</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="s">"""Hash a value with current salt."""</span>
        <span class="n">combined</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">value</span><span class="si">}</span><span class="s">:</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">_current_salt</span><span class="si">}</span><span class="s">"</span>
        <span class="k">return</span> <span class="n">hashlib</span><span class="p">.</span><span class="n">sha256</span><span class="p">(</span><span class="n">combined</span><span class="p">.</span><span class="n">encode</span><span class="p">()).</span><span class="n">hexdigest</span><span class="p">()</span>


<span class="k">def</span> <span class="nf">generate_session_id</span><span class="p">(</span><span class="n">client_ip</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="s">"""Generate session ID from client IP with daily-rotating salt."""</span>
    <span class="n">salt</span> <span class="o">=</span> <span class="n">get_daily_salt</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">salt</span><span class="p">.</span><span class="n">hash_with_salt</span><span class="p">(</span><span class="n">client_ip</span><span class="p">)</span>
</code></pre></div></div>
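The three properties claimed above can be checked directly. The IP and salt values here are illustrative, and the salt is passed explicitly where the real code calls `get_daily_salt()`:

```python
import hashlib

def session_id(client_ip: str, salt: str) -> str:
    # Same scheme as DailySalt.hash_with_salt: SHA-256 over "ip:salt"
    return hashlib.sha256(f"{client_ip}:{salt}".encode()).hexdigest()

today = session_id("203.0.113.7", "salt-2026-04-19")
same_day = session_id("203.0.113.7", "salt-2026-04-19")
next_day = session_id("203.0.113.7", "salt-2026-04-20")

assert today == same_day           # stable within a day → O(1) lookups work
assert today != next_day           # rotation breaks cross-day tracking
assert "203.0.113.7" not in today  # the raw IP never appears in storage
```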

<h3 id="design-decision-ttl-and-eviction">Design Decision: TTL and Eviction</h3>

<p><strong>Trade-off considered:</strong> Long TTL preserves sessions across browser refreshes. Short TTL reduces memory footprint.</p>

<p><strong>My choice:</strong> 15 minutes (900 seconds), configurable.</p>

<p><strong>Reasoning:</strong></p>
<ul>
  <li>Most conversations complete within 15 minutes</li>
  <li>Matches typical “idle timeout” for sensitive applications</li>
  <li>Memory overhead: ~2KB per session × 10K sessions = 20MB (acceptable)</li>
  <li>Uses <code class="language-plaintext highlighter-rouge">TTLCache</code> from <code class="language-plaintext highlighter-rouge">cachetools</code> for automatic expiration</li>
</ul>
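The expiry semantics `TTLCache` provides can be sketched in a few lines of stdlib Python. This is a toy stand-in, not how `cachetools` is implemented; a fake clock replaces `time.monotonic()` so the example runs instantly:

```python
class TinyTTLCache:
    """Toy TTL cache: entries vanish once their age exceeds the TTL."""

    def __init__(self, ttl, clock):
        self._ttl, self._clock, self._store = ttl, clock, {}

    def __setitem__(self, key, value):
        self._store[key] = (value, self._clock())

    def get(self, key):
        item = self._store.get(key)
        if item is None:
            return None
        value, stored_at = item
        if self._clock() - stored_at > self._ttl:
            del self._store[key]   # lazy eviction on read
            return None
        return value


class FakeClock:
    def __init__(self):
        self.t = 0.0

    def __call__(self):
        return self.t


clock = FakeClock()
cache = TinyTTLCache(ttl=900, clock=clock)   # same 15-minute TTL as above
cache["session-abc"] = "cloud_eligible"

clock.t = 899
assert cache.get("session-abc") == "cloud_eligible"   # still within TTL

clock.t = 901
assert cache.get("session-abc") is None               # expired and evicted
```

Note the security angle: expiry also means a `LOCAL_LOCKED` session eventually resets, which is fine because a fresh session starts with no sensitive context.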

<hr />

<h2 id="component-3-context-handoff">Component 3: Context Handoff</h2>

<h3 id="the-problem-2">The Problem</h3>

<p>When a session locks mid-conversation, the local model has no context. It doesn’t know what the user was asking about.</p>

<p>But we can’t just forward the full conversation history — it contains the very PII we’re trying to protect.</p>

<h3 id="my-solution-rolling-buffer-with-dual-bounding">My Solution: Rolling Buffer with Dual Bounding</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">dataclass</span>
<span class="k">class</span> <span class="nc">BufferEntry</span><span class="p">:</span>
    <span class="s">"""Single interaction in the buffer."""</span>
    
    <span class="n">role</span><span class="p">:</span> <span class="nb">str</span>  <span class="c1"># "user" or "assistant"
</span>    <span class="n">content</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">timestamp</span><span class="p">:</span> <span class="n">datetime</span> <span class="o">=</span> <span class="n">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="k">lambda</span><span class="p">:</span> <span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">(</span><span class="n">timezone</span><span class="p">.</span><span class="n">utc</span><span class="p">))</span>
    <span class="n">tier</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">0</span>  <span class="c1"># Classification tier when added
</span>    <span class="n">scrubbed</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">False</span>  <span class="c1"># Whether content was scrubbed before storage
</span>    <span class="n">char_count</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">field</span><span class="p">(</span><span class="n">init</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">__post_init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">char_count</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">content</span><span class="p">)</span>


<span class="k">class</span> <span class="nc">RollingBuffer</span><span class="p">:</span>
    <span class="s">"""Rolling buffer with dual bounding.
    
    Bounded by:
    1. Max turns (number of user+assistant pairs)
    2. Max total characters (prevents massive payloads)
    
    When either limit is exceeded, oldest entries are evicted.
    """</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">max_turns</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">5</span><span class="p">,</span>
        <span class="n">max_chars</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">4000</span><span class="p">,</span>  <span class="c1"># ~1000 tokens
</span>        <span class="n">scrub_before_store</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span><span class="p">,</span>
    <span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_max_turns</span> <span class="o">=</span> <span class="n">max_turns</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_max_chars</span> <span class="o">=</span> <span class="n">max_chars</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_entries</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">BufferEntry</span><span class="p">]</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_total_chars</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_lock</span> <span class="o">=</span> <span class="n">threading</span><span class="p">.</span><span class="n">Lock</span><span class="p">()</span>
    
    <span class="o">@</span><span class="n">property</span>
    <span class="k">def</span> <span class="nf">turn_count</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
        <span class="s">"""Number of user turns currently buffered."""</span>
        <span class="k">return</span> <span class="nb">sum</span><span class="p">(</span><span class="mi">1</span> <span class="k">for</span> <span class="n">e</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">_entries</span> <span class="k">if</span> <span class="n">e</span><span class="p">.</span><span class="n">role</span> <span class="o">==</span> <span class="s">"user"</span><span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">add</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">role</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">content</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">tier</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span>
        <span class="n">scrubbed_content</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="s">"""Add an entry, evicting oldest if limits exceeded."""</span>
        <span class="n">final_content</span> <span class="o">=</span> <span class="n">scrubbed_content</span> <span class="k">if</span> <span class="n">scrubbed_content</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span> <span class="k">else</span> <span class="n">content</span>
        <span class="n">entry</span> <span class="o">=</span> <span class="n">BufferEntry</span><span class="p">(</span><span class="n">role</span><span class="o">=</span><span class="n">role</span><span class="p">,</span> <span class="n">content</span><span class="o">=</span><span class="n">final_content</span><span class="p">,</span> <span class="n">tier</span><span class="o">=</span><span class="n">tier</span><span class="p">)</span>
        
        <span class="k">with</span> <span class="bp">self</span><span class="p">.</span><span class="n">_lock</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_entries</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">entry</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_total_chars</span> <span class="o">+=</span> <span class="n">entry</span><span class="p">.</span><span class="n">char_count</span>
            
            <span class="c1"># Enforce character limit (evict oldest until under limit)
</span>            <span class="k">while</span> <span class="bp">self</span><span class="p">.</span><span class="n">_total_chars</span> <span class="o">&gt;</span> <span class="bp">self</span><span class="p">.</span><span class="n">_max_chars</span> <span class="ow">and</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_entries</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">1</span><span class="p">:</span>
                <span class="n">evicted</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_entries</span><span class="p">.</span><span class="n">pop</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
                <span class="bp">self</span><span class="p">.</span><span class="n">_total_chars</span> <span class="o">-=</span> <span class="n">evicted</span><span class="p">.</span><span class="n">char_count</span>
            
            <span class="c1"># Enforce turn limit (evict oldest until under limit)
</span>            <span class="k">while</span> <span class="bp">self</span><span class="p">.</span><span class="n">turn_count</span> <span class="o">&gt;</span> <span class="bp">self</span><span class="p">.</span><span class="n">_max_turns</span> <span class="ow">and</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_entries</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">1</span><span class="p">:</span>
                <span class="n">evicted</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_entries</span><span class="p">.</span><span class="n">pop</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
                <span class="bp">self</span><span class="p">.</span><span class="n">_total_chars</span> <span class="o">-=</span> <span class="n">evicted</span><span class="p">.</span><span class="n">char_count</span>
    
    <span class="k">def</span> <span class="nf">format_for_handoff</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="s">"""Format buffer with XML tags for injection into local model."""</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="p">.</span><span class="n">_entries</span><span class="p">:</span>
            <span class="k">return</span> <span class="s">""</span>
        
        <span class="k">with</span> <span class="bp">self</span><span class="p">.</span><span class="n">_lock</span><span class="p">:</span>
            <span class="n">lines</span> <span class="o">=</span> <span class="p">[]</span>
            <span class="k">for</span> <span class="n">entry</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">_entries</span><span class="p">:</span>
                <span class="n">role_tag</span> <span class="o">=</span> <span class="s">"user_message"</span> <span class="k">if</span> <span class="n">entry</span><span class="p">.</span><span class="n">role</span> <span class="o">==</span> <span class="s">"user"</span> <span class="k">else</span> <span class="s">"assistant_response"</span>
                <span class="n">lines</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"&lt;</span><span class="si">{</span><span class="n">role_tag</span><span class="si">}</span><span class="s">&gt;</span><span class="si">{</span><span class="n">entry</span><span class="p">.</span><span class="n">content</span><span class="si">}</span><span class="s">&lt;/</span><span class="si">{</span><span class="n">role_tag</span><span class="si">}</span><span class="s">&gt;"</span><span class="p">)</span>
            
            <span class="k">return</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">lines</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="design-decision-dual-bounding-vs-simple-truncation">Design Decision: Dual Bounding vs. Simple Truncation</h3>

<p><strong>Trade-off considered:</strong></p>

<table>
  <thead>
    <tr>
      <th>Option</th>
      <th>Pros</th>
      <th>Cons</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Simple truncation (<code class="language-plaintext highlighter-rouge">[:max_chars]</code>)</td>
      <td>Easy to implement</td>
      <td>Cuts mid-sentence, loses coherence</td>
    </tr>
    <tr>
      <td>Turn-based only</td>
      <td>Clean turn boundaries</td>
      <td>Unbounded if turns are long</td>
    </tr>
    <tr>
      <td>Dual bounding (turns + chars)</td>
      <td>Predictable memory, clean boundaries</td>
      <td>More complex eviction logic</td>
    </tr>
  </tbody>
</table>

<p><strong>My choice:</strong> Dual bounding — limit both turns (5) and total characters (4000).</p>

<p><strong>Reasoning:</strong></p>
<ol>
  <li>Evicting oldest complete entries preserves coherence (no mid-sentence cuts)</li>
  <li>Character limit prevents a single massive turn from blowing up context</li>
  <li>Both limits are configurable per deployment</li>
  <li>~4000 chars ≈ 1000 tokens, safe for even small local models</li>
</ol>
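The eviction rule reduces to a small loop. A standalone sketch (the `evict` helper is illustrative, condensed from the `add` method's two while-loops):

```python
from collections import deque

def evict(entries, max_turns=5, max_chars=4000):
    # Evict whole oldest entries until both the character budget
    # and the user-turn budget are satisfied; never drop the newest.
    entries = deque(entries)
    while len(entries) > 1 and (
        sum(len(content) for _, content in entries) > max_chars
        or sum(1 for role, _ in entries if role == "user") > max_turns
    ):
        entries.popleft()
    return list(entries)

history = [
    ("user", "a" * 3000),       # one long early turn...
    ("assistant", "b" * 900),
    ("user", "c" * 500),        # ...pushes the total to 4400 chars
]
kept = evict(history)
assert kept == [("assistant", "b" * 900), ("user", "c" * 500)]
```

Because whole entries are dropped, the survivors are always complete turns, which is exactly the coherence property point 1 claims.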

<h3 id="design-decision-xml-tags-for-context-injection">Design Decision: XML Tags for Context Injection</h3>

<p><strong>Why XML tags instead of plain text?</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Plain text (ambiguous):
User: Help me with X
Assistant: Sure, here's how...
User: Now do Y

XML-tagged (unambiguous):
&lt;user_message&gt;Help me with X&lt;/user_message&gt;
&lt;assistant_response&gt;Sure, here's how...&lt;/assistant_response&gt;
&lt;user_message&gt;Now do Y&lt;/user_message&gt;
</code></pre></div></div>

<p><strong>Reasoning:</strong> Local models (especially smaller ones) can confuse injected history with their own output. XML tags create clear structural boundaries that even 4B parameter models respect.</p>
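A minimal sketch of the tagging step (mirroring `format_for_handoff` above, minus the locking):

```python
def format_for_handoff(entries):
    # Wrap each turn in an unambiguous XML tag so the local model
    # can't confuse injected history with its own output.
    tag = {"user": "user_message", "assistant": "assistant_response"}
    return "\n".join(
        f"<{tag[role]}>{content}</{tag[role]}>" for role, content in entries
    )

handoff = format_for_handoff([
    ("user", "Help me with X"),
    ("assistant", "Sure, here's how..."),
])
assert handoff == (
    "<user_message>Help me with X</user_message>\n"
    "<assistant_response>Sure, here's how...</assistant_response>"
)
```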

<h3 id="pii-scrubbing-with-deterministic-placeholders">PII Scrubbing with Deterministic Placeholders</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">scrub_content_for_buffer</span><span class="p">(</span>
    <span class="n">content</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="n">detected_entities</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">dict</span><span class="p">],</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="s">"""Scrub sensitive entities with hash-based placeholders.
    
    Creates deterministic placeholders so the same entity gets
    the same placeholder across turns (maintains referential coherence).
    """</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">detected_entities</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">content</span>
    
    <span class="n">scrubbed</span> <span class="o">=</span> <span class="n">content</span>
    
    <span class="c1"># Sort by length descending to handle overlapping matches
</span>    <span class="n">sorted_entities</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span>
        <span class="n">detected_entities</span><span class="p">,</span>
        <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">e</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">e</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"value"</span><span class="p">,</span> <span class="s">""</span><span class="p">)),</span>
        <span class="n">reverse</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="p">)</span>
    
    <span class="k">for</span> <span class="n">entity</span> <span class="ow">in</span> <span class="n">sorted_entities</span><span class="p">:</span>
        <span class="n">value</span> <span class="o">=</span> <span class="n">entity</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"value"</span><span class="p">,</span> <span class="s">""</span><span class="p">)</span>
        <span class="n">entity_type</span> <span class="o">=</span> <span class="n">entity</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"type"</span><span class="p">,</span> <span class="s">"REDACTED"</span><span class="p">)</span>
        
        <span class="k">if</span> <span class="n">value</span> <span class="ow">and</span> <span class="n">value</span> <span class="ow">in</span> <span class="n">scrubbed</span><span class="p">:</span>
            <span class="c1"># Deterministic placeholder from hash
</span>            <span class="n">hash_suffix</span> <span class="o">=</span> <span class="n">hashlib</span><span class="p">.</span><span class="n">sha256</span><span class="p">(</span><span class="n">value</span><span class="p">.</span><span class="n">encode</span><span class="p">()).</span><span class="n">hexdigest</span><span class="p">()[:</span><span class="mi">6</span><span class="p">]</span>
            <span class="n">placeholder</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">entity_type</span><span class="p">.</span><span class="n">upper</span><span class="p">()</span><span class="si">}</span><span class="s">_</span><span class="si">{</span><span class="n">hash_suffix</span><span class="si">}</span><span class="s">]"</span>
            <span class="n">scrubbed</span> <span class="o">=</span> <span class="n">scrubbed</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="n">value</span><span class="p">,</span> <span class="n">placeholder</span><span class="p">)</span>
    
    <span class="k">return</span> <span class="n">scrubbed</span>
</code></pre></div></div>

<p><strong>Why hash-based placeholders?</strong></p>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>Example</th>
      <th>Problem</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Simple redaction</td>
      <td><code class="language-plaintext highlighter-rouge">[REDACTED]</code></td>
      <td>“Send to [REDACTED] and CC [REDACTED]” — which is which?</td>
    </tr>
    <tr>
      <td>Numbered</td>
      <td><code class="language-plaintext highlighter-rouge">[PERSON_1]</code>, <code class="language-plaintext highlighter-rouge">[PERSON_2]</code></td>
      <td>Requires state tracking across turns</td>
    </tr>
    <tr>
      <td>Hash-based</td>
      <td><code class="language-plaintext highlighter-rouge">[PERSON_a3f2c1]</code></td>
      <td>None: same entity → same placeholder, with no state to track</td>
    </tr>
  </tbody>
</table>

<p>The hash suffix preserves referential identity: if “John Smith” appears in turns 1 and 3, both become <code class="language-plaintext highlighter-rouge">[PERSON_a3f2c1]</code>, so the local model understands they refer to the same entity.</p>
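<p>The determinism is easy to check in isolation. A minimal sketch of the scheme (the function name is illustrative, not the actual module's):</p>

```python
import hashlib

def placeholder_for(value: str, entity_type: str = "PERSON") -> str:
    """Deterministic placeholder: the same value always maps to the same tag."""
    suffix = hashlib.sha256(value.encode()).hexdigest()[:6]
    return f"[{entity_type.upper()}_{suffix}]"

# The same entity in turn 1 and turn 3 yields the same placeholder, statelessly:
assert placeholder_for("John Smith") == placeholder_for("John Smith")
assert placeholder_for("John Smith") != placeholder_for("Jane Doe")
```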

<h3 id="full-handoff-system-prompt">Full Handoff System Prompt</h3>

<p>The actual handoff injects a complete system prompt with capability guardrails:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">create_handoff_system_prompt</span><span class="p">(</span>
    <span class="nb">buffer</span><span class="p">:</span> <span class="n">RollingBuffer</span><span class="p">,</span>
    <span class="n">capability_guardrail</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="s">"""Create system prompt for local model handoff."""</span>
    <span class="n">parts</span> <span class="o">=</span> <span class="p">[]</span>
    
    <span class="c1"># Capability guardrail (what the local model cannot do)
</span>    <span class="k">if</span> <span class="n">capability_guardrail</span><span class="p">:</span>
        <span class="n">parts</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="s">"""&lt;capability_restrictions&gt;
You are operating in a SECURE LOCAL environment with the following restrictions:
- You have NO access to the internet or web browsing
- You have NO access to external APIs or services
- You have NO access to databases or file systems
- You CANNOT make HTTP requests or fetch external data
- You MUST answer based solely on your training knowledge and the conversation context

If the user asks for anything requiring external access, politely explain 
that you cannot perform that action in this secure environment.
&lt;/capability_restrictions&gt;
"""</span><span class="p">)</span>
    
    <span class="c1"># Historical context from buffer
</span>    <span class="n">history</span> <span class="o">=</span> <span class="nb">buffer</span><span class="p">.</span><span class="n">format_for_handoff</span><span class="p">()</span>
    <span class="k">if</span> <span class="n">history</span><span class="p">:</span>
        <span class="n">parts</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"""&lt;historical_context&gt;
The following is the conversation history from this session.

</span><span class="si">{</span><span class="n">history</span><span class="si">}</span><span class="s">
&lt;/historical_context&gt;
"""</span><span class="p">)</span>
    
    <span class="c1"># Current request instructions
</span>    <span class="n">parts</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="s">"""&lt;instructions&gt;
Respond to the user's current message below. Maintain conversational 
context from the history if provided. Be helpful, accurate, and concise.
&lt;/instructions&gt;
"""</span><span class="p">)</span>
    
    <span class="k">return</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">parts</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="design-decision-capability-guardrails">Design Decision: Capability Guardrails</h3>

<p><strong>Why inject capability restrictions?</strong></p>

<p>When a session locks to local, the user might still ask “search the web for X” or “check my calendar.” Without guardrails, the local model might:</p>
<ol>
  <li>Hallucinate search results</li>
  <li>Pretend it has capabilities it doesn’t</li>
  <li>Confuse the user about what happened</li>
</ol>

<p>The capability guardrail ensures the local model responds honestly: “I can’t access the web in this secure environment.”</p>

<hr />

<h2 id="component-4-backend-manager">Component 4: Backend Manager</h2>

<h3 id="the-problem-3">The Problem</h3>

<p>I needed to support multiple backends with different selection strategies:</p>

<ul>
  <li><strong>Local:</strong> Multiple Ollama models (gemma3:4b, mistral) on the same machine</li>
  <li><strong>Cloud:</strong> Anthropic (claude-sonnet-4) and Google (gemini-2.0-flash) with configurable selection</li>
</ul>

<h3 id="architecture-separate-local-and-cloud-management">Architecture: Separate Local and Cloud Management</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">BackendManager</span><span class="p">:</span>
    <span class="s">"""Manages multiple inference backends with health checking and selection."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">config</span><span class="p">:</span> <span class="n">LocalBackendsConfig</span><span class="p">,</span>
        <span class="n">cloud_selection_strategy</span><span class="p">:</span> <span class="n">Literal</span><span class="p">[</span><span class="s">"primary_fallback"</span><span class="p">,</span> <span class="s">"round_robin"</span><span class="p">]</span> <span class="o">=</span> <span class="s">"primary_fallback"</span><span class="p">,</span>
        <span class="n">cloud_primary</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"anthropic"</span><span class="p">,</span>
        <span class="n">cloud_fallback</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"google"</span><span class="p">,</span>
    <span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_config</span> <span class="o">=</span> <span class="n">config</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_local_backends</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">OllamaBackend</span><span class="p">]</span> <span class="o">=</span> <span class="p">{}</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_cloud_backends</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">BaseBackend</span><span class="p">]</span> <span class="o">=</span> <span class="p">{}</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_health_status</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">bool</span><span class="p">]</span> <span class="o">=</span> <span class="p">{}</span>
        
        <span class="c1"># Cloud selection config
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">_cloud_selection_strategy</span> <span class="o">=</span> <span class="n">cloud_selection_strategy</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_cloud_primary</span> <span class="o">=</span> <span class="n">cloud_primary</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_cloud_fallback</span> <span class="o">=</span> <span class="n">cloud_fallback</span>
        
        <span class="c1"># Round-robin state (separate counters for local and cloud)
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">_cloud_rr_index</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_local_rr_index</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_lock</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">Lock</span><span class="p">()</span>
</code></pre></div></div>

<h3 id="selection-strategies">Selection Strategies</h3>

<p><strong>Local backends</strong> support three strategies (from config):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">def</span> <span class="nf">select_local_backend</span><span class="p">(</span>
    <span class="bp">self</span><span class="p">,</span>
    <span class="n">strategy</span><span class="p">:</span> <span class="n">Literal</span><span class="p">[</span><span class="s">"priority"</span><span class="p">,</span> <span class="s">"round_robin"</span><span class="p">,</span> <span class="s">"latency_best"</span><span class="p">]</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">OllamaBackend</span> <span class="o">|</span> <span class="bp">None</span><span class="p">:</span>
    <span class="s">"""Select a local backend based on the configured strategy."""</span>
    <span class="n">strategy</span> <span class="o">=</span> <span class="n">strategy</span> <span class="ow">or</span> <span class="bp">self</span><span class="p">.</span><span class="n">_config</span><span class="p">.</span><span class="n">selection_strategy</span>
    <span class="n">healthy</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">get_healthy_local_backends</span><span class="p">()</span>

    <span class="k">if</span> <span class="ow">not</span> <span class="n">healthy</span><span class="p">:</span>
        <span class="k">return</span> <span class="bp">None</span>

    <span class="k">if</span> <span class="n">strategy</span> <span class="o">==</span> <span class="s">"priority"</span><span class="p">:</span>
        <span class="c1"># Sort by priority value from config, lowest wins
</span>        <span class="n">sorted_backends</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span>
            <span class="n">healthy</span><span class="p">,</span>
            <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">b</span><span class="p">:</span> <span class="nb">next</span><span class="p">(</span>
                <span class="n">e</span><span class="p">.</span><span class="n">priority</span> <span class="k">for</span> <span class="n">e</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">_config</span><span class="p">.</span><span class="n">endpoints</span> <span class="k">if</span> <span class="n">e</span><span class="p">.</span><span class="n">name</span> <span class="o">==</span> <span class="n">b</span><span class="p">.</span><span class="n">endpoint_name</span>
            <span class="p">),</span>
        <span class="p">)</span>
        <span class="k">return</span> <span class="n">sorted_backends</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>

    <span class="k">elif</span> <span class="n">strategy</span> <span class="o">==</span> <span class="s">"round_robin"</span><span class="p">:</span>
        <span class="k">async</span> <span class="k">with</span> <span class="bp">self</span><span class="p">.</span><span class="n">_lock</span><span class="p">:</span>
            <span class="n">backend</span> <span class="o">=</span> <span class="n">healthy</span><span class="p">[</span><span class="bp">self</span><span class="p">.</span><span class="n">_local_rr_index</span> <span class="o">%</span> <span class="nb">len</span><span class="p">(</span><span class="n">healthy</span><span class="p">)]</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_local_rr_index</span> <span class="o">+=</span> <span class="mi">1</span>
            <span class="k">return</span> <span class="n">backend</span>

    <span class="k">elif</span> <span class="n">strategy</span> <span class="o">==</span> <span class="s">"latency_best"</span><span class="p">:</span>
        <span class="c1"># TODO: Implement actual latency tracking
</span>        <span class="k">return</span> <span class="n">healthy</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>

    <span class="k">return</span> <span class="n">healthy</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</code></pre></div></div>

<p><strong>Cloud backends</strong> support two strategies:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">select_cloud_backend</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">preferred</span><span class="p">:</span> <span class="nb">str</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">BaseBackend</span> <span class="o">|</span> <span class="bp">None</span><span class="p">:</span>
    <span class="s">"""Select a cloud backend based on configured strategy.

    Strategies:
    - primary_fallback: Try primary (anthropic), then fallback (google)
    - round_robin: Alternate between healthy backends
    """</span>
    <span class="n">healthy</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">get_healthy_cloud_backends</span><span class="p">()</span>

    <span class="k">if</span> <span class="ow">not</span> <span class="n">healthy</span><span class="p">:</span>
        <span class="k">return</span> <span class="bp">None</span>

    <span class="c1"># If preferred backend specified and healthy, use it
</span>    <span class="k">if</span> <span class="n">preferred</span> <span class="ow">and</span> <span class="n">preferred</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">_cloud_backends</span><span class="p">:</span>
        <span class="n">backend</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_cloud_backends</span><span class="p">[</span><span class="n">preferred</span><span class="p">]</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">_health_status</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">preferred</span><span class="p">,</span> <span class="bp">False</span><span class="p">):</span>
            <span class="k">return</span> <span class="n">backend</span>

    <span class="c1"># Apply selection strategy
</span>    <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">_cloud_selection_strategy</span> <span class="o">==</span> <span class="s">"round_robin"</span><span class="p">:</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">_select_cloud_round_robin</span><span class="p">(</span><span class="n">healthy</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="c1"># primary_fallback (default)
</span>        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">_select_cloud_primary_fallback</span><span class="p">(</span><span class="n">healthy</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="design-decision-why-separate-local-and-cloud-selection">Design Decision: Why Separate Local and Cloud Selection?</h3>

<p><strong>Trade-off considered:</strong></p>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>Pros</th>
      <th>Cons</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Unified strategy</td>
      <td>Simpler code, one enum</td>
      <td>Cloud and local have different needs</td>
    </tr>
    <tr>
      <td>Separate strategies</td>
      <td>Tailored to each tier</td>
      <td>More configuration surface</td>
    </tr>
  </tbody>
</table>

<p><strong>My choice:</strong> Separate strategies.</p>

<p><strong>Reasoning:</strong></p>
<ol>
  <li><strong>Local models are interchangeable:</strong> gemma3:4b and mistral are both “good enough” — round-robin makes sense</li>
  <li><strong>Cloud models differ significantly:</strong> Anthropic Claude excels at reasoning; Gemini excels at speed and multimodal — primary/fallback lets me choose</li>
  <li><strong>Cost considerations:</strong> Cloud round-robin might accidentally route expensive requests to the pricier provider</li>
  <li><strong>Session stickiness:</strong> Cloud backends benefit from sticky routing (consistent model personality within a session)</li>
</ol>

<h3 id="session-sticky-backend-selection">Session-Sticky Backend Selection</h3>

<p>The backend manager supports session stickiness — once a session uses a specific backend, subsequent requests prefer that backend:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">def</span> <span class="nf">generate_cloud</span><span class="p">(</span>
    <span class="bp">self</span><span class="p">,</span>
    <span class="n">messages</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">],</span>
    <span class="n">preferred_backend</span><span class="p">:</span> <span class="nb">str</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
    <span class="n">sticky_backend</span><span class="p">:</span> <span class="nb">str</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>  <span class="c1"># From session state
</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">InferenceResult</span><span class="p">,</span> <span class="n">BaseBackend</span> <span class="o">|</span> <span class="bp">None</span><span class="p">]:</span>
    <span class="s">"""Generate using cloud with optional stickiness."""</span>
    
    <span class="c1"># Sticky backend takes precedence over preferred
</span>    <span class="n">effective_preferred</span> <span class="o">=</span> <span class="n">sticky_backend</span> <span class="ow">or</span> <span class="n">preferred_backend</span>
    
    <span class="n">backend</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">select_cloud_backend</span><span class="p">(</span><span class="n">effective_preferred</span><span class="p">)</span>
    <span class="c1"># ... rest of generation
</span></code></pre></div></div>

<p><strong>Why stickiness matters:</strong></p>
<ul>
  <li>Model personality consistency within a conversation</li>
  <li>Avoids jarring style shifts mid-conversation</li>
  <li>Round-robin still applies to <em>new</em> sessions</li>
</ul>

<h3 id="health-checking">Health Checking</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">def</span> <span class="nf">refresh_health</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">bool</span><span class="p">]:</span>
    <span class="s">"""Refresh health status for all backends concurrently."""</span>
    
    <span class="c1"># Build tasks for all backends
</span>    <span class="n">local_tasks</span> <span class="o">=</span> <span class="p">{</span>
        <span class="n">name</span><span class="p">:</span> <span class="n">backend</span><span class="p">.</span><span class="n">health_check</span><span class="p">()</span>
        <span class="k">for</span> <span class="n">name</span><span class="p">,</span> <span class="n">backend</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">_local_backends</span><span class="p">.</span><span class="n">items</span><span class="p">()</span>
    <span class="p">}</span>
    <span class="n">cloud_tasks</span> <span class="o">=</span> <span class="p">{</span>
        <span class="n">name</span><span class="p">:</span> <span class="n">backend</span><span class="p">.</span><span class="n">health_check</span><span class="p">()</span>
        <span class="k">for</span> <span class="n">name</span><span class="p">,</span> <span class="n">backend</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">_cloud_backends</span><span class="p">.</span><span class="n">items</span><span class="p">()</span>
    <span class="p">}</span>
    
    <span class="n">all_tasks</span> <span class="o">=</span> <span class="p">{</span><span class="o">**</span><span class="n">local_tasks</span><span class="p">,</span> <span class="o">**</span><span class="n">cloud_tasks</span><span class="p">}</span>
    
    <span class="c1"># Run all health checks concurrently
</span>    <span class="n">results</span> <span class="o">=</span> <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">gather</span><span class="p">(</span><span class="o">*</span><span class="n">all_tasks</span><span class="p">.</span><span class="n">values</span><span class="p">(),</span> <span class="n">return_exceptions</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    
    <span class="k">for</span> <span class="n">name</span><span class="p">,</span> <span class="n">result</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">all_tasks</span><span class="p">.</span><span class="n">keys</span><span class="p">(),</span> <span class="n">results</span><span class="p">):</span>
        <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">result</span><span class="p">,</span> <span class="nb">Exception</span><span class="p">):</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_health_status</span><span class="p">[</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="bp">False</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_health_status</span><span class="p">[</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">result</span>
    
    <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">_health_status</span><span class="p">.</span><span class="n">copy</span><span class="p">()</span>
</code></pre></div></div>

<p>For Ollama, the health check hits <code class="language-plaintext highlighter-rouge">/api/tags</code> and verifies the configured model is available:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">def</span> <span class="nf">health_check</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
    <span class="s">"""Check if Ollama is healthy and the model is available."""</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">client</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="p">.</span><span class="n">_get_client</span><span class="p">()</span>
        <span class="n">response</span> <span class="o">=</span> <span class="k">await</span> <span class="n">client</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"/api/tags"</span><span class="p">)</span>
        <span class="n">response</span><span class="p">.</span><span class="n">raise_for_status</span><span class="p">()</span>

        <span class="n">data</span> <span class="o">=</span> <span class="n">response</span><span class="p">.</span><span class="n">json</span><span class="p">()</span>
        <span class="n">models</span> <span class="o">=</span> <span class="p">[</span><span class="n">m</span><span class="p">[</span><span class="s">"name"</span><span class="p">]</span> <span class="k">for</span> <span class="n">m</span> <span class="ow">in</span> <span class="n">data</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"models"</span><span class="p">,</span> <span class="p">[])]</span>

        <span class="c1"># Check if our configured model is available
</span>        <span class="n">model_available</span> <span class="o">=</span> <span class="nb">any</span><span class="p">(</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_endpoint</span><span class="p">.</span><span class="n">model</span> <span class="ow">in</span> <span class="n">m</span> <span class="ow">or</span> <span class="n">m</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">_endpoint</span><span class="p">.</span><span class="n">model</span>
            <span class="k">for</span> <span class="n">m</span> <span class="ow">in</span> <span class="n">models</span>
        <span class="p">)</span>
        <span class="k">return</span> <span class="n">model_available</span>
    <span class="k">except</span> <span class="nb">Exception</span><span class="p">:</span>
        <span class="k">return</span> <span class="bp">False</span>
</code></pre></div></div>
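<p>Note that the containment test above is deliberately bidirectional: a configured short name like <code class="language-plaintext highlighter-rouge">mistral</code> should match Ollama's fully tagged <code class="language-plaintext highlighter-rouge">mistral:latest</code>, and vice versa. Extracted as a standalone check (illustrative helper, not the actual class method):</p>

```python
def model_matches(configured: str, available: list[str]) -> bool:
    # Either side may contain the other, e.g. "mistral" vs "mistral:latest"
    return any(configured in m or m in configured for m in available)
```

The looseness is a trade-off: it tolerates tag-suffix differences but would also accept an unrelated model whose name happens to contain the configured string.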

<h3 id="design-decision-ping-vs-inference-health-check">Design Decision: Ping vs. Inference Health Check</h3>

<p><strong>Trade-off considered:</strong></p>

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th>Pros</th>
      <th>Cons</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>HTTP ping (<code class="language-plaintext highlighter-rouge">/api/tags</code>)</td>
      <td>Fast (~10ms), low overhead</td>
      <td>Doesn’t verify model is loaded in memory</td>
    </tr>
    <tr>
      <td>Minimal inference</td>
      <td>Verifies end-to-end</td>
      <td>Slow (~2s), wastes compute</td>
    </tr>
  </tbody>
</table>

<p><strong>My choice:</strong> HTTP ping to Ollama’s <code class="language-plaintext highlighter-rouge">/api/tags</code> endpoint, then verify model is in the list.</p>

<p><strong>Reasoning:</strong> Ollama keeps models resident in memory after first load, so if the HTTP endpoint responds and lists our model, it’s ready to serve. A full inference check would add ~2s per model per check cycle.</p>

<h3 id="cross-tier-failover">Cross-Tier Failover</h3>

<p>The backend manager can fail over from cloud to local when the cloud generation path returns an error:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">def</span> <span class="nf">generate_routed</span><span class="p">(</span>
    <span class="bp">self</span><span class="p">,</span>
    <span class="n">messages</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">],</span>
    <span class="n">route</span><span class="p">:</span> <span class="n">Literal</span><span class="p">[</span><span class="s">"local"</span><span class="p">,</span> <span class="s">"cloud"</span><span class="p">],</span>
    <span class="n">preferred_cloud_backend</span><span class="p">:</span> <span class="nb">str</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
    <span class="n">sticky_backend</span><span class="p">:</span> <span class="nb">str</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">InferenceResult</span><span class="p">,</span> <span class="n">BaseBackend</span> <span class="o">|</span> <span class="bp">None</span><span class="p">,</span> <span class="n">Literal</span><span class="p">[</span><span class="s">"local"</span><span class="p">,</span> <span class="s">"cloud"</span><span class="p">]]:</span>
    <span class="s">"""Generate with automatic failover."""</span>
    
    <span class="k">if</span> <span class="n">route</span> <span class="o">==</span> <span class="s">"cloud"</span><span class="p">:</span>
        <span class="n">result</span><span class="p">,</span> <span class="n">backend</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="p">.</span><span class="n">generate_cloud</span><span class="p">(</span>
            <span class="n">messages</span><span class="p">,</span> <span class="n">preferred_cloud_backend</span><span class="p">,</span> <span class="n">sticky_backend</span>
        <span class="p">)</span>
        
        <span class="k">if</span> <span class="n">result</span><span class="p">.</span><span class="n">error</span> <span class="ow">and</span> <span class="bp">self</span><span class="p">.</span><span class="n">_config</span><span class="p">.</span><span class="n">failover_enabled</span><span class="p">:</span>
            <span class="c1"># Fallback to local if cloud fails
</span>            <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Cloud failed, falling back to local"</span><span class="p">)</span>
            <span class="n">result</span><span class="p">,</span> <span class="n">backend</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="p">.</span><span class="n">generate</span><span class="p">(</span><span class="n">messages</span><span class="p">,</span> <span class="n">sticky_backend</span><span class="o">=</span><span class="n">sticky_backend</span><span class="p">)</span>
            <span class="k">return</span> <span class="n">result</span><span class="p">,</span> <span class="n">backend</span><span class="p">,</span> <span class="s">"local"</span>  <span class="c1"># Note: actual route changed
</span>        
        <span class="k">return</span> <span class="n">result</span><span class="p">,</span> <span class="n">backend</span><span class="p">,</span> <span class="s">"cloud"</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">result</span><span class="p">,</span> <span class="n">backend</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="p">.</span><span class="n">generate</span><span class="p">(</span><span class="n">messages</span><span class="p">,</span> <span class="n">sticky_backend</span><span class="o">=</span><span class="n">sticky_backend</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">result</span><span class="p">,</span> <span class="n">backend</span><span class="p">,</span> <span class="s">"local"</span>
</code></pre></div></div>

<p><strong>Why cloud → local failover?</strong></p>

<ol>
  <li><strong>Graceful degradation:</strong> If Anthropic and Google both have outages, users still get responses</li>
  <li><strong>Tier 0-1 traffic is safe for local:</strong> No PII, so local fallback doesn’t violate privacy</li>
  <li><strong>Return actual route:</strong> Caller knows the response came from local, not cloud (important for billing, metrics)</li>
</ol>
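<p>Point 3 is easy to get wrong: if metrics are keyed on the <em>requested</em> route, every failover gets billed as cloud. A hedged sketch of caller-side accounting on the actual route (the counter scheme is illustrative, not the project’s code):</p>

```python
from collections import Counter

# Counters keyed on the ACTUAL route returned by generate_routed,
# not the route the caller asked for -- otherwise failovers are misattributed.
route_counts: Counter[str] = Counter()


def record_route(requested: str, actual: str) -> None:
    """Attribute the request to the route that actually served it."""
    route_counts[actual] += 1
    if actual != requested:
        # Track failovers separately so they show up in dashboards
        route_counts[f"failover:{requested}->{actual}"] += 1


# One cloud request failed over to local; one stayed on cloud
record_route("cloud", "local")
record_route("cloud", "cloud")
```
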

<hr />

<h2 id="component-5-shadow-mode">Component 5: Shadow Mode</h2>

<h3 id="the-problem-4">The Problem</h3>

<p>How do I know if local inference is “good enough” to replace cloud for certain traffic?</p>

<h3 id="my-solution-sequential-execution-with-background-comparison">My Solution: Sequential Execution with Background Comparison</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>User Request (Tier 0 or 1)
         │
         ▼
    Cloud Backend ────────────▶ Response to User (immediate)
         │
         │ (cloud result captured)
         │
         ▼
    ┌──────────────────┐
    │ Fire-and-Forget  │
    │ Background Task  │
    └────────┬─────────┘
             │
             ▼
    Local Backend ──────────▶ Compare ──────────▶ Log Metrics
    (async, non-blocking)
</code></pre></div></div>

<p><strong>Key insight:</strong> Shadow mode is NOT truly parallel. Cloud executes first, returns to user immediately, THEN the shadow task is triggered with the cloud result passed in. This ensures zero latency impact on the user.</p>

<h3 id="implementation-1">Implementation</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">dataclass</span>
<span class="k">class</span> <span class="nc">ShadowConfig</span><span class="p">:</span>
    <span class="s">"""Configuration for shadow mode."""</span>
    
    <span class="n">enabled</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">False</span>
    
    <span class="c1"># Which tiers to shadow (only safe tiers)
</span>    <span class="n">shadow_tiers</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="n">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="k">lambda</span><span class="p">:</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span>
    
    <span class="c1"># Sampling rate (0.0 to 1.0)
</span>    <span class="n">sample_rate</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">1.0</span>  <span class="c1"># 100% of eligible requests
</span>    
    <span class="c1"># Similarity scoring
</span>    <span class="n">similarity_enabled</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span>
    <span class="n">similarity_model</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"fast"</span>  <span class="c1"># "fast", "balanced", "accurate"
</span>    
    <span class="c1"># Timeouts
</span>    <span class="n">local_timeout_seconds</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">60.0</span>
    
    <span class="c1"># Storage
</span>    <span class="n">store_responses</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">False</span>  <span class="c1"># Memory heavy
</span>    <span class="n">max_stored_results</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">1000</span>


<span class="k">class</span> <span class="nc">ShadowRunner</span><span class="p">:</span>
    <span class="s">"""Runs shadow mode comparisons between local and cloud models.
    
    Shadow mode is non-blocking - cloud response is returned immediately
    while local inference runs in the background.
    """</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">config</span><span class="p">:</span> <span class="n">ShadowConfig</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">config</span> <span class="o">=</span> <span class="n">config</span> <span class="ow">or</span> <span class="n">ShadowConfig</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_similarity</span> <span class="o">=</span> <span class="n">get_similarity_scorer</span><span class="p">()</span>
        
        <span class="c1"># Results storage (circular buffer)
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">_results</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">ShadowResult</span><span class="p">]</span> <span class="o">=</span> <span class="p">[]</span>
        
        <span class="c1"># Background tasks tracking
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">_pending_tasks</span><span class="p">:</span> <span class="nb">set</span><span class="p">[</span><span class="n">asyncio</span><span class="p">.</span><span class="n">Task</span><span class="p">]</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span>
        
        <span class="c1"># Internal metrics
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">_total_shadows</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_successful_shadows</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_quality_matches</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_total_cost_savings</span> <span class="o">=</span> <span class="mf">0.0</span>
    
    <span class="k">def</span> <span class="nf">should_shadow</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">privacy_tier</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
        <span class="s">"""Determine if this request should be shadowed."""</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">enabled</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">False</span>
        
        <span class="c1"># Only shadow safe tiers (0 and 1 by default)
</span>        <span class="k">if</span> <span class="n">privacy_tier</span> <span class="ow">not</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">shadow_tiers</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">False</span>
        
        <span class="c1"># Apply sampling rate
</span>        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">sample_rate</span> <span class="o">&lt;</span> <span class="mf">1.0</span><span class="p">:</span>
            <span class="kn">import</span> <span class="nn">random</span>
            <span class="k">if</span> <span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">()</span> <span class="o">&gt;</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">sample_rate</span><span class="p">:</span>
                <span class="k">return</span> <span class="bp">False</span>
        
        <span class="k">return</span> <span class="bp">True</span>
    
    <span class="k">async</span> <span class="k">def</span> <span class="nf">run_shadow</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">request_id</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">messages</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">],</span>
        <span class="n">cloud_result</span><span class="p">:</span> <span class="n">InferenceResult</span><span class="p">,</span>    <span class="c1"># Cloud result already available
</span>        <span class="n">cloud_backend_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">cloud_latency_ms</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span>
        <span class="n">privacy_tier</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
        <span class="n">backend_manager</span><span class="p">:</span> <span class="n">BackendManager</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="s">"""Run shadow inference on local backend.
        
        This method is fire-and-forget — it schedules the shadow
        task and returns immediately.
        """</span>
        <span class="n">task</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">create_task</span><span class="p">(</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_run_shadow_async</span><span class="p">(</span>
                <span class="n">request_id</span><span class="o">=</span><span class="n">request_id</span><span class="p">,</span>
                <span class="n">messages</span><span class="o">=</span><span class="n">messages</span><span class="p">,</span>
                <span class="n">cloud_result</span><span class="o">=</span><span class="n">cloud_result</span><span class="p">,</span>
                <span class="n">cloud_backend_name</span><span class="o">=</span><span class="n">cloud_backend_name</span><span class="p">,</span>
                <span class="n">cloud_latency_ms</span><span class="o">=</span><span class="n">cloud_latency_ms</span><span class="p">,</span>
                <span class="n">privacy_tier</span><span class="o">=</span><span class="n">privacy_tier</span><span class="p">,</span>
                <span class="n">backend_manager</span><span class="o">=</span><span class="n">backend_manager</span><span class="p">,</span>
            <span class="p">)</span>
        <span class="p">)</span>
        
        <span class="c1"># Track task for cleanup
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">_pending_tasks</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">task</span><span class="p">)</span>
        <span class="n">task</span><span class="p">.</span><span class="n">add_done_callback</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_pending_tasks</span><span class="p">.</span><span class="n">discard</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="design-decision-why-sequential-not-parallel">Design Decision: Why Sequential, Not Parallel?</h3>

<p><strong>Trade-off considered:</strong></p>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>User latency impact</th>
      <th>Resource usage</th>
      <th>Implementation</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>True parallel (both start together)</td>
      <td>+0ms (but ties up local)</td>
      <td>2x concurrent</td>
      <td>Complex coordination</td>
    </tr>
    <tr>
      <td>Sequential (cloud first, then shadow)</td>
      <td>+0ms</td>
      <td>1x then 1x</td>
      <td>Simple, clean</td>
    </tr>
  </tbody>
</table>

<p><strong>My choice:</strong> Sequential with fire-and-forget.</p>

<p><strong>Reasoning:</strong></p>
<ol>
  <li><strong>User doesn’t wait:</strong> Cloud returns immediately; shadow runs after</li>
  <li><strong>Simpler state:</strong> the shadow task receives the cloud result directly; no cross-task coordination is needed</li>
  <li><strong>Resource efficient:</strong> Local inference only starts after cloud completes</li>
  <li><strong>Timeout is per-shadow:</strong> If local times out, we just lose that data point</li>
</ol>
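<p>Points 3 and 4 can be sketched in plain asyncio. The wrapper below is illustrative, not the project’s code; it shows the two details that matter — a per-shadow timeout that silently drops the data point, and keeping a strong reference to the task so the garbage collector can’t cancel it mid-flight:</p>

```python
import asyncio

pending: set[asyncio.Task] = set()


def fire_and_forget(coro, timeout: float = 60.0) -> asyncio.Task:
    """Schedule a shadow coroutine; a timeout loses one data point, nothing more."""

    async def _guarded():
        try:
            return await asyncio.wait_for(coro, timeout=timeout)
        except asyncio.TimeoutError:
            return None  # drop this comparison silently

    task = asyncio.create_task(_guarded())
    # Hold a strong reference until done, or the event loop may drop the task
    pending.add(task)
    task.add_done_callback(pending.discard)
    return task


async def main():
    async def slow_shadow():
        await asyncio.sleep(0.05)  # stand-in for local inference
        return "compared"

    t = fire_and_forget(slow_shadow(), timeout=1.0)
    # In real code the caller returns to the user here; we await only to demo
    assert await t == "compared"


asyncio.run(main())
```
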

<h3 id="shadow-result-structure">Shadow Result Structure</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">dataclass</span>
<span class="k">class</span> <span class="nc">ShadowResult</span><span class="p">:</span>
    <span class="s">"""Result of a shadow mode comparison."""</span>
    
    <span class="c1"># Identifiers
</span>    <span class="n">shadow_id</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">request_id</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">timestamp</span><span class="p">:</span> <span class="nb">str</span>
    
    <span class="c1"># Models used
</span>    <span class="n">cloud_model</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">local_model</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">cloud_backend</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">local_backend</span><span class="p">:</span> <span class="nb">str</span>
    
    <span class="c1"># Latency comparison
</span>    <span class="n">cloud_latency_ms</span><span class="p">:</span> <span class="nb">float</span>
    <span class="n">local_latency_ms</span><span class="p">:</span> <span class="nb">float</span>
    <span class="n">latency_diff_ms</span><span class="p">:</span> <span class="nb">float</span>  <span class="c1"># local - cloud (negative = local faster)
</span>    
    <span class="c1"># Cost comparison
</span>    <span class="n">cloud_cost_usd</span><span class="p">:</span> <span class="nb">float</span>
    <span class="n">local_cost_usd</span><span class="p">:</span> <span class="nb">float</span>  <span class="c1"># Usually 0 for local
</span>    <span class="n">cost_savings_usd</span><span class="p">:</span> <span class="nb">float</span>
    
    <span class="c1"># Quality comparison
</span>    <span class="n">similarity</span><span class="p">:</span> <span class="n">SimilarityResult</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span>
    
    <span class="o">@</span><span class="nb">property</span>
    <span class="k">def</span> <span class="nf">is_quality_match</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
        <span class="s">"""Check if local quality matches cloud (threshold: 0.75)."""</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">similarity</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">False</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">similarity</span><span class="p">.</span><span class="n">is_quality_match</span>
    
    <span class="o">@</span><span class="nb">property</span>
    <span class="k">def</span> <span class="nf">local_is_faster</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
        <span class="s">"""Check if local was faster than cloud."""</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">latency_diff_ms</span> <span class="o">&lt;</span> <span class="mi">0</span>
</code></pre></div></div>

<h3 id="similarity-computation-with-sentence-transformers">Similarity Computation with Sentence Transformers</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">SimilarityScorer</span><span class="p">:</span>
    <span class="s">"""Computes semantic similarity using sentence-transformers."""</span>
    
    <span class="c1"># Available models (speed vs accuracy tradeoff)
</span>    <span class="n">MODELS</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s">"fast"</span><span class="p">:</span> <span class="s">"all-MiniLM-L6-v2"</span><span class="p">,</span>        <span class="c1"># 80MB, 384 dims
</span>        <span class="s">"balanced"</span><span class="p">:</span> <span class="s">"all-mpnet-base-v2"</span><span class="p">,</span>    <span class="c1"># 420MB, 768 dims
</span>        <span class="s">"accurate"</span><span class="p">:</span> <span class="s">"all-roberta-large-v1"</span><span class="p">,</span> <span class="c1"># 1.3GB, 1024 dims
</span>    <span class="p">}</span>
    
    <span class="c1"># Similarity interpretation thresholds
</span>    <span class="n">THRESHOLDS</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s">"high"</span><span class="p">:</span> <span class="mf">0.85</span><span class="p">,</span>    <span class="c1"># Responses are very similar
</span>        <span class="s">"medium"</span><span class="p">:</span> <span class="mf">0.70</span><span class="p">,</span>  <span class="c1"># Responses convey similar meaning
</span>        <span class="s">"low"</span><span class="p">:</span> <span class="mf">0.0</span><span class="p">,</span>      <span class="c1"># Responses differ significantly
</span>    <span class="p">}</span>
    
    <span class="k">async</span> <span class="k">def</span> <span class="nf">compute_similarity</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">cloud_response</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">local_response</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">SimilarityResult</span><span class="p">:</span>
        <span class="s">"""Compute semantic similarity between two responses."""</span>
        
        <span class="c1"># Load model lazily
</span>        <span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="p">.</span><span class="n">_initialized</span><span class="p">:</span>
            <span class="k">await</span> <span class="bp">self</span><span class="p">.</span><span class="n">initialize</span><span class="p">()</span>
        
        <span class="c1"># Compute embeddings in thread pool (CPU-bound)
</span>        <span class="n">loop</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">get_event_loop</span><span class="p">()</span>
        <span class="n">embeddings</span> <span class="o">=</span> <span class="k">await</span> <span class="n">loop</span><span class="p">.</span><span class="n">run_in_executor</span><span class="p">(</span>
            <span class="bp">None</span><span class="p">,</span>
            <span class="k">lambda</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">_model</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span>
                <span class="p">[</span><span class="n">cloud_response</span><span class="p">,</span> <span class="n">local_response</span><span class="p">],</span>
                <span class="n">convert_to_numpy</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
                <span class="n">normalize_embeddings</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>  <span class="c1"># Pre-normalize for dot product
</span>            <span class="p">)</span>
        <span class="p">)</span>
        
        <span class="c1"># Cosine similarity (embeddings are normalized, so dot product suffices)
</span>        <span class="n">similarity</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">embeddings</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">embeddings</span><span class="p">[</span><span class="mi">1</span><span class="p">]))</span>
        
        <span class="c1"># Clamp to [0, 1] and interpret
</span>        <span class="n">similarity</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="mf">0.0</span><span class="p">,</span> <span class="nb">min</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">similarity</span><span class="p">))</span>
        
        <span class="k">if</span> <span class="n">similarity</span> <span class="o">&gt;=</span> <span class="bp">self</span><span class="p">.</span><span class="n">THRESHOLDS</span><span class="p">[</span><span class="s">"high"</span><span class="p">]:</span>
            <span class="n">interpretation</span> <span class="o">=</span> <span class="s">"high"</span>
        <span class="k">elif</span> <span class="n">similarity</span> <span class="o">&gt;=</span> <span class="bp">self</span><span class="p">.</span><span class="n">THRESHOLDS</span><span class="p">[</span><span class="s">"medium"</span><span class="p">]:</span>
            <span class="n">interpretation</span> <span class="o">=</span> <span class="s">"medium"</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">interpretation</span> <span class="o">=</span> <span class="s">"low"</span>
        
        <span class="k">return</span> <span class="n">SimilarityResult</span><span class="p">(</span>
            <span class="n">similarity_score</span><span class="o">=</span><span class="n">similarity</span><span class="p">,</span>
            <span class="n">interpretation</span><span class="o">=</span><span class="n">interpretation</span><span class="p">,</span>
            <span class="c1"># ... other fields
</span>        <span class="p">)</span>
</code></pre></div></div>

<h3 id="design-decision-why-sentence-transformers">Design Decision: Why Sentence Transformers?</h3>

<p><strong>Trade-off considered:</strong></p>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>Speed</th>
      <th>Quality</th>
      <th>Dependencies</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Jaccard similarity</td>
      <td>&lt;1ms</td>
      <td>Poor (surface only)</td>
      <td>None</td>
    </tr>
    <tr>
      <td>TF-IDF + cosine</td>
      <td>~5ms</td>
      <td>Moderate</td>
      <td>sklearn</td>
    </tr>
    <tr>
      <td>Sentence transformers</td>
      <td>~50ms</td>
      <td>Excellent (semantic)</td>
      <td>torch, transformers</td>
    </tr>
  </tbody>
</table>

<p><strong>My choice:</strong> Sentence transformers with model tiers.</p>

<p><strong>Reasoning:</strong></p>
<ol>
  <li><strong>Semantic similarity matters:</strong> “The cat sat on the mat” and “A feline rested on the rug” should score high</li>
  <li><strong>Model flexibility:</strong> “fast” (80MB) for development, “accurate” (1.3GB) for production validation</li>
  <li><strong>Lazy loading:</strong> Model only loads when first shadow comparison runs</li>
  <li><strong>Thread pool execution:</strong> Embeddings computed off the event loop to avoid blocking</li>
</ol>
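<p>A detail worth calling out from the scorer above: <code class="language-plaintext highlighter-rouge">normalize_embeddings=True</code> is what makes the plain dot product valid — for unit vectors, cosine similarity and dot product coincide. A self-contained check with toy vectors (pure Python, no model download):</p>

```python
import math


def cosine(a, b):
    """Full cosine similarity: dot product over the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


a, b = [3.0, 4.0], [4.0, 3.0]
full = cosine(a, b)                        # cosine on raw vectors
an, bn = normalize(a), normalize(b)
fast = sum(x * y for x, y in zip(an, bn))  # dot product on unit vectors
assert abs(full - fast) < 1e-12            # the two agree exactly
```

<p>Since sentence-transformers can normalize at encode time, the per-comparison cost reduces to one dot product.</p>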

<h3 id="quality-match-threshold">Quality Match Threshold</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="nb">property</span>
<span class="k">def</span> <span class="nf">is_quality_match</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
    <span class="s">"""Check if responses are semantically similar enough."""</span>
    <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">similarity_score</span> <span class="o">&gt;=</span> <span class="mf">0.75</span>
</code></pre></div></div>

<p><strong>Why 0.75?</strong></p>

<table>
  <thead>
    <tr>
      <th>Threshold</th>
      <th>Meaning</th>
      <th>Risk</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0.90+</td>
      <td>Near-identical</td>
      <td>Too strict, almost never matches</td>
    </tr>
    <tr>
      <td>0.80-0.90</td>
      <td>Very similar</td>
      <td>Good for production validation</td>
    </tr>
    <tr>
      <td>0.70-0.80</td>
      <td>Similar meaning</td>
      <td>Acceptable for most use cases</td>
    </tr>
    <tr>
      <td>&lt;0.70</td>
      <td>Different responses</td>
      <td>Quality concern</td>
    </tr>
  </tbody>
</table>

<p>I chose 0.75 as a balance — strict enough to catch quality regressions, loose enough to tolerate stylistic differences between models.</p>
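<p>Putting the table and the 0.75 cut-off together, the mapping is a few lines. The thresholds mirror the <code class="language-plaintext highlighter-rouge">SimilarityScorer.THRESHOLDS</code> shown earlier; the function name is mine:</p>

```python
QUALITY_MATCH_THRESHOLD = 0.75  # sits between "medium" (0.70) and "high" (0.85)


def interpret(score: float) -> tuple[str, bool]:
    """Map a similarity score to (interpretation, is_quality_match)."""
    if score >= 0.85:
        label = "high"
    elif score >= 0.70:
        label = "medium"
    else:
        label = "low"
    return label, score >= QUALITY_MATCH_THRESHOLD


# A 0.72 reads as "medium" similarity but still fails the quality gate
assert interpret(0.72) == ("medium", False)
assert interpret(0.78) == ("medium", True)
assert interpret(0.90) == ("high", True)
```
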

<h3 id="internal-metrics-for-controller">Internal Metrics for Controller</h3>

<p>The ShadowRunner maintains internal metrics that the controller reads directly (no Prometheus dependency):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_metrics</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="s">"""Get shadow mode metrics for controller."""</span>
    <span class="n">quality_rate</span> <span class="o">=</span> <span class="p">(</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_quality_matches</span> <span class="o">/</span> <span class="bp">self</span><span class="p">.</span><span class="n">_successful_shadows</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">_successful_shadows</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="k">else</span> <span class="mf">0.0</span>
    <span class="p">)</span>
    
    <span class="k">return</span> <span class="p">{</span>
        <span class="s">"total_shadows"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">_total_shadows</span><span class="p">,</span>
        <span class="s">"successful_shadows"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">_successful_shadows</span><span class="p">,</span>
        <span class="s">"quality_matches"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">_quality_matches</span><span class="p">,</span>
        <span class="s">"quality_match_rate"</span><span class="p">:</span> <span class="nb">round</span><span class="p">(</span><span class="n">quality_rate</span><span class="p">,</span> <span class="mi">4</span><span class="p">),</span>
        <span class="s">"total_cost_savings_usd"</span><span class="p">:</span> <span class="nb">round</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_total_cost_savings</span><span class="p">,</span> <span class="mi">4</span><span class="p">),</span>
        <span class="s">"pending_tasks"</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_pending_tasks</span><span class="p">),</span>
        <span class="s">"stored_results"</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_results</span><span class="p">),</span>
    <span class="p">}</span>
</code></pre></div></div>

<hr />

<h2 id="component-6-closed-loop-controller">Component 6: Closed-Loop Controller</h2>

<h3 id="the-problem-5">The Problem</h3>

<p>Shadow mode generates data. But who acts on it?</p>

<p>Manual review doesn’t scale. I wanted a system that could observe patterns and generate routing recommendations automatically.</p>

<h3 id="architecture">Architecture</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────────────────────────────────────────────────────────────┐
│                    Closed-Loop Controller                        │
│                                                                  │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐          │
│  │   Metrics   │    │    Rule     │    │ Recommend-  │          │
│  │   Reader    │───▶│   Engine    │───▶│   ations    │          │
│  │             │    │             │    │             │          │
│  │ • Read from │    │ • Evaluate  │    │ • Generate  │          │
│  │   Shadow    │    │   per-tier  │    │ • Log       │          │
│  │   Runner    │    │ • Threshold │    │ • (Act)*    │          │
│  │   directly  │    │ • Drift     │    │             │          │
│  └─────────────┘    └─────────────┘    └─────────────┘          │
│                                                                  │
│  *Act is disabled in "observe" mode (v1)                         │
│  No Prometheus dependency — pure Python state access             │
└──────────────────────────────────────────────────────────────────┘
</code></pre></div></div>

<p><strong>Key design point:</strong> The controller reads metrics directly from <code class="language-plaintext highlighter-rouge">ShadowRunner._results</code>, NOT from Prometheus. This eliminates an external dependency and simplifies deployment.</p>

<h3 id="metrics-reader-direct-state-access">Metrics Reader: Direct State Access</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">MetricsReader</span><span class="p">:</span>
    <span class="s">"""Reads and aggregates metrics from ShadowRunner internal state.
    
    No Prometheus dependency — pure Python state access.
    Uses timestamped filtering for rolling window calculations.
    """</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">window_seconds</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">300</span><span class="p">):</span>
        <span class="s">"""Initialize metrics reader.
        
        Args:
            window_seconds: Rolling window size in seconds (default: 5 min)
        """</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_window_seconds</span> <span class="o">=</span> <span class="n">window_seconds</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_shadow_runner</span><span class="p">:</span> <span class="n">ShadowRunner</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_previous_tier_metrics</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">int</span><span class="p">,</span> <span class="n">TierMetrics</span><span class="p">]</span> <span class="o">=</span> <span class="p">{}</span>
    
    <span class="k">def</span> <span class="nf">set_shadow_runner</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">runner</span><span class="p">:</span> <span class="n">ShadowRunner</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="s">"""Set the shadow runner to read metrics from."""</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_shadow_runner</span> <span class="o">=</span> <span class="n">runner</span>
    
    <span class="k">def</span> <span class="nf">get_tier_metrics</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">tier</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
        <span class="n">window_seconds</span><span class="p">:</span> <span class="nb">int</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
        <span class="n">quality_threshold</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.85</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">TierMetrics</span><span class="p">:</span>
        <span class="s">"""Get aggregated metrics for a specific tier."""</span>
        <span class="n">window</span> <span class="o">=</span> <span class="n">window_seconds</span> <span class="ow">or</span> <span class="bp">self</span><span class="p">.</span><span class="n">_window_seconds</span>
        <span class="n">cutoff</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">(</span><span class="n">timezone</span><span class="p">.</span><span class="n">utc</span><span class="p">)</span> <span class="o">-</span> <span class="n">timedelta</span><span class="p">(</span><span class="n">seconds</span><span class="o">=</span><span class="n">window</span><span class="p">)</span>
        
        <span class="c1"># Read directly from shadow runner's internal results
</span>        <span class="n">samples</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_collect_samples</span><span class="p">(</span><span class="n">tier</span><span class="p">,</span> <span class="n">cutoff</span><span class="p">)</span>
        
        <span class="k">if</span> <span class="ow">not</span> <span class="n">samples</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">TierMetrics</span><span class="p">(</span><span class="n">tier</span><span class="o">=</span><span class="n">tier</span><span class="p">)</span>
        
        <span class="c1"># Calculate statistics
</span>        <span class="n">similarities</span> <span class="o">=</span> <span class="p">[</span><span class="n">s</span><span class="p">.</span><span class="n">similarity_score</span> <span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="n">samples</span><span class="p">]</span>
        
        <span class="k">return</span> <span class="n">TierMetrics</span><span class="p">(</span>
            <span class="n">tier</span><span class="o">=</span><span class="n">tier</span><span class="p">,</span>
            <span class="n">sample_count</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">samples</span><span class="p">),</span>
            <span class="n">avg_similarity</span><span class="o">=</span><span class="n">mean</span><span class="p">(</span><span class="n">similarities</span><span class="p">),</span>
            <span class="n">min_similarity</span><span class="o">=</span><span class="nb">min</span><span class="p">(</span><span class="n">similarities</span><span class="p">),</span>
            <span class="n">max_similarity</span><span class="o">=</span><span class="nb">max</span><span class="p">(</span><span class="n">similarities</span><span class="p">),</span>
            <span class="n">p50_similarity</span><span class="o">=</span><span class="n">median</span><span class="p">(</span><span class="n">similarities</span><span class="p">),</span>
            <span class="n">quality_match_rate</span><span class="o">=</span><span class="nb">len</span><span class="p">([</span><span class="n">s</span> <span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="n">samples</span> <span class="k">if</span> <span class="n">s</span><span class="p">.</span><span class="n">similarity_score</span> <span class="o">&gt;=</span> <span class="n">quality_threshold</span><span class="p">])</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">samples</span><span class="p">),</span>
            <span class="c1"># ... other fields
</span>        <span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">_collect_samples</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">tier</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">cutoff</span><span class="p">:</span> <span class="n">datetime</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="n">MetricsSample</span><span class="p">]:</span>
        <span class="s">"""Collect samples by reading ShadowRunner._results directly."""</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="p">.</span><span class="n">_shadow_runner</span><span class="p">:</span>
            <span class="k">return</span> <span class="p">[]</span>
        
        <span class="n">samples</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="n">results</span> <span class="o">=</span> <span class="nb">getattr</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_shadow_runner</span><span class="p">,</span> <span class="s">"_results"</span><span class="p">,</span> <span class="p">[])</span>
        
        <span class="k">for</span> <span class="n">result</span> <span class="ow">in</span> <span class="n">results</span><span class="p">:</span>
            <span class="n">result_time</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_parse_timestamp</span><span class="p">(</span><span class="n">result</span><span class="p">.</span><span class="n">timestamp</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">result_time</span> <span class="ow">and</span> <span class="n">result_time</span> <span class="o">&gt;=</span> <span class="n">cutoff</span> <span class="ow">and</span> <span class="n">result</span><span class="p">.</span><span class="n">privacy_tier</span> <span class="o">==</span> <span class="n">tier</span><span class="p">:</span>
                <span class="n">samples</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">MetricsSample</span><span class="p">(</span>
                    <span class="n">tier</span><span class="o">=</span><span class="n">result</span><span class="p">.</span><span class="n">privacy_tier</span><span class="p">,</span>
                    <span class="n">similarity_score</span><span class="o">=</span><span class="n">result</span><span class="p">.</span><span class="n">similarity</span><span class="p">.</span><span class="n">similarity_score</span> <span class="k">if</span> <span class="n">result</span><span class="p">.</span><span class="n">similarity</span> <span class="k">else</span> <span class="mf">0.0</span><span class="p">,</span>
                    <span class="n">latency_diff_ms</span><span class="o">=</span><span class="n">result</span><span class="p">.</span><span class="n">latency_diff_ms</span><span class="p">,</span>
                    <span class="n">cost_savings_usd</span><span class="o">=</span><span class="n">result</span><span class="p">.</span><span class="n">cost_savings_usd</span><span class="p">,</span>
                    <span class="n">is_quality_match</span><span class="o">=</span><span class="n">result</span><span class="p">.</span><span class="n">is_quality_match</span><span class="p">,</span>
                    <span class="n">timestamp</span><span class="o">=</span><span class="n">result_time</span><span class="p">,</span>
                <span class="p">))</span>
        
        <span class="k">return</span> <span class="n">samples</span>
</code></pre></div></div>
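<p>As a sanity check, the rolling-window cutoff at the heart of <code class="language-plaintext highlighter-rouge">get_tier_metrics</code> can be sketched in isolation. The <code class="language-plaintext highlighter-rouge">in_window</code> helper below is hypothetical, not part of the codebase; it only mirrors the <code class="language-plaintext highlighter-rouge">cutoff</code> comparison:</p>

```python
from datetime import datetime, timedelta, timezone

def in_window(timestamps, window_seconds=300, now=None):
    """Keep only timestamps inside the rolling window (sketch of the cutoff logic)."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(seconds=window_seconds)
    return [t for t in timestamps if t >= cutoff]

now = datetime(2026, 4, 19, 12, 0, tzinfo=timezone.utc)
stamps = [now - timedelta(seconds=s) for s in (10, 200, 400)]
print(len(in_window(stamps, 300, now)))  # 2 — the 400-second-old sample falls outside
```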

<h3 id="design-decision-why-not-prometheus">Design Decision: Why Not Prometheus?</h3>

<p><strong>Trade-off considered:</strong></p>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>Pros</th>
      <th>Cons</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Query Prometheus</td>
      <td>Industry standard, dashboards built-in</td>
      <td>External dependency, network latency</td>
    </tr>
    <tr>
      <td>Read ShadowRunner directly</td>
      <td>Zero dependencies, faster, simpler</td>
      <td>Tighter coupling, no persistence</td>
    </tr>
  </tbody>
</table>

<p><strong>My choice:</strong> Direct state access.</p>

<p><strong>Reasoning:</strong></p>
<ol>
  <li><strong>Simplicity:</strong> Controller runs in same process as ShadowRunner — no network hop</li>
  <li><strong>Deployment:</strong> One less service to configure and maintain</li>
  <li><strong>Latency:</strong> Sub-millisecond access vs. Prometheus query latency</li>
  <li><strong>Portfolio scope:</strong> Demonstrating the pattern is sufficient; production might add Prometheus for persistence</li>
</ol>
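<p>The same-process coupling is easy to see in miniature. This sketch uses <code class="language-plaintext highlighter-rouge">SimpleNamespace</code> stand-ins for the runner and its result objects; only the <code class="language-plaintext highlighter-rouge">getattr</code> read mirrors the real code:</p>

```python
from types import SimpleNamespace

# Hypothetical stand-in for ShadowRunner: only the _results attribute matters here.
runner = SimpleNamespace(_results=[
    SimpleNamespace(privacy_tier=0, similarity_score=0.91),
    SimpleNamespace(privacy_tier=1, similarity_score=0.83),
    SimpleNamespace(privacy_tier=0, similarity_score=0.88),
])

# Same-process read: a plain attribute access, no network hop, no query language.
results = getattr(runner, "_results", [])
tier0 = [r.similarity_score for r in results if r.privacy_tier == 0]
print(len(tier0), round(sum(tier0) / len(tier0), 3))  # 2 0.895
```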

<h3 id="rule-engine">Rule Engine</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">RecommendationType</span><span class="p">(</span><span class="n">Enum</span><span class="p">):</span>
    <span class="s">"""Types of routing recommendations."""</span>
    
    <span class="n">ROUTE_TO_LOCAL</span> <span class="o">=</span> <span class="s">"route_to_local"</span>       <span class="c1"># Quality good, save money
</span>    <span class="n">KEEP_ON_CLOUD</span> <span class="o">=</span> <span class="s">"keep_on_cloud"</span>         <span class="c1"># Quality insufficient
</span>    <span class="n">DRIFT_ALERT</span> <span class="o">=</span> <span class="s">"drift_alert"</span>             <span class="c1"># Quality degraded from previous
</span>    <span class="n">INSUFFICIENT_DATA</span> <span class="o">=</span> <span class="s">"insufficient_data"</span>  <span class="c1"># Need more samples
</span>    <span class="n">NO_CHANGE</span> <span class="o">=</span> <span class="s">"no_change"</span>                  <span class="c1"># Current config is optimal
</span>

<span class="k">class</span> <span class="nc">Confidence</span><span class="p">(</span><span class="n">Enum</span><span class="p">):</span>
    <span class="s">"""Confidence level for recommendations."""</span>
    
    <span class="n">HIGH</span> <span class="o">=</span> <span class="s">"high"</span>      <span class="c1"># &gt;500 samples, stable metrics
</span>    <span class="n">MEDIUM</span> <span class="o">=</span> <span class="s">"medium"</span>  <span class="c1"># 100-500 samples
</span>    <span class="n">LOW</span> <span class="o">=</span> <span class="s">"low"</span>        <span class="c1"># &lt;100 samples
</span>

<span class="k">class</span> <span class="nc">RuleEngine</span><span class="p">:</span>
    <span class="s">"""Evaluates metrics against rules to generate recommendations."""</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">config</span><span class="p">:</span> <span class="n">ControllerConfig</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_config</span> <span class="o">=</span> <span class="n">config</span>
    
    <span class="k">def</span> <span class="nf">evaluate</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">current</span><span class="p">:</span> <span class="n">TierMetrics</span><span class="p">,</span>
        <span class="n">previous</span><span class="p">:</span> <span class="n">TierMetrics</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Recommendation</span><span class="p">:</span>
        <span class="s">"""Evaluate metrics for a tier and generate recommendation."""</span>
        
        <span class="n">ctx</span> <span class="o">=</span> <span class="n">RuleContext</span><span class="p">(</span>
            <span class="n">current_metrics</span><span class="o">=</span><span class="n">current</span><span class="p">,</span>
            <span class="n">previous_metrics</span><span class="o">=</span><span class="n">previous</span><span class="p">,</span>
            <span class="n">config</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">_config</span><span class="p">,</span>
            <span class="n">tier</span><span class="o">=</span><span class="n">current</span><span class="p">.</span><span class="n">tier</span><span class="p">,</span>
        <span class="p">)</span>
        
        <span class="c1"># Rule priority: check in order
</span>        
        <span class="c1"># 1. Insufficient data
</span>        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">_check_insufficient_data</span><span class="p">(</span><span class="n">ctx</span><span class="p">):</span>
            <span class="n">min_samples</span> <span class="o">=</span> <span class="n">ctx</span><span class="p">.</span><span class="n">threshold_config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"min_samples"</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span>
            <span class="k">return</span> <span class="n">Recommendation</span><span class="p">(</span>
                <span class="n">tier</span><span class="o">=</span><span class="n">ctx</span><span class="p">.</span><span class="n">tier</span><span class="p">,</span>
                <span class="n">recommendation</span><span class="o">=</span><span class="n">RecommendationType</span><span class="p">.</span><span class="n">INSUFFICIENT_DATA</span><span class="p">,</span>
                <span class="n">reason</span><span class="o">=</span><span class="sa">f</span><span class="s">"Only </span><span class="si">{</span><span class="n">current</span><span class="p">.</span><span class="n">sample_count</span><span class="si">}</span><span class="s"> samples, need </span><span class="si">{</span><span class="n">min_samples</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
                <span class="n">confidence</span><span class="o">=</span><span class="n">Confidence</span><span class="p">.</span><span class="n">LOW</span><span class="p">,</span>
            <span class="p">)</span>
        
        <span class="c1"># 2. Drift alert (quality degradation)
</span>        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">_check_drift</span><span class="p">(</span><span class="n">ctx</span><span class="p">):</span>
            <span class="n">delta</span> <span class="o">=</span> <span class="n">previous</span><span class="p">.</span><span class="n">avg_similarity</span> <span class="o">-</span> <span class="n">current</span><span class="p">.</span><span class="n">avg_similarity</span>
            <span class="k">return</span> <span class="n">Recommendation</span><span class="p">(</span>
                <span class="n">tier</span><span class="o">=</span><span class="n">ctx</span><span class="p">.</span><span class="n">tier</span><span class="p">,</span>
                <span class="n">recommendation</span><span class="o">=</span><span class="n">RecommendationType</span><span class="p">.</span><span class="n">DRIFT_ALERT</span><span class="p">,</span>
                <span class="n">reason</span><span class="o">=</span><span class="sa">f</span><span class="s">"Quality degraded by </span><span class="si">{</span><span class="n">delta</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="o">%</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
                <span class="n">confidence</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">_get_confidence</span><span class="p">(</span><span class="n">current</span><span class="p">.</span><span class="n">sample_count</span><span class="p">),</span>
                <span class="n">previous_similarity</span><span class="o">=</span><span class="n">previous</span><span class="p">.</span><span class="n">avg_similarity</span><span class="p">,</span>
                <span class="n">similarity_delta</span><span class="o">=-</span><span class="n">delta</span><span class="p">,</span>
            <span class="p">)</span>
        
        <span class="c1"># 3. Route to local (quality good + cost savings)
</span>        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">_check_route_to_local</span><span class="p">(</span><span class="n">ctx</span><span class="p">):</span>
            <span class="k">return</span> <span class="n">Recommendation</span><span class="p">(</span>
                <span class="n">tier</span><span class="o">=</span><span class="n">ctx</span><span class="p">.</span><span class="n">tier</span><span class="p">,</span>
                <span class="n">recommendation</span><span class="o">=</span><span class="n">RecommendationType</span><span class="p">.</span><span class="n">ROUTE_TO_LOCAL</span><span class="p">,</span>
                <span class="n">reason</span><span class="o">=</span><span class="sa">f</span><span class="s">"Similarity </span><span class="si">{</span><span class="n">current</span><span class="p">.</span><span class="n">avg_similarity</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="o">%</span><span class="si">}</span><span class="s"> exceeds threshold"</span><span class="p">,</span>
                <span class="n">confidence</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">_get_confidence</span><span class="p">(</span><span class="n">current</span><span class="p">.</span><span class="n">sample_count</span><span class="p">),</span>
                <span class="n">potential_savings_usd</span><span class="o">=</span><span class="n">current</span><span class="p">.</span><span class="n">total_cost_savings_usd</span><span class="p">,</span>
            <span class="p">)</span>
        
        <span class="c1"># 4. Keep on cloud (quality insufficient)
</span>        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">_check_keep_on_cloud</span><span class="p">(</span><span class="n">ctx</span><span class="p">):</span>
            <span class="n">min_similarity</span> <span class="o">=</span> <span class="n">ctx</span><span class="p">.</span><span class="n">threshold_config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"min_similarity"</span><span class="p">,</span> <span class="mf">0.85</span><span class="p">)</span>
            <span class="n">gap</span> <span class="o">=</span> <span class="n">min_similarity</span> <span class="o">-</span> <span class="n">current</span><span class="p">.</span><span class="n">avg_similarity</span>
            <span class="k">return</span> <span class="n">Recommendation</span><span class="p">(</span>
                <span class="n">tier</span><span class="o">=</span><span class="n">ctx</span><span class="p">.</span><span class="n">tier</span><span class="p">,</span>
                <span class="n">recommendation</span><span class="o">=</span><span class="n">RecommendationType</span><span class="p">.</span><span class="n">KEEP_ON_CLOUD</span><span class="p">,</span>
                <span class="n">reason</span><span class="o">=</span><span class="sa">f</span><span class="s">"Similarity </span><span class="si">{</span><span class="n">current</span><span class="p">.</span><span class="n">avg_similarity</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="o">%</span><span class="si">}</span><span class="s"> is </span><span class="si">{</span><span class="n">gap</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="o">%</span><span class="si">}</span><span class="s"> below threshold"</span><span class="p">,</span>
                <span class="n">confidence</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">_get_confidence</span><span class="p">(</span><span class="n">current</span><span class="p">.</span><span class="n">sample_count</span><span class="p">),</span>
            <span class="p">)</span>
        
        <span class="c1"># 5. No change needed
</span>        <span class="k">return</span> <span class="n">Recommendation</span><span class="p">(</span>
            <span class="n">tier</span><span class="o">=</span><span class="n">ctx</span><span class="p">.</span><span class="n">tier</span><span class="p">,</span>
            <span class="n">recommendation</span><span class="o">=</span><span class="n">RecommendationType</span><span class="p">.</span><span class="n">NO_CHANGE</span><span class="p">,</span>
            <span class="n">reason</span><span class="o">=</span><span class="s">"Current configuration is optimal"</span><span class="p">,</span>
            <span class="n">confidence</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">_get_confidence</span><span class="p">(</span><span class="n">current</span><span class="p">.</span><span class="n">sample_count</span><span class="p">),</span>
        <span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">_get_confidence</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">sample_count</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Confidence</span><span class="p">:</span>
        <span class="s">"""Determine confidence level based on sample count."""</span>
        <span class="k">if</span> <span class="n">sample_count</span> <span class="o">&gt;=</span> <span class="mi">500</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">Confidence</span><span class="p">.</span><span class="n">HIGH</span>
        <span class="k">elif</span> <span class="n">sample_count</span> <span class="o">&gt;=</span> <span class="mi">100</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">Confidence</span><span class="p">.</span><span class="n">MEDIUM</span>
        <span class="k">return</span> <span class="n">Confidence</span><span class="p">.</span><span class="n">LOW</span>
</code></pre></div></div>
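<p>The rule priority condenses into a stand-alone sketch. <code class="language-plaintext highlighter-rouge">Metrics</code> and <code class="language-plaintext highlighter-rouge">recommend</code> are trimmed stand-ins, and the cost-savings condition that also gates <code class="language-plaintext highlighter-rouge">ROUTE_TO_LOCAL</code> in the real engine is omitted:</p>

```python
from dataclasses import dataclass

@dataclass
class Metrics:  # trimmed stand-in for TierMetrics
    sample_count: int
    avg_similarity: float

def recommend(current, previous=None, min_samples=100,
              min_similarity=0.85, drift_threshold=0.10):
    """Rule cascade in the same priority order as RuleEngine.evaluate (sketch)."""
    if not (current.sample_count >= min_samples):
        return "insufficient_data"
    if previous and previous.avg_similarity - current.avg_similarity >= drift_threshold:
        return "drift_alert"
    if current.avg_similarity >= min_similarity:
        return "route_to_local"
    return "keep_on_cloud"

print(recommend(Metrics(250, 0.91)))                      # route_to_local
print(recommend(Metrics(250, 0.78), Metrics(250, 0.92)))  # drift_alert
print(recommend(Metrics(40, 0.95)))                       # insufficient_data
```

<p>Note the ordering matters: a drift alert fires even when absolute quality is still above the routing threshold, because a 10-point drop is a leading indicator worth surfacing before it becomes a routing flip.</p>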

<h3 id="controller-configuration">Controller Configuration</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">dataclass</span> 
<span class="k">class</span> <span class="nc">ControllerConfig</span><span class="p">:</span>
    <span class="s">"""Configuration for the closed-loop controller."""</span>
    
    <span class="n">enabled</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">False</span>
    <span class="n">mode</span><span class="p">:</span> <span class="n">Literal</span><span class="p">[</span><span class="s">"observe"</span><span class="p">,</span> <span class="s">"auto"</span><span class="p">]</span> <span class="o">=</span> <span class="s">"observe"</span>
    <span class="n">evaluation_interval_seconds</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">60</span>
    <span class="n">window_seconds</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">300</span>  <span class="c1"># 5 minute rolling window
</span>    
    <span class="c1"># Per-tier thresholds (different tiers can have different quality bars)
</span>    <span class="n">tier_thresholds</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">int</span><span class="p">,</span> <span class="nb">dict</span><span class="p">]</span> <span class="o">=</span> <span class="n">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="k">lambda</span><span class="p">:</span> <span class="p">{</span>
        <span class="mi">0</span><span class="p">:</span> <span class="p">{</span><span class="s">"min_similarity"</span><span class="p">:</span> <span class="mf">0.85</span><span class="p">,</span> <span class="s">"min_samples"</span><span class="p">:</span> <span class="mi">100</span><span class="p">},</span>
        <span class="mi">1</span><span class="p">:</span> <span class="p">{</span><span class="s">"min_similarity"</span><span class="p">:</span> <span class="mf">0.80</span><span class="p">,</span> <span class="s">"min_samples"</span><span class="p">:</span> <span class="mi">100</span><span class="p">},</span>
    <span class="p">})</span>
    
    <span class="c1"># Alert thresholds
</span>    <span class="n">drift_threshold</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.10</span>  <span class="c1"># Alert if similarity drops by 10%
</span>    <span class="n">cost_savings_threshold_usd</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">50.0</span>  <span class="c1"># Min savings to recommend
</span></code></pre></div></div>
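<p>Overriding the per-tier defaults looks like this. <code class="language-plaintext highlighter-rouge">Config</code> is a trimmed stand-in for <code class="language-plaintext highlighter-rouge">ControllerConfig</code>, and the explicit fallback lookup for unconfigured tiers is an illustrative pattern, not behavior shown above:</p>

```python
from dataclasses import dataclass, field

@dataclass
class Config:  # trimmed stand-in for ControllerConfig
    tier_thresholds: dict = field(default_factory=lambda: {
        0: {"min_similarity": 0.85, "min_samples": 100},
        1: {"min_similarity": 0.80, "min_samples": 100},
    })
    drift_threshold: float = 0.10

# Tighten tier 0 only; note the default_factory is replaced wholesale,
# so tier 1 must come from an explicit fallback.
cfg = Config(tier_thresholds={0: {"min_similarity": 0.95, "min_samples": 500}})
defaults = {"min_similarity": 0.85, "min_samples": 100}
tier1 = cfg.tier_thresholds.get(1, defaults)  # hypothetical fallback pattern
print(cfg.tier_thresholds[0]["min_samples"], tier1["min_similarity"])  # 500 0.85
```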

<h3 id="controller-lifecycle">Controller Lifecycle</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ClosedLoopController</span><span class="p">:</span>
    <span class="s">"""Closed-loop controller for routing optimization.
    
    Runs as asyncio background task within FastAPI lifecycle.
    """</span>
    
    <span class="k">async</span> <span class="k">def</span> <span class="nf">start</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="s">"""Start the controller background task."""</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="p">.</span><span class="n">_config</span><span class="p">.</span><span class="n">enabled</span><span class="p">:</span>
            <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Controller disabled, not starting"</span><span class="p">)</span>
            <span class="k">return</span>
        
        <span class="bp">self</span><span class="p">.</span><span class="n">_running</span> <span class="o">=</span> <span class="bp">True</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_task</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">create_task</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_run_loop</span><span class="p">())</span>
    
    <span class="k">async</span> <span class="k">def</span> <span class="nf">_run_loop</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="s">"""Main controller loop — runs evaluation at configured interval."""</span>
        <span class="k">while</span> <span class="bp">self</span><span class="p">.</span><span class="n">_running</span><span class="p">:</span>
            <span class="k">try</span><span class="p">:</span>
                <span class="k">await</span> <span class="bp">self</span><span class="p">.</span><span class="n">_evaluate</span><span class="p">()</span>
            <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
                <span class="n">logger</span><span class="p">.</span><span class="n">error</span><span class="p">(</span><span class="s">"Controller evaluation failed"</span><span class="p">,</span> <span class="n">error</span><span class="o">=</span><span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">))</span>
            
            <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_config</span><span class="p">.</span><span class="n">evaluation_interval_seconds</span><span class="p">)</span>
    
    <span class="k">async</span> <span class="k">def</span> <span class="nf">_evaluate</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="s">"""Run a single evaluation cycle."""</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_total_evaluations</span> <span class="o">+=</span> <span class="mi">1</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_last_evaluation</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">.</span><span class="n">utcnow</span><span class="p">()</span>
        
        <span class="c1"># Collect metrics for all tiers
</span>        <span class="n">tier_metrics</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_metrics_reader</span><span class="p">.</span><span class="n">get_all_tier_metrics</span><span class="p">()</span>
        
        <span class="k">if</span> <span class="ow">not</span> <span class="n">tier_metrics</span><span class="p">:</span>
            <span class="n">logger</span><span class="p">.</span><span class="n">debug</span><span class="p">(</span><span class="s">"No shadow metrics available for evaluation"</span><span class="p">)</span>
            <span class="k">return</span>
        
        <span class="c1"># Evaluate each tier
</span>        <span class="k">for</span> <span class="n">tier</span><span class="p">,</span> <span class="n">metrics</span> <span class="ow">in</span> <span class="n">tier_metrics</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
            <span class="n">previous</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_metrics_reader</span><span class="p">.</span><span class="n">get_previous_metrics</span><span class="p">(</span><span class="n">tier</span><span class="p">)</span>
            <span class="n">recommendation</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_rule_engine</span><span class="p">.</span><span class="n">evaluate</span><span class="p">(</span><span class="n">metrics</span><span class="p">,</span> <span class="n">previous</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_current_recommendations</span><span class="p">[</span><span class="n">tier</span><span class="p">]</span> <span class="o">=</span> <span class="n">recommendation</span>  <span class="c1"># keep latest for force_evaluate()</span>
            
            <span class="c1"># Log recommendation (structured for Loki)
</span>            <span class="bp">self</span><span class="p">.</span><span class="n">_log_recommendation</span><span class="p">(</span><span class="n">recommendation</span><span class="p">,</span> <span class="n">metrics</span><span class="p">)</span>
        
        <span class="c1"># Store current metrics for next evaluation's drift detection
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">_metrics_reader</span><span class="p">.</span><span class="n">store_current_as_previous</span><span class="p">(</span><span class="n">tier_metrics</span><span class="p">)</span>
    
    <span class="k">async</span> <span class="k">def</span> <span class="nf">force_evaluate</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
        <span class="s">"""Force an immediate evaluation (for testing/debugging)."""</span>
        <span class="k">await</span> <span class="bp">self</span><span class="p">.</span><span class="n">_evaluate</span><span class="p">()</span>
        <span class="k">return</span> <span class="p">{</span>
            <span class="s">"evaluation_number"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">_total_evaluations</span><span class="p">,</span>
            <span class="s">"recommendations"</span><span class="p">:</span> <span class="p">{</span>
                <span class="n">tier</span><span class="p">:</span> <span class="n">rec</span><span class="p">.</span><span class="n">to_dict</span><span class="p">()</span> 
                <span class="k">for</span> <span class="n">tier</span><span class="p">,</span> <span class="n">rec</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">_current_recommendations</span><span class="p">.</span><span class="n">items</span><span class="p">()</span>
            <span class="p">},</span>
        <span class="p">}</span>
</code></pre></div></div>
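<p>One thing the excerpt above leaves out is shutdown: a task created with <code class="language-plaintext highlighter-rouge">asyncio.create_task</code> keeps sleeping through FastAPI teardown unless it is cancelled. Here is a minimal, self-contained sketch of the start/stop lifecycle (hypothetical names; the controller's actual <code class="language-plaintext highlighter-rouge">stop()</code> isn't shown in this post):</p>

```python
import asyncio


class LoopTask:
    """Minimal start/stop lifecycle sketch (hypothetical, not the real controller)."""

    def __init__(self, interval: float = 0.01) -> None:
        self._interval = interval
        self._running = False
        self._task: asyncio.Task | None = None
        self.evaluations = 0

    async def start(self) -> None:
        self._running = True
        self._task = asyncio.create_task(self._run_loop())

    async def stop(self) -> None:
        # Flip the flag first, then cancel the pending sleep so shutdown is prompt.
        self._running = False
        if self._task is not None:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass

    async def _run_loop(self) -> None:
        while self._running:
            self.evaluations += 1  # stands in for _evaluate()
            await asyncio.sleep(self._interval)


async def main() -> int:
    controller = LoopTask()
    await controller.start()
    await asyncio.sleep(0.05)  # let a few cycles run
    await controller.stop()
    return controller.evaluations


evaluations = asyncio.run(main())
```

<p>Cancelling the task rather than just clearing the flag matters: with a long evaluation interval, a flag-only stop would block teardown until the next wakeup.</p>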

<h3 id="design-decision-observe-vs-auto-mode">Design Decision: Observe vs Auto Mode</h3>

<p><strong>Trade-off considered:</strong></p>

<table>
  <thead>
    <tr>
      <th>Mode</th>
      <th>Behavior</th>
      <th>Risk</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Observe</td>
      <td>Log recommendations only</td>
      <td>None (human reviews)</td>
    </tr>
    <tr>
      <td>Auto</td>
      <td>Adjust routing weights automatically</td>
      <td>Runaway feedback loops</td>
    </tr>
  </tbody>
</table>

<p><strong>My choice:</strong> Observe mode only for v1.</p>

<p><strong>Reasoning:</strong> Auto-routing is powerful but dangerous. A bug in similarity computation could cause the controller to route all traffic to local, degrading user experience. For a portfolio project, demonstrating the architecture is sufficient. Production deployment would require:</p>
<ol>
  <li>Extensive testing of the feedback loop</li>
  <li>Circuit breakers to prevent runaway changes</li>
  <li>Gradual rollout (route 1% to local, measure, increase)</li>
  <li>Human approval gates for significant changes</li>
</ol>
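<p>Items 2 and 3 are straightforward to make concrete: before any recommendation touched live routing weights, I would clamp it to a small bounded step with a hard ceiling, so even a pathological similarity bug can only nudge traffic. A hypothetical guard (not part of the v1 code, which only logs recommendations):</p>

```python
def propose_weight_change(
    current_local_weight: float,
    recommended_local_weight: float,
    max_step: float = 0.01,  # at most one point of traffic per evaluation cycle
    ceiling: float = 0.25,   # hard cap until quality evidence accumulates
) -> float:
    """Clamp a controller recommendation to a bounded, capped step.

    Hypothetical circuit-breaker sketch: turns an arbitrary recommendation
    into a small move toward it, never past the ceiling.
    """
    delta = recommended_local_weight - current_local_weight
    step = max(-max_step, min(max_step, delta))  # bound the per-cycle change
    proposed = current_local_weight + step
    return max(0.0, min(ceiling, proposed))      # enforce absolute floor and cap


# Even a buggy "route everything locally" recommendation moves one step:
print(round(propose_weight_change(0.05, 1.0), 2))  # 0.06
print(round(propose_weight_change(0.24, 1.0), 2))  # 0.25 (ceiling)
```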

<hr />

<h2 id="component-7-observability-stack">Component 7: Observability Stack</h2>

<h3 id="architecture-three-pillars">Architecture: Three Pillars</h3>

<p>The gateway implements all three observability pillars:</p>

<table>
  <thead>
    <tr>
      <th>Pillar</th>
      <th>Library</th>
      <th>Export Target</th>
      <th>Purpose</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Metrics</strong></td>
      <td><code class="language-plaintext highlighter-rouge">prometheus_client</code></td>
      <td>Prometheus</td>
      <td>Counters, histograms, gauges</td>
    </tr>
    <tr>
      <td><strong>Traces</strong></td>
      <td>OpenTelemetry</td>
      <td>Tempo (OTLP)</td>
      <td>Request flow, latency breakdown</td>
    </tr>
    <tr>
      <td><strong>Logs</strong></td>
      <td><code class="language-plaintext highlighter-rouge">structlog</code></td>
      <td>Loki (JSON)</td>
      <td>Events, debugging</td>
    </tr>
  </tbody>
</table>

<p><strong>Note:</strong> Metrics use <code class="language-plaintext highlighter-rouge">prometheus_client</code> (Prometheus-native), NOT OpenTelemetry. This is intentional — Prometheus client is more mature and Grafana dashboards work seamlessly with it.</p>

<h3 id="metrics-prometheus-native">Metrics: Prometheus-Native</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">prometheus_client</span> <span class="kn">import</span> <span class="n">Counter</span><span class="p">,</span> <span class="n">Histogram</span><span class="p">,</span> <span class="n">Gauge</span><span class="p">,</span> <span class="n">Info</span>

<span class="c1"># =============================================================================
# Request Metrics
# =============================================================================
</span><span class="n">REQUESTS_TOTAL</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">(</span>
    <span class="s">"sentinel_requests_total"</span><span class="p">,</span>
    <span class="s">"Total number of inference requests"</span><span class="p">,</span>
    <span class="n">labelnames</span><span class="o">=</span><span class="p">[</span><span class="s">"route"</span><span class="p">,</span> <span class="s">"backend"</span><span class="p">,</span> <span class="s">"endpoint"</span><span class="p">,</span> <span class="s">"model"</span><span class="p">,</span> <span class="s">"tier"</span><span class="p">,</span> <span class="s">"status"</span><span class="p">]</span>
<span class="p">)</span>

<span class="n">REQUESTS_IN_PROGRESS</span> <span class="o">=</span> <span class="n">Gauge</span><span class="p">(</span>
    <span class="s">"sentinel_requests_in_progress"</span><span class="p">,</span>
    <span class="s">"Number of requests currently being processed"</span><span class="p">,</span>
    <span class="n">labelnames</span><span class="o">=</span><span class="p">[</span><span class="s">"backend"</span><span class="p">]</span>
<span class="p">)</span>

<span class="c1"># =============================================================================
# Latency Metrics (LLM-specific histograms)
# =============================================================================
</span><span class="n">LATENCY_BUCKETS_SEC</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0.01</span><span class="p">,</span> <span class="mf">0.025</span><span class="p">,</span> <span class="mf">0.05</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.5</span><span class="p">,</span> <span class="mf">5.0</span><span class="p">,</span> <span class="mf">10.0</span><span class="p">,</span> <span class="mf">30.0</span><span class="p">,</span> <span class="mf">60.0</span><span class="p">]</span>

<span class="n">TTFT_SECONDS</span> <span class="o">=</span> <span class="n">Histogram</span><span class="p">(</span>
    <span class="s">"sentinel_ttft_seconds"</span><span class="p">,</span>
    <span class="s">"Time to first token in seconds"</span><span class="p">,</span>
    <span class="n">labelnames</span><span class="o">=</span><span class="p">[</span><span class="s">"backend"</span><span class="p">,</span> <span class="s">"endpoint"</span><span class="p">,</span> <span class="s">"model"</span><span class="p">],</span>
    <span class="n">buckets</span><span class="o">=</span><span class="n">LATENCY_BUCKETS_SEC</span>
<span class="p">)</span>

<span class="n">ITL_SECONDS</span> <span class="o">=</span> <span class="n">Histogram</span><span class="p">(</span>
    <span class="s">"sentinel_itl_seconds"</span><span class="p">,</span>
    <span class="s">"Inter-token latency in seconds"</span><span class="p">,</span>
    <span class="n">labelnames</span><span class="o">=</span><span class="p">[</span><span class="s">"backend"</span><span class="p">,</span> <span class="s">"endpoint"</span><span class="p">,</span> <span class="s">"model"</span><span class="p">],</span>
    <span class="n">buckets</span><span class="o">=</span><span class="p">[</span><span class="mf">0.005</span><span class="p">,</span> <span class="mf">0.010</span><span class="p">,</span> <span class="mf">0.020</span><span class="p">,</span> <span class="mf">0.030</span><span class="p">,</span> <span class="mf">0.050</span><span class="p">,</span> <span class="mf">0.075</span><span class="p">,</span> <span class="mf">0.100</span><span class="p">,</span> <span class="mf">0.150</span><span class="p">,</span> <span class="mf">0.200</span><span class="p">,</span> <span class="mf">0.500</span><span class="p">]</span>
<span class="p">)</span>

<span class="n">TPOT_SECONDS</span> <span class="o">=</span> <span class="n">Histogram</span><span class="p">(</span>
    <span class="s">"sentinel_tpot_seconds"</span><span class="p">,</span>
    <span class="s">"Time per output token in seconds"</span><span class="p">,</span>
    <span class="n">labelnames</span><span class="o">=</span><span class="p">[</span><span class="s">"backend"</span><span class="p">,</span> <span class="s">"endpoint"</span><span class="p">,</span> <span class="s">"model"</span><span class="p">],</span>
    <span class="n">buckets</span><span class="o">=</span><span class="p">[</span><span class="mf">0.005</span><span class="p">,</span> <span class="mf">0.010</span><span class="p">,</span> <span class="mf">0.020</span><span class="p">,</span> <span class="mf">0.030</span><span class="p">,</span> <span class="mf">0.050</span><span class="p">,</span> <span class="mf">0.075</span><span class="p">,</span> <span class="mf">0.100</span><span class="p">,</span> <span class="mf">0.150</span><span class="p">,</span> <span class="mf">0.200</span><span class="p">,</span> <span class="mf">0.500</span><span class="p">]</span>
<span class="p">)</span>

<span class="n">CLASSIFICATION_LATENCY_SECONDS</span> <span class="o">=</span> <span class="n">Histogram</span><span class="p">(</span>
    <span class="s">"sentinel_classification_latency_seconds"</span><span class="p">,</span>
    <span class="s">"Privacy classification latency in seconds"</span><span class="p">,</span>
    <span class="n">labelnames</span><span class="o">=</span><span class="p">[</span><span class="s">"detection_method"</span><span class="p">],</span>
    <span class="n">buckets</span><span class="o">=</span><span class="p">[</span><span class="mf">0.0001</span><span class="p">,</span> <span class="mf">0.0005</span><span class="p">,</span> <span class="mf">0.001</span><span class="p">,</span> <span class="mf">0.005</span><span class="p">,</span> <span class="mf">0.01</span><span class="p">,</span> <span class="mf">0.05</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">]</span>
<span class="p">)</span>

<span class="c1"># =============================================================================
# Shadow Mode Metrics
# =============================================================================
</span><span class="n">SHADOW_REQUESTS_TOTAL</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">(</span>
    <span class="s">"sentinel_shadow_requests_total"</span><span class="p">,</span>
    <span class="s">"Total shadow mode comparisons"</span><span class="p">,</span>
    <span class="n">labelnames</span><span class="o">=</span><span class="p">[</span><span class="s">"status"</span><span class="p">]</span>  <span class="c1"># success, timeout, error
</span><span class="p">)</span>

<span class="n">SHADOW_SIMILARITY_SCORE</span> <span class="o">=</span> <span class="n">Histogram</span><span class="p">(</span>
    <span class="s">"sentinel_shadow_similarity_score"</span><span class="p">,</span>
    <span class="s">"Semantic similarity score between cloud and local outputs"</span><span class="p">,</span>
    <span class="n">buckets</span><span class="o">=</span><span class="p">[</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.4</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="mf">0.6</span><span class="p">,</span> <span class="mf">0.7</span><span class="p">,</span> <span class="mf">0.75</span><span class="p">,</span> <span class="mf">0.8</span><span class="p">,</span> <span class="mf">0.85</span><span class="p">,</span> <span class="mf">0.9</span><span class="p">,</span> <span class="mf">0.95</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">]</span>
<span class="p">)</span>

<span class="n">SHADOW_QUALITY_MATCH</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">(</span>
    <span class="s">"sentinel_shadow_quality_match_total"</span><span class="p">,</span>
    <span class="s">"Shadow comparisons where local quality matched cloud"</span><span class="p">,</span>
    <span class="n">labelnames</span><span class="o">=</span><span class="p">[</span><span class="s">"tier"</span><span class="p">]</span>
<span class="p">)</span>

<span class="c1"># =============================================================================
# Cost Metrics
# =============================================================================
</span><span class="n">COST_USD_TOTAL</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">(</span>
    <span class="s">"sentinel_cost_usd_total"</span><span class="p">,</span>
    <span class="s">"Total inference cost in USD"</span><span class="p">,</span>
    <span class="n">labelnames</span><span class="o">=</span><span class="p">[</span><span class="s">"backend"</span><span class="p">,</span> <span class="s">"model"</span><span class="p">]</span>
<span class="p">)</span>

<span class="n">COST_SAVINGS_USD_TOTAL</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">(</span>
    <span class="s">"sentinel_cost_savings_usd_total"</span><span class="p">,</span>
    <span class="s">"Total cost savings from local routing in USD"</span>
<span class="p">)</span>
</code></pre></div></div>
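<p>The bucket boundaries above deserve a note. Prometheus histograms store only cumulative counts per bucket, and <code class="language-plaintext highlighter-rouge">histogram_quantile</code> linearly interpolates inside the bucket that contains the target rank, so a p95 landing in a wide bucket is estimated coarsely. A simplified stdlib sketch of that estimation (real PromQL operates on rates of these counters, not raw counts):</p>

```python
import bisect


def histogram_quantile(q: float, upper_bounds: list[float],
                       cumulative_counts: list[int]) -> float:
    """Approximate a quantile from cumulative buckets via linear interpolation,
    mimicking (in simplified form) Prometheus's histogram_quantile."""
    total = cumulative_counts[-1]
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    i = bisect.bisect_left(cumulative_counts, rank)
    lo = upper_bounds[i - 1] if i > 0 else 0.0
    hi = upper_bounds[i]
    below = cumulative_counts[i - 1] if i > 0 else 0
    in_bucket = cumulative_counts[i] - below
    if in_bucket == 0:
        return hi
    return lo + (hi - lo) * (rank - below) / in_bucket


# TTFT bucket bounds (seconds) with hypothetical cumulative counts.
bounds = [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]
counts = [5, 40, 80, 95, 99, 100, 100]
p95 = histogram_quantile(0.95, bounds, counts)  # ≈ 0.1 s
```

<p>This is why the ITL and TPOT histograms above use dense buckets in the 5–200 ms range: that is where their quantiles actually live, and a quantile that falls in a wide bucket can only be as precise as the bucket is narrow.</p>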

<h3 id="helper-functions-for-clean-recording">Helper Functions for Clean Recording</h3>

<p>Rather than scattering metric updates throughout the codebase, I centralized them:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">record_request</span><span class="p">(</span>
    <span class="n">route</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="n">backend</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="n">endpoint</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="n">model</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="n">tier</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
    <span class="n">status</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"success"</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="s">"""Record a completed request."""</span>
    <span class="n">REQUESTS_TOTAL</span><span class="p">.</span><span class="n">labels</span><span class="p">(</span>
        <span class="n">route</span><span class="o">=</span><span class="n">route</span><span class="p">,</span>
        <span class="n">backend</span><span class="o">=</span><span class="n">backend</span><span class="p">,</span>
        <span class="n">endpoint</span><span class="o">=</span><span class="n">endpoint</span><span class="p">,</span>
        <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span>
        <span class="n">tier</span><span class="o">=</span><span class="nb">str</span><span class="p">(</span><span class="n">tier</span><span class="p">),</span>
        <span class="n">status</span><span class="o">=</span><span class="n">status</span>
    <span class="p">).</span><span class="n">inc</span><span class="p">()</span>


<span class="k">def</span> <span class="nf">record_latencies</span><span class="p">(</span>
    <span class="n">backend</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="n">endpoint</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="n">model</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="n">ttft_ms</span><span class="p">:</span> <span class="nb">float</span> <span class="o">|</span> <span class="bp">None</span><span class="p">,</span>
    <span class="n">itl_ms</span><span class="p">:</span> <span class="nb">float</span> <span class="o">|</span> <span class="bp">None</span><span class="p">,</span>
    <span class="n">tpot_ms</span><span class="p">:</span> <span class="nb">float</span> <span class="o">|</span> <span class="bp">None</span><span class="p">,</span>
    <span class="n">total_ms</span><span class="p">:</span> <span class="nb">float</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="s">"""Record all latency metrics for a request."""</span>
    <span class="n">labels</span> <span class="o">=</span> <span class="p">{</span><span class="s">"backend"</span><span class="p">:</span> <span class="n">backend</span><span class="p">,</span> <span class="s">"endpoint"</span><span class="p">:</span> <span class="n">endpoint</span><span class="p">,</span> <span class="s">"model"</span><span class="p">:</span> <span class="n">model</span><span class="p">}</span>
    
    <span class="k">if</span> <span class="n">ttft_ms</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">TTFT_SECONDS</span><span class="p">.</span><span class="n">labels</span><span class="p">(</span><span class="o">**</span><span class="n">labels</span><span class="p">).</span><span class="n">observe</span><span class="p">(</span><span class="n">ttft_ms</span> <span class="o">/</span> <span class="mi">1000</span><span class="p">)</span>
    
    <span class="k">if</span> <span class="n">itl_ms</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">ITL_SECONDS</span><span class="p">.</span><span class="n">labels</span><span class="p">(</span><span class="o">**</span><span class="n">labels</span><span class="p">).</span><span class="n">observe</span><span class="p">(</span><span class="n">itl_ms</span> <span class="o">/</span> <span class="mi">1000</span><span class="p">)</span>
    
    <span class="k">if</span> <span class="n">tpot_ms</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">TPOT_SECONDS</span><span class="p">.</span><span class="n">labels</span><span class="p">(</span><span class="o">**</span><span class="n">labels</span><span class="p">).</span><span class="n">observe</span><span class="p">(</span><span class="n">tpot_ms</span> <span class="o">/</span> <span class="mi">1000</span><span class="p">)</span>
    
    <span class="n">INFERENCE_LATENCY_SECONDS</span><span class="p">.</span><span class="n">labels</span><span class="p">(</span><span class="o">**</span><span class="n">labels</span><span class="p">).</span><span class="n">observe</span><span class="p">(</span><span class="n">total_ms</span> <span class="o">/</span> <span class="mi">1000</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">record_shadow_result</span><span class="p">(</span>
    <span class="n">status</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="n">tier</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
    <span class="n">similarity_score</span><span class="p">:</span> <span class="nb">float</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
    <span class="n">latency_diff_ms</span><span class="p">:</span> <span class="nb">float</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
    <span class="n">cost_savings_usd</span><span class="p">:</span> <span class="nb">float</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
    <span class="n">is_quality_match</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">False</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="s">"""Record shadow mode comparison results."""</span>
    <span class="n">SHADOW_REQUESTS_TOTAL</span><span class="p">.</span><span class="n">labels</span><span class="p">(</span><span class="n">status</span><span class="o">=</span><span class="n">status</span><span class="p">).</span><span class="n">inc</span><span class="p">()</span>
    
    <span class="k">if</span> <span class="n">status</span> <span class="o">==</span> <span class="s">"success"</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">similarity_score</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
            <span class="n">SHADOW_SIMILARITY_SCORE</span><span class="p">.</span><span class="n">observe</span><span class="p">(</span><span class="n">similarity_score</span><span class="p">)</span>
        
        <span class="k">if</span> <span class="n">is_quality_match</span><span class="p">:</span>
            <span class="n">SHADOW_QUALITY_MATCH</span><span class="p">.</span><span class="n">labels</span><span class="p">(</span><span class="n">tier</span><span class="o">=</span><span class="nb">str</span><span class="p">(</span><span class="n">tier</span><span class="p">)).</span><span class="n">inc</span><span class="p">()</span>
</code></pre></div></div>

<h3 id="traces-opentelemetry-to-tempo">Traces: OpenTelemetry to Tempo</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">opentelemetry</span> <span class="kn">import</span> <span class="n">trace</span>
<span class="kn">from</span> <span class="nn">opentelemetry.exporter.otlp.proto.grpc.trace_exporter</span> <span class="kn">import</span> <span class="n">OTLPSpanExporter</span>
<span class="kn">from</span> <span class="nn">opentelemetry.sdk.trace</span> <span class="kn">import</span> <span class="n">TracerProvider</span>
<span class="kn">from</span> <span class="nn">opentelemetry.sdk.trace.export</span> <span class="kn">import</span> <span class="n">BatchSpanProcessor</span>
<span class="kn">from</span> <span class="nn">opentelemetry.sdk.resources</span> <span class="kn">import</span> <span class="n">SERVICE_NAME</span><span class="p">,</span> <span class="n">Resource</span>

<span class="k">def</span> <span class="nf">setup_tracing</span><span class="p">(</span>
    <span class="n">service_name</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"inference-sentinel"</span><span class="p">,</span>
    <span class="n">otlp_endpoint</span><span class="p">:</span> <span class="nb">str</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="s">"""Initialize OpenTelemetry tracing."""</span>
    
    <span class="n">resource</span> <span class="o">=</span> <span class="n">Resource</span><span class="p">.</span><span class="n">create</span><span class="p">({</span>
        <span class="n">SERVICE_NAME</span><span class="p">:</span> <span class="n">service_name</span><span class="p">,</span>
        <span class="s">"service.namespace"</span><span class="p">:</span> <span class="s">"inference-sentinel"</span><span class="p">,</span>
    <span class="p">})</span>
    
    <span class="n">provider</span> <span class="o">=</span> <span class="n">TracerProvider</span><span class="p">(</span><span class="n">resource</span><span class="o">=</span><span class="n">resource</span><span class="p">)</span>
    
    <span class="c1"># Export to Tempo via OTLP
</span>    <span class="k">if</span> <span class="n">otlp_endpoint</span><span class="p">:</span>
        <span class="n">otlp_exporter</span> <span class="o">=</span> <span class="n">OTLPSpanExporter</span><span class="p">(</span><span class="n">endpoint</span><span class="o">=</span><span class="n">otlp_endpoint</span><span class="p">,</span> <span class="n">insecure</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="n">provider</span><span class="p">.</span><span class="n">add_span_processor</span><span class="p">(</span><span class="n">BatchSpanProcessor</span><span class="p">(</span><span class="n">otlp_exporter</span><span class="p">))</span>
    
    <span class="n">trace</span><span class="p">.</span><span class="n">set_tracer_provider</span><span class="p">(</span><span class="n">provider</span><span class="p">)</span>


<span class="c1"># Context manager for spans
</span><span class="o">@</span><span class="n">contextmanager</span>
<span class="k">def</span> <span class="nf">trace_span</span><span class="p">(</span>
    <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="n">attributes</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Generator</span><span class="p">[</span><span class="n">Span</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="bp">None</span><span class="p">]:</span>
    <span class="s">"""Create a traced span context."""</span>
    <span class="n">tracer</span> <span class="o">=</span> <span class="n">get_tracer</span><span class="p">()</span>
    <span class="k">with</span> <span class="n">tracer</span><span class="p">.</span><span class="n">start_as_current_span</span><span class="p">(</span><span class="n">name</span><span class="p">)</span> <span class="k">as</span> <span class="n">span</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">attributes</span><span class="p">:</span>
            <span class="k">for</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span> <span class="ow">in</span> <span class="n">attributes</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
                <span class="k">if</span> <span class="n">value</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
                    <span class="n">span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">value</span><span class="p">)</span>
        <span class="k">yield</span> <span class="n">span</span>


<span class="c1"># Decorator for easy function tracing
</span><span class="k">def</span> <span class="nf">traced</span><span class="p">(</span><span class="n">name</span><span class="p">:</span> <span class="nb">str</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span><span class="p">):</span>
    <span class="s">"""Decorator to trace a function."""</span>
    <span class="k">def</span> <span class="nf">decorator</span><span class="p">(</span><span class="n">func</span><span class="p">):</span>
        <span class="n">span_name</span> <span class="o">=</span> <span class="n">name</span> <span class="ow">or</span> <span class="n">func</span><span class="p">.</span><span class="n">__name__</span>
        
        <span class="k">async</span> <span class="k">def</span> <span class="nf">async_wrapper</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
            <span class="k">with</span> <span class="n">trace_span</span><span class="p">(</span><span class="n">span_name</span><span class="p">)</span> <span class="k">as</span> <span class="n">span</span><span class="p">:</span>
                <span class="k">try</span><span class="p">:</span>
                    <span class="n">result</span> <span class="o">=</span> <span class="k">await</span> <span class="n">func</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
                    <span class="n">span</span><span class="p">.</span><span class="n">set_status</span><span class="p">(</span><span class="n">Status</span><span class="p">(</span><span class="n">StatusCode</span><span class="p">.</span><span class="n">OK</span><span class="p">))</span>
                    <span class="k">return</span> <span class="n">result</span>
                <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
                    <span class="n">span</span><span class="p">.</span><span class="n">set_status</span><span class="p">(</span><span class="n">Status</span><span class="p">(</span><span class="n">StatusCode</span><span class="p">.</span><span class="n">ERROR</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">)))</span>
                    <span class="n">span</span><span class="p">.</span><span class="n">record_exception</span><span class="p">(</span><span class="n">e</span><span class="p">)</span>
                    <span class="k">raise</span>
        
        <span class="c1"># ... sync wrapper similar
</span>        <span class="k">return</span> <span class="n">async_wrapper</span> <span class="k">if</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">iscoroutinefunction</span><span class="p">(</span><span class="n">func</span><span class="p">)</span> <span class="k">else</span> <span class="n">sync_wrapper</span>
    <span class="k">return</span> <span class="n">decorator</span>
</code></pre></div></div>

<h3 id="logs-structured-with-structlog">Logs: Structured with structlog</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">structlog</span>

<span class="k">def</span> <span class="nf">setup_logging</span><span class="p">(</span>
    <span class="n">log_level</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"INFO"</span><span class="p">,</span>
    <span class="n">json_logs</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span><span class="p">,</span>  <span class="c1"># JSON for Loki, console for dev
</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="s">"""Configure structured logging."""</span>
    
    <span class="n">processors</span> <span class="o">=</span> <span class="p">[</span>
        <span class="n">structlog</span><span class="p">.</span><span class="n">stdlib</span><span class="p">.</span><span class="n">add_logger_name</span><span class="p">,</span>
        <span class="n">add_log_level</span><span class="p">,</span>
        <span class="n">add_timestamp</span><span class="p">,</span>
        <span class="n">structlog</span><span class="p">.</span><span class="n">processors</span><span class="p">.</span><span class="n">StackInfoRenderer</span><span class="p">(),</span>
        <span class="n">structlog</span><span class="p">.</span><span class="n">processors</span><span class="p">.</span><span class="n">format_exc_info</span><span class="p">,</span>
    <span class="p">]</span>
    
    <span class="k">if</span> <span class="n">json_logs</span><span class="p">:</span>
        <span class="n">processors</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">structlog</span><span class="p">.</span><span class="n">processors</span><span class="p">.</span><span class="n">JSONRenderer</span><span class="p">())</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">processors</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">structlog</span><span class="p">.</span><span class="n">dev</span><span class="p">.</span><span class="n">ConsoleRenderer</span><span class="p">(</span><span class="n">colors</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span>
    
    <span class="n">structlog</span><span class="p">.</span><span class="n">configure</span><span class="p">(</span>
        <span class="n">processors</span><span class="o">=</span><span class="n">processors</span><span class="p">,</span>
        <span class="n">wrapper_class</span><span class="o">=</span><span class="n">structlog</span><span class="p">.</span><span class="n">stdlib</span><span class="p">.</span><span class="n">BoundLogger</span><span class="p">,</span>
        <span class="n">logger_factory</span><span class="o">=</span><span class="n">structlog</span><span class="p">.</span><span class="n">stdlib</span><span class="p">.</span><span class="n">LoggerFactory</span><span class="p">(),</span>
    <span class="p">)</span>
</code></pre></div></div>

<h3 id="helper-logger-classes">Helper Logger Classes</h3>

<p>Domain-specific loggers for cleaner call sites:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">InferenceLogger</span><span class="p">:</span>
    <span class="s">"""Helper for logging inference-related events."""</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">logger_name</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"sentinel.inference"</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">logger</span> <span class="o">=</span> <span class="n">get_logger</span><span class="p">(</span><span class="n">logger_name</span><span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">request_started</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">request_id</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">route</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">backend</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">tier</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
        <span class="n">tier_label</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">entities</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="s">"""Log when an inference request starts."""</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span>
            <span class="s">"Inference request started"</span><span class="p">,</span>
            <span class="n">request_id</span><span class="o">=</span><span class="n">request_id</span><span class="p">,</span>
            <span class="n">route</span><span class="o">=</span><span class="n">route</span><span class="p">,</span>
            <span class="n">backend</span><span class="o">=</span><span class="n">backend</span><span class="p">,</span>
            <span class="n">privacy_tier</span><span class="o">=</span><span class="n">tier</span><span class="p">,</span>
            <span class="n">privacy_tier_label</span><span class="o">=</span><span class="n">tier_label</span><span class="p">,</span>
            <span class="n">entities_detected</span><span class="o">=</span><span class="n">entities</span>
        <span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">request_completed</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">request_id</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">backend</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">model</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">total_tokens</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
        <span class="n">latency_ms</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span>
        <span class="n">cost_usd</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="s">"""Log when an inference request completes."""</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span>
            <span class="s">"Inference request completed"</span><span class="p">,</span>
            <span class="n">request_id</span><span class="o">=</span><span class="n">request_id</span><span class="p">,</span>
            <span class="n">backend</span><span class="o">=</span><span class="n">backend</span><span class="p">,</span>
            <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span>
            <span class="n">total_tokens</span><span class="o">=</span><span class="n">total_tokens</span><span class="p">,</span>
            <span class="n">latency_ms</span><span class="o">=</span><span class="nb">round</span><span class="p">(</span><span class="n">latency_ms</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
            <span class="n">cost_usd</span><span class="o">=</span><span class="nb">round</span><span class="p">(</span><span class="n">cost_usd</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span>
        <span class="p">)</span>


<span class="k">class</span> <span class="nc">ClassificationLogger</span><span class="p">:</span>
    <span class="s">"""Helper for logging classification events."""</span>
    
    <span class="k">def</span> <span class="nf">sensitive_content_detected</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">tier</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
        <span class="n">tier_label</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">entities</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">]</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="s">"""Log when sensitive content is detected (tier &gt;= 2)."""</span>
        <span class="k">if</span> <span class="n">tier</span> <span class="o">&gt;=</span> <span class="mi">2</span><span class="p">:</span>
            <span class="c1"># Don't log the actual content, just metadata
</span>            <span class="bp">self</span><span class="p">.</span><span class="n">logger</span><span class="p">.</span><span class="n">warning</span><span class="p">(</span>
                <span class="s">"Sensitive content detected"</span><span class="p">,</span>
                <span class="n">privacy_tier</span><span class="o">=</span><span class="n">tier</span><span class="p">,</span>
                <span class="n">privacy_tier_label</span><span class="o">=</span><span class="n">tier_label</span><span class="p">,</span>
                <span class="n">entity_types</span><span class="o">=</span><span class="p">[</span><span class="n">e</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"entity_type"</span><span class="p">)</span> <span class="k">for</span> <span class="n">e</span> <span class="ow">in</span> <span class="n">entities</span><span class="p">],</span>
                <span class="n">action</span><span class="o">=</span><span class="s">"routing_to_local"</span>
            <span class="p">)</span>
</code></pre></div></div>

<h3 id="the-grafana-dashboards">The Grafana Dashboards</h3>

<p>Two Grafana dashboards split the visibility work: one for operational health, one for ML and routing quality.</p>

<p><img src="/assets/images/inference-sentinel/generic-graphana02.png" alt="Overview Dashboard" />
<em>Real-time monitoring: request rate, TTFT/ITL by model, route distribution, backend health, and cost tracking</em></p>

<p><strong>Overview Dashboard</strong> — operational health</p>
<ul>
  <li><strong>Backend Health</strong>: gemma and mistral both healthy</li>
  <li><strong>Route Distribution</strong>: 70% local, 30% cloud</li>
  <li><strong>Request Share by Model</strong>: Traffic distribution across all backends</li>
  <li>Per-model TTFT, ITL, and TPOT</li>
  <li>Cost accumulation</li>
</ul>

<p><img src="/assets/images/inference-sentinel/generic-graphana01.png" alt="Controller Dashboard" />
<em>Controller Decisions Data: Similarity Score, Shadow Comparison, Cost Savings and Routing Recommendations</em></p>

<p><strong>Controller Dashboard</strong> — ML/quality metrics</p>
<ul>
  <li>Similarity score trends</li>
  <li>Shadow comparison counts</li>
  <li>Latency differential (local - cloud)</li>
  <li>Cost savings over time</li>
  <li>Routing recommendations log</li>
</ul>

<h3 id="design-decision-metric-cardinality">Design Decision: Metric Cardinality</h3>

<p><strong>Trade-off considered:</strong> More labels = more granular data, but also more time series (cardinality explosion).</p>

<p><strong>My approach:</strong></p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">route</code>: 2 values (local, cloud)</li>
  <li><code class="language-plaintext highlighter-rouge">backend</code>: 4 values (ollama, anthropic, google, unknown)</li>
  <li><code class="language-plaintext highlighter-rouge">endpoint</code>: ~4 values (gemma, mistral, anthropic, google)</li>
  <li><code class="language-plaintext highlighter-rouge">model</code>: ~6 values (bounded by configured models)</li>
  <li><code class="language-plaintext highlighter-rouge">tier</code>: 4 values (0, 1, 2, 3)</li>
  <li><code class="language-plaintext highlighter-rouge">status</code>: 2 values (success, error)</li>
</ul>

<p>Total theoretical cardinality: 2 × 4 × 4 × 6 × 4 × 2 = 1,536 series. Acceptable for a single-instance deployment.</p>
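
<p>That product is worth recomputing whenever a label is added, since each new label multiplies the series count. A quick check in Python:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

# Label value counts from the schema above
label_values = {
    "route": 2,      # local, cloud
    "backend": 4,    # ollama, anthropic, google, unknown
    "endpoint": 4,   # gemma, mistral, anthropic, google
    "model": 6,      # bounded by configured models
    "tier": 4,       # 0-3
    "status": 2,     # success, error
}

total = math.prod(label_values.values())
print(total)  # 1536 series worst case, per metric
</code></pre></div></div>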

<p><strong>Lesson learned:</strong> I added the <code class="language-plaintext highlighter-rouge">model</code> label to <code class="language-plaintext highlighter-rouge">REQUESTS_TOTAL</code> late, which broke existing Grafana queries. Design your metrics schema before writing dashboards.</p>

<hr />

<h2 id="deployment-architecture">Deployment Architecture</h2>

<h3 id="docker-compose-stack">Docker Compose Stack</h3>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">services</span><span class="pi">:</span>
  <span class="na">sentinel</span><span class="pi">:</span>       <span class="c1"># The gateway (FastAPI)</span>
  <span class="na">prometheus</span><span class="pi">:</span>     <span class="c1"># Metrics storage</span>
  <span class="na">grafana</span><span class="pi">:</span>        <span class="c1"># Visualization</span>
  <span class="na">loki</span><span class="pi">:</span>           <span class="c1"># Log aggregation  </span>
  <span class="na">tempo</span><span class="pi">:</span>          <span class="c1"># Distributed tracing</span>
  
  <span class="c1"># Optional - only with --profile containerized</span>
  <span class="na">ollama</span><span class="pi">:</span>         <span class="c1"># Local inference (if no native Ollama)</span>
</code></pre></div></div>

<p><strong>Default setup:</strong> Ollama runs natively on the Mac Mini M4 to leverage Metal GPU acceleration. The gateway connects via <code class="language-plaintext highlighter-rouge">host.docker.internal:11434</code>.</p>

<p><strong>Containerized option:</strong> For environments without native Ollama, run with the profile flag:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker-compose <span class="nt">--profile</span> containerized up <span class="nt">-d</span>
</code></pre></div></div>

<p>This pulls and runs <code class="language-plaintext highlighter-rouge">ollama/ollama:latest</code> with gemma3:4b and mistral pre-loaded.</p>

<h3 id="design-decision-why-not-kubernetes">Design Decision: Why Not Kubernetes?</h3>

<p><strong>Trade-off considered:</strong></p>

<table>
  <thead>
    <tr>
      <th>Deployment</th>
      <th>Pros</th>
      <th>Cons</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Docker Compose</td>
      <td>Simple, local-friendly</td>
      <td>Single node only</td>
    </tr>
    <tr>
      <td>Kubernetes</td>
      <td>Scalable, production-grade</td>
      <td>Complexity overhead</td>
    </tr>
  </tbody>
</table>

<p><strong>My choice:</strong> Docker Compose for v1.</p>

<p><strong>Reasoning:</strong></p>
<ol>
  <li>Primary use case is local inference on a single Mac Mini M4</li>
  <li>K8s manifests are planned for Phase 6</li>
  <li>Compose is sufficient to demonstrate the architecture</li>
  <li>Lower barrier to entry for people trying the project</li>
</ol>

<hr />

<h2 id="configuration-philosophy">Configuration Philosophy</h2>

<h3 id="environment-variables-vs-env-file">Environment Variables vs .env File</h3>

<p>I use <a href="https://docs.pydantic.dev/latest/concepts/pydantic_settings/">pydantic-settings</a> which supports both:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Settings</span><span class="p">(</span><span class="n">BaseSettings</span><span class="p">):</span>
    <span class="n">model_config</span> <span class="o">=</span> <span class="n">SettingsConfigDict</span><span class="p">(</span>
        <span class="n">env_prefix</span><span class="o">=</span><span class="s">"SENTINEL_"</span><span class="p">,</span>
        <span class="n">env_nested_delimiter</span><span class="o">=</span><span class="s">"__"</span><span class="p">,</span>
        <span class="n">env_file</span><span class="o">=</span><span class="s">".env"</span><span class="p">,</span>
        <span class="n">env_file_encoding</span><span class="o">=</span><span class="s">"utf-8"</span><span class="p">,</span>
        <span class="n">extra</span><span class="o">=</span><span class="s">"ignore"</span><span class="p">,</span>
    <span class="p">)</span>
    
    <span class="c1"># Nested config classes
</span>    <span class="n">local</span><span class="p">:</span> <span class="n">LocalBackendsConfig</span>
    <span class="n">cloud</span><span class="p">:</span> <span class="n">CloudBackendsConfig</span>
    <span class="n">cloud_selection</span><span class="p">:</span> <span class="n">CloudSelectionConfig</span>
    <span class="n">session</span><span class="p">:</span> <span class="n">SessionConfig</span>
    <span class="n">shadow</span><span class="p">:</span> <span class="n">ShadowConfig</span>
    <span class="n">controller</span><span class="p">:</span> <span class="n">ControllerSettings</span>
    <span class="n">telemetry</span><span class="p">:</span> <span class="n">TelemetryConfig</span>
</code></pre></div></div>

<p><strong>Priority order:</strong></p>
<ol>
  <li>Environment variables (highest)</li>
  <li><code class="language-plaintext highlighter-rouge">.env</code> file</li>
  <li>Default values (lowest)</li>
</ol>

<p><strong>Lesson learned:</strong> This caused a subtle bug. <code class="language-plaintext highlighter-rouge">SENTINEL_LOCAL__SELECTION_STRATEGY=priority</code> in my shell was overriding <code class="language-plaintext highlighter-rouge">selection_strategy: round_robin</code> that I expected from defaults. The nested delimiter <code class="language-plaintext highlighter-rouge">__</code> is powerful but can be surprising.</p>
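
<p>The mapping from an environment variable to a nested field is easier to reason about once written out. This is an illustrative sketch of what the <code class="language-plaintext highlighter-rouge">env_nested_delimiter</code> behavior looks like, not pydantic-settings' actual implementation:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import os

def parse_env(prefix="SENTINEL_", delim="__", environ=None):
    """Sketch: map PREFIX_A__B=v env vars onto {'a': {'b': 'v'}}."""
    environ = environ if environ is not None else os.environ
    cfg = {}
    for key, value in environ.items():
        if not key.startswith(prefix):
            continue
        path = key[len(prefix):].lower().split(delim)
        node = cfg
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return cfg

print(parse_env(environ={"SENTINEL_LOCAL__SELECTION_STRATEGY": "priority"}))
# {'local': {'selection_strategy': 'priority'}}
</code></pre></div></div>

<p>The shell override winning is exactly the priority order above: the environment variable beats both the <code class="language-plaintext highlighter-rouge">.env</code> file and the coded default.</p>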

<h3 id="semantic-validation-with-pydantic">Semantic Validation with Pydantic</h3>

<p>After getting bitten by invalid config values, I added range constraints:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">SessionConfig</span><span class="p">(</span><span class="n">BaseSettings</span><span class="p">):</span>
    <span class="n">lock_threshold_tier</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span>
        <span class="n">default</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
        <span class="n">ge</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>       <span class="c1"># Must be &gt;= 1
</span>        <span class="n">le</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span>       <span class="c1"># Must be &lt;= 3
</span>        <span class="n">description</span><span class="o">=</span><span class="s">"Minimum tier to trigger LOCAL_LOCKED"</span>
    <span class="p">)</span>
    <span class="n">ttl_seconds</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span>
        <span class="n">default</span><span class="o">=</span><span class="mi">900</span><span class="p">,</span>
        <span class="n">ge</span><span class="o">=</span><span class="mi">60</span><span class="p">,</span>      <span class="c1"># At least 1 minute
</span>        <span class="n">le</span><span class="o">=</span><span class="mi">86400</span><span class="p">,</span>   <span class="c1"># At most 1 day
</span>    <span class="p">)</span>
</code></pre></div></div>

<p>This prevents <code class="language-plaintext highlighter-rouge">lock_threshold_tier: 5</code> from being accepted — pydantic will raise a validation error at startup.</p>
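
<p>A minimal reproduction of the fail-fast behavior, using plain <code class="language-plaintext highlighter-rouge">BaseModel</code> so it doesn't pull in pydantic-settings (assumes pydantic v2):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from pydantic import BaseModel, Field, ValidationError

class SessionConfig(BaseModel):
    lock_threshold_tier: int = Field(default=2, ge=1, le=3)

try:
    SessionConfig(lock_threshold_tier=5)
except ValidationError as exc:
    # 5 violates le=3, so the config is rejected at construction time
    print(exc.errors()[0]["loc"])  # ('lock_threshold_tier',)
</code></pre></div></div>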

<hr />

<h2 id="what-id-do-differently">What I’d Do Differently</h2>

<h3 id="1-start-with-structured-logging">1. Start with Structured Logging</h3>

<p>I added structlog late. Retrofitting structured logging is painful. Start with it from day one.</p>

<h3 id="2-integration-tests-earlier">2. Integration Tests Earlier</h3>

<p>Unit tests are great but don’t catch issues like “the regex works but the NER model isn’t loaded in Docker.” Integration tests against a real Ollama instance would have caught several bugs earlier.</p>

<h3 id="3-add-semantic-validators-early">3. Add Semantic Validators Early</h3>

<p>I initially relied on pydantic’s type validation alone. After deploying with an invalid <code class="language-plaintext highlighter-rouge">lock_threshold_tier: 5</code>, I learned to add <code class="language-plaintext highlighter-rouge">ge=</code>/<code class="language-plaintext highlighter-rouge">le=</code> constraints from the start. Now invalid configs fail fast at startup.</p>

<h3 id="4-metrics-design-up-front">4. Metrics Design Up Front</h3>

<p>I added the <code class="language-plaintext highlighter-rouge">model</code> label to <code class="language-plaintext highlighter-rouge">REQUESTS_TOTAL</code> late, which broke existing Grafana queries. Design your metrics schema before writing dashboards.</p>

<hr />

<h2 id="coming-up-part-2">Coming Up: Part 2</h2>

<p>In Part 2, I’ll share the benchmarking results:</p>

<ul>
  <li>Classification accuracy across 1000+ test cases</li>
  <li>Latency percentiles (TTFT, ITL, TPOT) by model</li>
  <li>Shadow mode similarity distributions</li>
  <li>Cost analysis: cloud vs local routing</li>
  <li>Controller recommendation accuracy</li>
</ul>

<hr />

<p><strong>GitHub:</strong> <a href="https://github.com/kraghavan/inference-sentinel">github.com/kraghavan/inference-sentinel</a></p>

<p><em>Questions or feedback? Connect with me on <a href="https://linkedin.com/in/karthikaraghavan">LinkedIn</a>.</em></p>]]></content><author><name>Karthika Raghavan</name></author><category term="llm" /><category term="infrastructure" /><category term="smart gateway" /><category term="python" /><category term="privacy" /><category term="observability" /><category term="distributed-systems" /><summary type="html"><![CDATA[Part 1 of 2: Design decisions, trade-offs, and lessons from building inference-sentinel]]></summary></entry></feed>