<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://kraghavan.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://kraghavan.github.io/" rel="alternate" type="text/html" /><updated>2026-04-22T20:35:38+00:00</updated><id>https://kraghavan.github.io/feed.xml</id><title type="html">Karthika Raghavan</title><subtitle>Engineering blog on distributed systems, LLM infrastructure, and observability</subtitle><author><name>Karthika Raghavan</name></author><entry><title type="html">llm-d in Action — EPP Prefix Cache Routing and What It Actually Means</title><link href="https://kraghavan.github.io/llm-infrastructure/inference/2026/04/19/llm-d-epp-prefix-cache-results.html" rel="alternate" type="text/html" title="llm-d in Action — EPP Prefix Cache Routing and What It Actually Means" /><published>2026-04-19T00:00:00+00:00</published><updated>2026-04-19T00:00:00+00:00</updated><id>https://kraghavan.github.io/llm-infrastructure/inference/2026/04/19/llm-d-epp-prefix-cache-results</id><content type="html" xml:base="https://kraghavan.github.io/llm-infrastructure/inference/2026/04/19/llm-d-epp-prefix-cache-results.html"><![CDATA[<p>In the <a href="/llm-infrastructure/inference/2026/04/17/vllm-llm-d-nvidia-gh200-experiment.html">previous post</a> I documented everything that broke during the llm-d deployment on a Lambda Labs GH200. Ten gotchas, two false starts, one ticking hourly bill.</p>

<p>This post is the payoff — and more than that, it’s an attempt to say something useful beyond “look at these numbers.” The numbers are good. But the more interesting question is what they reveal about inference architecture that applies regardless of which GPU you’re running on or which serving framework you’ve chosen.</p>

<p><strong>Hardware:</strong> Lambda Labs GH200 480GB, ARM64<br />
<strong>Model:</strong> Qwen/Qwen3-0.6B<br />
<strong>Stack:</strong> llm-d v0.4.0, vllm/vllm-openai:latest, Istio gateway, kube-prometheus-stack<br />
<strong>Load testing:</strong> Locust with tenant simulation<br />
<strong>Observability:</strong> Prometheus + Grafana (llm-d Performance Dashboard, llm-d vLLM Overview)</p>

<hr />

<h2 id="the-hardware-deserves-more-than-a-footnote">The Hardware Deserves More Than a Footnote</h2>

<p>Most inference benchmarks treat hardware as a footnote — “tested on an A100” — without explaining why the hardware choice matters architecturally. The GH200 is worth understanding properly because it represents a design direction that the industry is converging on, and it changes some assumptions about what’s possible in inference.</p>

<p>One quick note before the specs: the GH200 is technically a unified memory architecture — just like the M4 Mac Mini from the previous post. CPU and GPU share the same address space without copying data between them. The Mac Mini has 16GB at roughly 200 GB/s. The GH200 has 576GB at up to 4,000 GB/s.</p>

<p>One costs $799. The other costs $2.29 per hour and will make you feel considerably better about your Mac Mini purchase. Same architectural principle, considerably different ambitions — and as the numbers in this post will show, considerably different outcomes.</p>

<h3 id="grace-hopper--one-package-two-chips-one-memory-bus">Grace Hopper — One Package, Two Chips, One Memory Bus</h3>

<p>The GH200 is not a GPU. It’s a <strong>Grace Hopper Superchip</strong> — NVIDIA’s Grace CPU (72 ARM Neoverse V2 cores) and an H100 Hopper GPU die, connected on the same package via <strong>NVLink-C2C</strong> (chip-to-chip).</p>

<figure style="max-width:720px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/grace-hopper-chip-nvidia.png" alt="Architecture of the NVIDIA GH200 Grace Hopper Superchip showing Grace CPU and Hopper GPU connected via NVLink-C2C" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    NVIDIA GH200 Grace Hopper Superchip architecture. The Grace CPU (72 ARM Neoverse V2 cores, 480GB LPDDR5X)
    and Hopper GPU (H100, 96GB HBM3e) are connected on the same package via NVLink-C2C at 900 GB/s bidirectional —
    7× the bandwidth of PCIe Gen5. This chip-to-chip interconnect is what makes unified CPU+GPU memory a practical
    reality rather than a marketing claim.
    Source: <a href="https://www.nvidia.com/en-us/data-center/grace-hopper-superchip/">NVIDIA GH200 product page</a>.
  </figcaption>
</figure>

<p>The critical number is the <strong>NVLink-C2C bandwidth: 900 GB/s bidirectional</strong>. For context:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Interconnect              Bandwidth
──────────────────────────────────────────────
PCIe 5.0 x16 (discrete)   128 GB/s
NVLink-C2C (GH200)         900 GB/s    ← 7× faster than PCIe Gen5
HBM3e on-chip (GH200)    4,000 GB/s
LPDDR5X (Grace CPU)        512 GB/s
</code></pre></div></div>

<p><em>Source: <a href="https://www.nvidia.com/en-us/data-center/grace-hopper-superchip/">NVIDIA GH200 Grace Hopper Superchip product page</a> and <a href="https://www.amax.com/content/files/2023/12/NVIDIA_Grace_Hopper_Superchip_Architecture_Overview_Whitepaper.pdf">GH200 architecture whitepaper</a>. The 7× PCIe Gen5 figure is NVIDIA’s own stated specification — “NVLink-C2C delivers up to 900 GB/s total bandwidth. This is 7x higher bandwidth than x16 PCIe Gen5 lanes.”</em></p>

<p>On a conventional discrete GPU system — an A100 or H100 in a PCIe slot — the CPU and GPU have separate memory pools connected by a 128 GB/s bus. Moving data between them is expensive. The GPU can’t efficiently use CPU DRAM for KV cache overflow because PCIe bandwidth is 31× lower than the GPU’s on-chip HBM bandwidth (4,000 ÷ 128 = 31.25×). Any spill to CPU memory becomes a bottleneck.</p>

<p>On the GH200, the Grace CPU’s 480GB of LPDDR5X memory is accessible to the Hopper GPU at 900 GB/s over NVLink-C2C. That’s not as fast as on-chip HBM3e, but it’s fast enough to be genuinely useful. The result is a unified addressable memory space — the GPU sees up to 576GB total (96GB HBM3e + 480GB LPDDR5X) — at bandwidths that make overflow to CPU DRAM a viable architectural choice rather than a last resort.</p>
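<p>A quick sanity check on what those bandwidth figures mean in practice. The sketch below computes the time to move a hypothetical 10 GB batch of KV blocks across each link, using the peak numbers quoted above (real transfers achieve somewhat less):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Peak link bandwidths from the table above, in GB/s (GB = 1e9 bytes).
BANDWIDTH_GBPS = {
    "PCIe 5.0 x16": 128,
    "NVLink-C2C":   900,
    "HBM3e":        4000,
}

def transfer_ms(n_bytes: float, gbps: float) -> float:
    """Milliseconds to move n_bytes at gbps GB/s."""
    return n_bytes / (gbps * 1e9) * 1e3

spill = 10e9  # hypothetical 10 GB of KV blocks spilled to CPU DRAM
for link, bw in BANDWIDTH_GBPS.items():
    print(f"{link:14s} {transfer_ms(spill, bw):7.1f} ms")
</code></pre></div></div>

<p>At roughly 78 ms per 10 GB, PCIe turns every spill into a visible stall; at roughly 11 ms, NVLink-C2C keeps the same spill in the territory of a few decode steps.</p>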

<figure style="max-width:720px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/grace-hopper-chip-nvidia2.png" alt="Memory architecture comparison for LLM inference — discrete GPU PCIe vs GH200 NVLink-C2C" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    Left: discrete GPU (A100/H100 PCIe) — CPU and GPU have separate memory pools connected by 128 GB/s PCIe Gen5.
    KV cache overflow to CPU DRAM is 31× slower than on-chip HBM, making it unusable in practice.
    Right: GH200 Grace Hopper — CPU and GPU share a unified 576GB address space (96GB HBM3e + 480GB LPDDR5X)
    connected at 900 GB/s via NVLink-C2C. KV cache tiering to CPU memory becomes a viable architectural option,
    not a last resort. This is why llm-d's tiered KV offloading feature maps directly onto GH200 hardware.
  </figcaption>
</figure>

<h3 id="why-this-matters-for-llm-inference">Why This Matters for LLM Inference</h3>

<p>At scale, inference systems are typically <strong>memory-bandwidth bound</strong>, not compute bound. This isn’t universal — small models at low concurrency can be compute-bound during prefill, and some workloads shift the bottleneck elsewhere. But for production multi-tenant serving, the binding constraint is almost always memory: how fast the hardware can load model weights and KV cache tensors for each forward pass. Every decode step reads the full KV cache. On a 7B model with a 4K context window and 32 concurrent requests, that’s dozens of gigabytes moving through the memory subsystem per second — and the rate of that movement, not the number of CUDA cores, determines your throughput ceiling.</p>
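<p>The back-of-envelope arithmetic is worth doing once. A sketch, assuming Llama-2-7B-like shapes (32 layers, 32 KV heads, head dimension 128, fp16; GQA or quantization would shrink these numbers):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Assumed 7B model shapes -- adjust for the actual architecture.
layers, kv_heads, head_dim, dtype_bytes = 32, 32, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V planes

context, batch = 4096, 32
kv_read_per_step = kv_per_token * context * batch  # bytes touched per forward pass
weights = 7e9 * dtype_bytes                        # ~14 GB of fp16 weights, also read

hbm_bytes_per_s = 3.35e12  # H100 SXM HBM3 peak
ceiling = hbm_bytes_per_s / (kv_read_per_step + weights)  # decode steps/s

print(f"KV per token:      {kv_per_token / 1e6:.2f} MB")
print(f"KV read per step:  {kv_read_per_step / 1e9:.1f} GB")
print(f"bandwidth ceiling: {ceiling:.0f} steps/s, ~{ceiling * batch:.0f} tok/s")
</code></pre></div></div>

<p>Nearly 70 GB touched per decode step before a single FLOP matters. That ratio, not CUDA core count, is what sets the throughput ceiling.</p>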

<p>The GH200’s NVLink-C2C doesn’t solve this — HBM3e is still the primary memory bandwidth for active inference. But it changes the economics of KV cache management. Tiered KV storage (hot blocks in HBM, warm blocks in LPDDR5X, cold blocks evicted to NVMe) becomes viable in a way it isn’t on PCIe-connected systems. llm-d’s architecture diagram from the previous post already shows tiered KV cache offloading as a first-class feature — the GH200 is the hardware that makes that design practical.</p>
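<p>To make “tiered” concrete: an illustrative LRU placement policy, a sketch of the idea only and not llm-d’s actual offloading implementation, with toy tier capacities:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import OrderedDict

# Capacity in blocks per tier -- toy numbers for illustration.
TIERS = [("HBM3e", 4), ("LPDDR5X", 8), ("NVMe", 1_000_000)]

def place_blocks(access_order):
    """Assign each KV block to a tier: most recently used blocks go fastest."""
    lru = OrderedDict()
    for b in access_order:          # replay accesses; newest ends up last
        lru.pop(b, None)
        lru[b] = True
    ranked = list(reversed(lru))    # most recently used first
    placement, i = {}, 0
    for tier, cap in TIERS:
        for b in ranked[i:i + cap]:
            placement[b] = tier
        i += cap
    return placement

p = place_blocks([f"blk{i}" for i in range(14)] + ["blk0"])  # blk0 re-touched last
print(p["blk0"], p["blk13"], p["blk1"])  # blk0 stays hot; blk1 ages out to NVMe
</code></pre></div></div>

<p>Hot blocks stay in HBM, recently used ones ride in LPDDR5X at 900 GB/s, and cold ones age out to NVMe. On a PCIe system the middle tier is too slow to be worth the trip.</p>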

<p>For these experiments, we didn’t exercise KV cache tiering. The model is small and the cache pressure is low. But the architecture is there, and it’s why the GH200 was the right choice for running the full llm-d stack rather than a conventional discrete GPU setup.</p>

<hr />

<h2 id="one-mental-model-before-the-numbers">One Mental Model Before the Numbers</h2>

<p>Before looking at a single metric, here is the frame I’d encourage you to carry into any inference system analysis:</p>

<p><strong>LLM inference is a memory scheduling problem that looks like a compute problem.</strong></p>

<p>Every team I’ve seen optimise inference focuses first on compute — GPU utilization, batch size, model quantization. These matter. But the real leverage is in memory: how much of the KV cache is warm, how often prefill recomputation is avoided, how well the scheduler keeps the right data in the right tier. The system that wins at scale is the system that does the least unnecessary work — and unnecessary prefill recomputation is the biggest source of waste in multi-tenant inference.</p>

<p>This is what EPP prefix cache routing is solving. Not “make the GPU faster.” Make the system do less work.</p>

<hr />

<h2 id="what-these-experiments-actually-test">What These Experiments Actually Test</h2>

<p>Experiments 1–4 use the <code class="language-plaintext highlighter-rouge">inference-scheduling</code> guide from the llm-d repo. This deploys <strong>a single decode pod that handles both prefill and decode</strong> — aggregated serving. The values.yaml is unambiguous:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">prefill</span><span class="pi">:</span>
  <span class="na">create</span><span class="pi">:</span> <span class="no">false</span>   <span class="c1"># one pod does everything</span>
</code></pre></div></div>

<p>The architecture is:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Every request →  Istio Gateway
                      │
                      ▼
                 EPP (Endpoint Picker Plugin)
                 ┌────────────────────────────────┐
                 │  prefix-cache-scorer           │
                 │  queue-depth-scorer            │
                 │  kv-utilization-scorer         │
                 └────────────────────────────────┘
                      │
                      ▼
               Single vLLM decode pod
               (prefill + decode, one process)
               KV cache lives here
</code></pre></div></div>

<p>With one pod, there is no cross-pod routing decision to make. The EPP’s job here is narrower: route requests with matching prefix hashes back to this pod consistently, so its KV cache stays warm. Round-robin with one pod is also consistent routing — but a naive load balancer doesn’t know about prefix hashes, so it can’t make cache-aware decisions when you add a second pod later.</p>
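<p>The mechanics behind “prefix hashes” are worth sketching. This is illustrative only, not the actual EPP scorer code, but it captures the idea: hash the prompt in fixed-size token blocks, chain the hashes so each one identifies an entire prefix, and score a pod by how many leading blocks it already holds warm:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import hashlib

BLOCK = 64  # tokens per hash block, vLLM-style prefix-caching granularity

def block_hashes(tokens):
    """Chained hash per full block -- each hash identifies the entire prefix."""
    hashes, prev = [], b""
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h = hashlib.sha256(prev + repr(tokens[i:i + BLOCK]).encode()).hexdigest()
        hashes.append(h)
        prev = h.encode()
    return hashes

def prefix_score(prompt_tokens, pod_cache):
    """How many leading blocks the pod already holds warm."""
    score = 0
    for h in block_hashes(prompt_tokens):
        if h not in pod_cache:
            break
        score += 1
    return score

tenant = list(range(200))          # stands in for a tokenized system prompt
pod_a = set(block_hashes(tenant))  # pod A has served this tenant before
print(prefix_score(tenant + [7, 8, 9], pod_a))    # all full blocks warm
print(prefix_score(list(range(500, 700)), pod_a)) # cold prompt, no match
</code></pre></div></div>

<p>The same tenant’s system prompt always yields the same leading hashes, which is why sticky tenant sessions translate directly into cache hits, and why a balancer that never computes these hashes can’t preserve them once a second pod appears.</p>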

<p><strong>What this experiment answers:</strong> Does EPP prefix-cache-aware routing actually maintain high cache utilization under realistic multi-tenant load? Does the system hold TTFT stable as concurrency grows?</p>

<p><strong>What it doesn’t answer yet:</strong> What happens when you separate prefill and decode onto dedicated pods. That’s the next post.</p>

<p>This single-pod result is the baseline. Everything in the P/D disaggregation post gets measured against what we establish here.</p>

<hr />

<h2 id="the-load-test--simulating-multi-tenant-traffic">The Load Test — Simulating Multi-Tenant Traffic</h2>

<p>The Locust script simulates two traffic types in a 4:1 ratio:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># llmd-locust.py
</span><span class="n">MODEL</span> <span class="o">=</span> <span class="s">"Qwen/Qwen3-0.6B"</span>

<span class="c1"># Three tenant personas — each with a distinct long system prompt
</span><span class="n">TENANTS</span> <span class="o">=</span> <span class="p">[</span>
    <span class="s">"You are a financial analyst assistant specializing in market data. "</span> <span class="o">*</span> <span class="mi">5</span><span class="p">,</span>
    <span class="s">"You are a DevOps engineer assistant specializing in Kubernetes and CI/CD. "</span> <span class="o">*</span> <span class="mi">5</span><span class="p">,</span>
    <span class="s">"You are a data scientist assistant specializing in ML pipelines. "</span> <span class="o">*</span> <span class="mi">5</span><span class="p">,</span>
<span class="p">]</span>

<span class="k">class</span> <span class="nc">LLMDUser</span><span class="p">(</span><span class="n">HttpUser</span><span class="p">):</span>
    <span class="n">wait_time</span> <span class="o">=</span> <span class="n">between</span><span class="p">(</span><span class="mf">0.5</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">on_start</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">tenant_prompt</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">TENANTS</span><span class="p">)</span>  <span class="c1"># sticky for session
</span>
    <span class="o">@</span><span class="n">task</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">tenant_request</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="c1"># Repeat system prompt → cache hit opportunity
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">client</span><span class="p">.</span><span class="n">post</span><span class="p">(</span><span class="s">"/v1/chat/completions"</span><span class="p">,</span> <span class="n">json</span><span class="o">=</span><span class="p">{</span>
            <span class="s">"model"</span><span class="p">:</span> <span class="n">MODEL</span><span class="p">,</span>
            <span class="s">"messages"</span><span class="p">:</span> <span class="p">[</span>
                <span class="p">{</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"system"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">tenant_prompt</span><span class="p">},</span>
                <span class="p">{</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"user"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">([</span>
                    <span class="s">"Summarize the key points."</span><span class="p">,</span>
                    <span class="s">"What should I focus on?"</span><span class="p">,</span>
                    <span class="s">"Give me 3 recommendations."</span><span class="p">,</span>
                <span class="p">])}</span>
            <span class="p">],</span>
            <span class="s">"max_tokens"</span><span class="p">:</span> <span class="mi">80</span><span class="p">,</span>
        <span class="p">},</span> <span class="n">name</span><span class="o">=</span><span class="s">"tenant_session"</span><span class="p">)</span>

    <span class="o">@</span><span class="n">task</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">cold_request</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="c1"># No system prompt → cold cache, no prefix benefit
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">client</span><span class="p">.</span><span class="n">post</span><span class="p">(</span><span class="s">"/v1/chat/completions"</span><span class="p">,</span> <span class="n">json</span><span class="o">=</span><span class="p">{</span>
            <span class="s">"model"</span><span class="p">:</span> <span class="n">MODEL</span><span class="p">,</span>
            <span class="s">"messages"</span><span class="p">:</span> <span class="p">[{</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"user"</span><span class="p">,</span>
                          <span class="s">"content"</span><span class="p">:</span> <span class="sa">f</span><span class="s">"Question </span><span class="si">{</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1000</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">}],</span>
            <span class="s">"max_tokens"</span><span class="p">:</span> <span class="mi">50</span><span class="p">,</span>
        <span class="p">},</span> <span class="n">name</span><span class="o">=</span><span class="s">"cold_request"</span><span class="p">)</span>
</code></pre></div></div>

<p>The design is deliberate. Each simulated user picks a tenant persona at startup and sticks with it — the same ~200-token system prompt on every request. This is representative of real multi-tenant SaaS inference: each customer has a system prompt that defines their product’s persona, and it’s identical on every call. The cold_request tasks (1 in 5) have no system prompt and get no cache benefit — they’re the baseline comparison within the same run.</p>

<p>Requests travel the full path: Mac → SSH tunnel → <code class="language-plaintext highlighter-rouge">kubectl port-forward</code> → Istio gateway → EPP → vLLM pod. No shortcuts.</p>

<hr />

<h2 id="the-results">The Results</h2>

<h3 id="locust--26826-requests-zero-failures">Locust — 26,826 Requests, Zero Failures</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Hardware:  GH200 480GB (single decode pod, aggregated serving)
Model:     Qwen/Qwen3-0.6B
Rate:      32 req/s sustained

Task              Requests    Failures   p50    p95    p99    avg
───────────────────────────────────────────────────────────────────
cold_request      5,450       0 (0%)     200ms  270ms  280ms  207ms
tenant_session    21,376      0 (0%)     260ms  330ms  350ms  273ms
───────────────────────────────────────────────────────────────────
Aggregated        26,826      0 (0%)     260ms  330ms  350ms  260ms
</code></pre></div></div>

<p>Zero failures across 26,826 requests. p99 at 350ms. The system didn’t flinch.</p>

<p><strong>The comparison that matters</strong> — same Locust script structure, same traffic intent, from <a href="/llm-infrastructure/inference/2026/04/16/vllm-ollama-apple-silicon-experiment2.html">the Mac Mini post</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Mac Mini vLLM (5 users, 3:1 short/long):
  short prompt p50:   1,900ms
  long prompt  p50:  31,000ms
  TTFT P99:          ~7,000ms under load

llm-d on GH200 (tenant simulation, 4:1 tenant/cold):
  tenant_session p50:   260ms    ← 7.3× faster at median
  cold_request   p50:   200ms
  TTFT P99:              34ms    ← ~200× better tail latency
</code></pre></div></div>

<p>The gap is real, but it deserves honest attribution. Part of it is raw hardware — a GH200 is not an M4 Mac Mini. Part of it is the serving stack — llm-d with EPP routing vs vanilla vLLM. And part of it is the traffic shape — these aren’t identical experiments. What the numbers establish is a clear direction: hardware matters, but the serving architecture amplifies or squanders what the hardware can do.</p>

<hr />

<h2 id="what-grafana-showed">What Grafana Showed</h2>

<h3 id="the-performance-dashboard--read-this-one-first">The Performance Dashboard — Read This One First</h3>

<figure style="max-width:900px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/g3-llm-d-performance-dashboard.png" alt="llm-d Performance Dashboard — TTFT p50 15ms, p95 19ms, KV Cache Hit Rate 80.6%, Request Throughput 28.8 req/s, ITL p50 5ms" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    llm-d Performance Dashboard during the Locust run. Top-left: TTFT p50 at 15ms, p95 at 19ms — flat throughout.
    Top-right: ITL p50 at 5ms, p95 at 9.5ms — decode speed is stable.
    Middle: KV Cache Hit Rate at <strong>80.6%</strong>, Per-Pod at 80.6%.
    Bottom: Request Throughput 28.8 req/s, Request Queue showing active batching without buildup.
  </figcaption>
</figure>

<p>Three numbers worth pausing on:</p>

<p><strong>TTFT p50 at 15ms, flat under load.</strong> The Mac Mini showed TTFT climbing past 2 seconds at p50 and 7 seconds at p99 under comparable concurrency. Here it stays at 15ms throughout. This is not just hardware — it’s what happens when 80% of your requests skip prefill entirely because their KV blocks are already warm. The system is doing less work per request on average, which is why latency stays stable as concurrency grows.</p>

<p>A common mistake in inference optimisation is treating TTFT stability as a tuning parameter — something you achieve by adjusting batch sizes, memory utilization settings, or scheduler parameters. It isn’t. <strong>TTFT stability under load is an architectural property.</strong> It follows from having enough cache hit rate to keep the average prefill cost low. Once cache hit rate drops below ~50%, no amount of tuning recovers the tail latency. The right intervention is upstream: better routing, larger KV cache budgets, or separation of prefill and decode workloads.</p>

<p><strong>KV Cache Hit Rate at 80.6%.</strong> This is the single most actionable metric in multi-tenant inference. It tells you how much work the system is not doing. At 80.6%, roughly 4 in 5 tenant requests reuse cached KV blocks and skip prefill recomputation. The inverse is the expensive number: 19.4% of requests are cold — those are full prefill operations. In a system whose traffic would cost 100 GPU-hours of prefill per day against a cold cache, an 80% hit rate leaves 20 GPU-hours of executed prefill; raising it to 90% halves that to 10. That’s not a performance metric. That’s a cost metric.</p>
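<p>The arithmetic, reading the 100 GPU-hours as the prefill cost the day’s traffic would incur against a fully cold cache (an assumed baseline, for illustration):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Assumed: the day's traffic would cost 100 GPU-hours of prefill at 0% hit rate.
FULL_PREFILL_GPU_HOURS = 100.0

def prefill_hours(hit_rate: float) -> float:
    """GPU-hours of prefill actually executed -- only cache misses pay."""
    return FULL_PREFILL_GPU_HOURS * (1 - hit_rate)

for hr in (0.80, 0.90):
    print(f"hit rate {hr:.0%}: {prefill_hours(hr):.0f} GPU-hours of prefill executed")
</code></pre></div></div>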

<p><strong>ITL at 5ms.</strong> Inter-Token Latency is the gap between consecutive tokens during decode — a direct measure of decode throughput. At 5ms per token, that’s 200 tokens per second per request. More importantly it’s flat — it doesn’t increase as the test runs, which confirms there’s no memory pressure or scheduler thrashing affecting the decode phase. When ITL climbs under load, it’s usually a sign that the KV cache is filling and evictions are occurring. Here it’s not.</p>

<hr />

<h3 id="prefix-cache-hit-rate--the-number-behind-the-number">Prefix Cache Hit Rate — The Number Behind the Number</h3>

<figure style="max-width:900px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/g2-llm-d-diagnostic-prefix-cache.png" alt="llm-d Diagnostic Drill-Down — Prefix Cache Hit Rate 81.1%, Per-Instance Hit Rate holding steady at 80%, Decode Worker Utilization, Token Distribution" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    Diagnostic Drill-Down. Middle: Prefix Cache Hit Rate at <strong>81.1%</strong>, Per-Instance line steady at 80%
    throughout the test — not a burst artifact, sustained. Top: Routing section showing near-zero Idle GPU Time
    (GPU is fully utilised) and consistent Token Distribution. Bottom: P/D Disaggregation section showing
    Decode Worker Utilization — the single pod handling all prefill and decode work.
  </figcaption>
</figure>

<p>The 81.1% gauge is the aggregate. The Per-Instance time series at 80% is the more meaningful signal — it shows the cache was warm within the first few minutes and held that level for the entire test duration. This is what stable prefix cache routing looks like: not a spike that decays, but a plateau that holds.</p>

<p><strong>What this means at scale:</strong> At three tenant profiles with ~200-token system prompts, 81% cache hit rate is achievable on a single pod with a small model. As you scale to 50 tenants, 500, or 5000, the picture changes. The working set of system prompts grows beyond what a single pod’s KV cache can hold. Cache hit rate degrades. TTFT rises. This isn’t a failure of EPP routing — it’s a KV cache capacity problem, and the architectural response is either larger KV budgets per pod, more pods with affinity-based routing, or tiered KV offloading. The GH200’s NVLink-C2C is exactly the hardware that makes that third option viable. This experiment doesn’t exercise it. The architecture supports it.</p>
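<p>A rough scaling sketch, assuming Qwen3-0.6B-like KV shapes (28 layers, 8 KV heads, head dimension 128, fp16) and a hypothetical 20 GB per-pod KV budget:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Assumed model shapes and budget -- illustrative, not measured from the cluster.
layers, kv_heads, head_dim, dtype_bytes = 28, 8, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # bytes, K and V

SYSTEM_PROMPT_TOKENS = 200
KV_BUDGET_GB = 20  # hypothetical per-pod KV cache budget

def tenants_that_fit(budget_gb):
    """How many tenants' system-prompt prefixes stay resident at once."""
    return int(budget_gb * 1e9 // (SYSTEM_PROMPT_TOKENS * kv_per_token))

print(f"KV per tenant prefix: {SYSTEM_PROMPT_TOKENS * kv_per_token / 1e6:.1f} MB")
print(f"tenants resident in {KV_BUDGET_GB} GB: {tenants_that_fit(KV_BUDGET_GB)}")
</code></pre></div></div>

<p>Three tenant prefixes are nowhere near the cliff; a few hundred are. Past that point the prefixes start evicting each other and the hit rate decays no matter how well the router scores.</p>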

<hr />

<h3 id="e2e-latency-and-scheduler-state">E2E Latency and Scheduler State</h3>

<figure style="max-width:900px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/g4-llm-d-vllm-overview-e2e.png" alt="llm-d vLLM Overview — E2E p50 150ms, Token Throughput 2500 tps, ITL flat, TTFT 15-33ms, Scheduler State stable, Cache Utilization 0.15%" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    vLLM Overview dashboard. Top-left: E2E Request Latency p50 at 150ms, p95 at 260ms, p99 at 300ms, all flat.
    Top-right: Token Throughput at ~2,500 tps combined (prompt + generation).
    Middle-right: Scheduler State — Num Running and Num Waiting both stable, no queue buildup.
    Bottom-right: Cache Utilization at 0.15% — the 0.6B model leaves the entire KV pool available.
  </figcaption>
</figure>

<p>The Scheduler State panel deserves attention. <code class="language-plaintext highlighter-rouge">Num Running</code> is the active batch size — how many sequences share a GPU forward pass. <code class="language-plaintext highlighter-rouge">Num Waiting</code> is the queue depth. Both staying low and flat means continuous batching is absorbing load without queuing. Compare this to the Mac Mini Locust test where <code class="language-plaintext highlighter-rouge">Num Waiting</code> spiked to 5 and TTFT degraded proportionally.</p>

<p>Cache Utilization at 0.15% is a function of model size. Qwen3-0.6B at this concurrency level barely touches the KV pool. The same test with Llama-3-8B would show a fundamentally different curve — and that’s where the GH200’s memory architecture starts mattering in ways that go beyond raw numbers.</p>

<hr />

<h3 id="prefill-vs-decode--the-two-phase-separation-made-visible">Prefill vs Decode — The Two-Phase Separation Made Visible</h3>

<figure style="max-width:900px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/g5-llm-d-vllm-overview-prefill-decode.png" alt="llm-d vLLM Overview scrolled — Requests Prefill and Decode Time showing prefill dropping as cache warms, decode flat throughout, Request Prompt Length heatmap, Queue Time near zero" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    vLLM Overview lower panels. Bottom-left: <strong>Requests Prefill and Decode Time</strong> — yellow is prefill,
    green is decode. Prefill drops as the cache warms during the first minutes of the test, then stabilises.
    Decode stays flat throughout. Queue Time near zero. Top panels: Request Prompt Length heatmap showing
    two clusters — short cold requests and longer tenant prompts.
  </figcaption>
</figure>

<p>This panel is the most architecturally revealing one in the dashboard. The yellow prefill line and the green decode line on the same chart, from the same pod, tell you everything about why P/D disaggregation exists.</p>

<p>Prefill is compute-bound and bursty. Its cost scales with prompt length. It spikes when a cold request arrives. Decode is memory-bandwidth-bound and steady. Its cost scales with the number of tokens being generated. On the same pod, every prefill spike steals GPU time from active decode sequences — those sequences stall mid-generation while the prefill runs.</p>

<p>In a disaggregated setup, the yellow line comes from a prefill pod pool and the green line from a decode pod pool. Prefill spikes are isolated. Decode runs uninterrupted. TTFT for long prompts no longer delays short prompts waiting in the decode queue.</p>
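<p>A toy timeline makes the interference concrete. The numbers are illustrative rather than measured, and real vLLM schedulers soften the effect with techniques like chunked prefill, but the shape of the problem survives:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def token_times(decode_ms=5.0, n_tokens=20, prefill_at=10, prefill_ms=120.0,
                shared=True):
    """Emission time of each decode token; on a shared pod an arriving
    request's prefill preempts the in-flight stream."""
    t, times = 0.0, []
    for i in range(n_tokens):
        if shared and i == prefill_at:
            t += prefill_ms  # pod runs the newcomer's prefill first
        t += decode_ms
        times.append(t)
    return times

def worst_itl(times):
    """Largest gap between consecutive tokens."""
    return max(b - a for a, b in zip(times, times[1:]))

print(f"worst ITL, shared pod:  {worst_itl(token_times(shared=True)):.0f} ms")
print(f"worst ITL, decode-only: {worst_itl(token_times(shared=False)):.0f} ms")
</code></pre></div></div>

<p>With these toy numbers, the stream stalls for 125 ms mid-generation on the shared pod and never exceeds 5 ms on the decode-only pod. That spike in an otherwise flat stream is what the yellow line stealing from the green line looks like from the user’s side of the connection.</p>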

<p>Here those lines share a pod. The system works well at this scale and concurrency — the numbers prove it. But the architectural tension is visible in the chart. That’s what the next post is about.</p>

<hr />

<h3 id="ttft-p99--the-warmup-signature">TTFT P99 — The Warmup Signature</h3>

<figure style="max-width:900px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/g6-llm-d-ttft-p99.png" alt="llm-d Failure and Saturation — TTFT P99 starting at 39ms, dropping to 32ms minimum, stabilising at 34ms for the remainder" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    TTFT P99 over 30 minutes from the Failure and Saturation dashboard. Starts at 39ms (cold cache),
    drops to 32ms minimum as KV blocks accumulate, stabilises at ~34ms for the remainder of the test.
    P99 settling — not drifting upward — is the signal of a system that has reached steady state.
    For context: the Mac Mini TTFT P99 was climbing past 7 seconds under similar concurrent load.
  </figcaption>
</figure>

<p>TTFT P99 dropping from 39ms to 32ms and then holding at 34ms is the cache warmup signature in the tail. The first requests from each tenant session are cold — full prefill. As sessions accumulate, KV blocks warm, and the P99 reflects that. The key signal is the plateau: P99 stops dropping once the cache is warm and doesn’t creep upward under sustained load. This is a system that has found equilibrium.</p>

<hr />

<h3 id="model-throughput-and-queue-state">Model Throughput and Queue State</h3>

<figure style="max-width:900px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/g1-llm-d-diagnostic-throughput.png" alt="llm-d Diagnostic Drill-Down — Model Throughput 4000-5000 tps, Request Queue near zero, KV Cache Utilization 0.15%, Queue Utilization brief spikes draining immediately" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    Diagnostic Drill-Down model serving panels. Model Throughput at 4,000–5,000 tps during the Locust run.
    Request Queue Lengths near zero — requests are served without queuing. Queue Utilization shows brief spikes
    during burst arrivals that drain immediately — continuous batching absorbing load as designed.
    KV Cache Utilization at 0.15%.
  </figcaption>
</figure>

<p>Model Throughput at 4,000–5,000 tokens per second is what a GH200 looks like under moderate load with a small model. The brief Queue Utilization spikes that drain immediately are continuous batching doing its job — burst arrivals get absorbed into the current decode step rather than queuing. This is the architectural behaviour that separates vLLM from naive sequential servers.</p>

<hr />

<h2 id="the-complete-benchmark-reference">The Complete Benchmark Reference</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Hardware:  Lambda Labs GH200 480GB (Grace Hopper)
Model:     Qwen/Qwen3-0.6B
Stack:     llm-d v0.4.0, single decode pod (prefill: create: false)
           EPP: prefix-cache-scorer + queue-scorer + kv-utilization-scorer

─── Locust ───────────────────────────────────────────────────────────
  Total requests:   26,826     Failures:    0 (0%)
  Sustained rate:   32 req/s

  Task             p50    p95    p99    avg    req/s
  tenant_session:  260ms  330ms  350ms  273ms  25.5
  cold_request:    200ms  270ms  280ms  207ms   6.5
  Aggregated:      260ms  330ms  350ms  260ms  32.0

─── Grafana — Performance Dashboard ──────────────────────────────────
  TTFT p50:                 15ms
  TTFT p95:                 19ms
  TTFT P99 (settled):       ~34ms
  ITL p50:                  5ms   (≈ 200 tok/s)
  ITL p95:                  9.5ms
  KV Cache Hit Rate:        80.6% (gauge) / 81.1% (drill-down)
  Request Throughput:       28.8 req/s
  Model Throughput:         4,000–5,000 tps peak
  Cache Utilization:        0.15%

─── vs Mac Mini (Post 2, same script structure) ──────────────────────
  tenant p50:  Mac Mini 1,900ms  →  GH200 260ms   (7.3× faster)
  TTFT P99:    Mac Mini ~7,000ms →  GH200 34ms    (~200× better)
</code></pre></div></div>

<hr />

<h2 id="what-this-proves--and-what-it-doesnt">What This Proves — and What It Doesn’t</h2>

<p><strong>What it proves:</strong></p>

<p>EPP prefix cache routing achieves 81.1% cache hit rate under realistic multi-tenant load. That’s not a configuration artifact — it’s the result of the EPP scoring prefix hashes and routing tenant sessions to the pod holding their warm KV blocks. The system holds TTFT at 15ms p50 and 34ms p99 under 32 req/s sustained, with zero failures.</p>
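The routing decision itself is a weighted combination of per-pod signals. A toy scorer, with invented weights and signal values (the real logic lives in the EPP's prefix-cache, queue, and kv-utilization scorer plugins):

```shell
# Toy endpoint scorer, not the llm-d EPP implementation. Higher is better:
# reward a prefix-cache hit, penalise queue depth and KV utilization.
score() {
  awk -v hit="$1" -v queue="$2" -v kv="$3" \
    'BEGIN { printf "%.2f", 2.0 * hit - 0.5 * queue - 1.0 * kv }'
}
A=$(score 1.0 2 0.15)   # pod A: holds the tenant's warm prefix, short queue
B=$(score 0.0 0 0.05)   # pod B: idle but cold
echo "pod A: $A, pod B: $B -> route to pod A"
```

The point the sketch makes: a pod with a warm prefix wins even against an idle pod, because skipping prefill is worth more than an empty queue.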

<p>More broadly: in multi-tenant inference systems with repeated system prompts, <strong>cache hit rate is typically the highest-leverage optimisation variable.</strong> GPU utilization, batch size, and quantization all matter — but they reduce the cost of work the system is already doing. Cache hit rate determines how much work gets skipped entirely. At 81%, the system is doing roughly 5× less prefill work than a round-robin deployment that scatters the same tenant’s requests across pods with cold caches. That ratio is hard to match through hardware improvements alone.</p>

<p><strong>What it doesn’t prove:</strong></p>

<p>This is a single pod, a small model, and three tenant profiles. The working set fits comfortably in the KV cache — hence 0.15% utilization. Real multi-tenant deployments have hundreds or thousands of tenant profiles. As the working set grows, cache hit rate degrades. The EPP routing remains correct, but the pod’s KV cache can’t hold every tenant’s prefix simultaneously. The responses to that problem — larger KV allocations, more pods with affinity routing, tiered KV offloading to the GH200’s LPDDR5X — are architectural decisions that require knowing the working set size and access pattern distribution. This experiment establishes that the routing mechanism works. Capacity planning is a separate problem.</p>

<hr />

<h2 id="what-most-teams-get-wrong">What Most Teams Get Wrong</h2>

<p>Most inference optimisation work focuses on the <strong>supply side</strong> — faster hardware, more efficient models, better batching. This is necessary but insufficient. The demand side — <strong>how requests are shaped and routed before they reach the GPU</strong> — is where the real leverage lives at scale.</p>

<p>The dominant optimisation lever depends on where your system actually sits. Before reaching for more hardware, diagnose the bottleneck:</p>

<table>
  <thead>
    <tr>
      <th>Scenario</th>
      <th>Primary Bottleneck</th>
      <th>Highest-Leverage Fix</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Low cache hit rate (&lt;50%)</td>
      <td>Prefill recomputation</td>
      <td>Routing + request shaping</td>
    </tr>
    <tr>
      <td>High cache hit, low utilization</td>
      <td>Scheduling inefficiency</td>
      <td>Continuous batching config</td>
    </tr>
    <tr>
      <td>High utilization, rising latency</td>
      <td>Memory bandwidth</td>
      <td>Better hardware / parallelism</td>
    </tr>
    <tr>
      <td>High working set (many tenants)</td>
      <td>KV cache capacity</td>
      <td>Tiered cache / pod sharding</td>
    </tr>
  </tbody>
</table>

<p>Most teams misdiagnose their bottleneck and optimise the wrong layer. A team with a 30% cache hit rate buying faster GPUs is solving the wrong problem — they’ll run twice as fast through twice as much unnecessary prefill work. The 81% cache hit rate in these experiments means the system is doing roughly 5× less prefill than a naive round-robin deployment on identical hardware. No GPU upgrade achieves that ratio. Routing does.</p>

<p>The practical entry point: instrument your cache hit rate first. If it’s below 50%, the fix is upstream — consistent system prompts, session-affinity routing, and a scheduler that knows about prefix hashes. Hardware comes after you’ve exhausted the routing lever.</p>
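Instrumenting the hit rate is just dividing two counters. vLLM exposes prefix-cache queries and hits as Prometheus counters (exact metric names vary by version); the arithmetic, using illustrative totals chosen to reproduce this run's figure:

```shell
# Hit rate from two counters. These totals are illustrative stand-ins;
# on a live system they come from vLLM's Prometheus metrics endpoint.
queries_total=26826
hits_total=21620
hit_rate=$(awk -v h="$hits_total" -v q="$queries_total" \
  'BEGIN { printf "%.1f", 100 * h / q }')
echo "prefix cache hit rate: ${hit_rate}%"   # below ~50%? fix routing first
```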

<p><strong>A non-obvious failure mode: high aggregate cache hit rate masking per-tenant unfairness.</strong></p>

<p>A subtle failure appears when cache hit rate is high but unevenly distributed across tenants. If a small number of tenants dominate traffic, their KV blocks stay hot while long-tail tenants constantly miss the cache. The system reports a healthy aggregate hit rate — 80%, 81% — but tail latency degrades because cold tenants always pay full prefill cost.</p>

<p>This leads to a misleading conclusion: “the cache is working.” In reality, the system is biased toward high-frequency tenants. The aggregate metric looks healthy precisely because the popular tenants pull it up — while every new or low-frequency tenant experiences the system as if caching doesn’t exist.</p>

<p>In these experiments, three tenant profiles at similar frequencies produced clean aggregate numbers. Production systems with hundreds of tenants and a power-law access distribution will not. Fixing uneven cache efficiency requires either per-tenant cache accounting to surface the distribution, or explicit isolation of long-tail traffic to prevent it from competing with high-frequency prefixes for the same KV blocks. Without this, cache efficiency improves averages while silently degrading fairness — which is the kind of problem that shows up in p99 SLA breaches, not in dashboard summaries.</p>
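A sketch of what per-tenant accounting surfaces, on invented log data: the aggregate number looks healthy while one tenant never hits the cache.

```shell
# Invented request log: tenant id and cache outcome per request.
log='tenantA hit
tenantA hit
tenantA hit
tenantA hit
tenantB hit
tenantB hit
tenantB hit
tenantC miss
tenantC miss'
report=$(echo "$log" | awk '
  { total[$1]++; all++ }
  $2 == "hit" { hits[$1]++; allhits++ }
  END {
    printf "aggregate: %.0f%%\n", 100 * allhits / all
    for (t in total) printf "%s: %.0f%%\n", t, 100 * (hits[t] + 0) / total[t]
  }' | sort)
echo "$report"   # aggregate looks fine; tenantC pays full prefill every time
```

A dashboard showing only the aggregate line would report a healthy cache while tenantC's p99 quietly breaches SLA.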

<hr />

<h2 id="connecting-the-dots-across-this-series">Connecting the Dots Across This Series</h2>

<p>This series started on an M4 Mac Mini with a 400MB model and a Python script measuring TTFT. It’s worth pausing to trace what the numbers across all three posts are actually saying — because they’re saying the same thing at different scales.</p>

<p><strong>Post 1 established the theory:</strong> decode is memory-bandwidth bound. Prefill competes with decode for the same resources. KV cache management is the central architectural problem in inference. These aren’t vLLM-specific observations — they’re properties of the transformer architecture itself. They hold on any hardware, any framework.</p>

<p><strong>Post 2 ran two experiments on identical hardware.</strong> Ollama vs vLLM on the same M4 chip, same unified memory, same model family. The result — 14,062ms vs 6,543ms p50, <strong>2.15× faster</strong> — had nothing to do with the hardware. Ollama processes requests sequentially. vLLM uses continuous batching. Same silicon, 2× throughput difference from a software architecture decision. That result matters because it isolates the variable: serving architecture, not hardware.</p>

<p>Then under load, even vLLM on the Mac Mini hit the ceiling. Long prompts drove TTFT to 31 seconds at p50. The Mac Mini’s 16GB unified memory pool — shared between model weights, KV cache, and OS — ran out of headroom. The principle from Post 1 materialised as a real number: when the KV cache competes for the same memory as the model weights, concurrency suffers.</p>

<p><strong>This post added two more variables: real GPU hardware and intelligent routing.</strong> The GH200 with NVLink-C2C at 900 GB/s changes the memory economics — the GPU can address 480GB at viable bandwidth, not just 16GB. EPP prefix cache routing adds the third variable: the system avoids prefill work entirely for 81% of requests by keeping KV blocks warm.</p>

<p>The result is 260ms p50 and 34ms P99 at 32 req/s. But attributing that entirely to the GH200 would be wrong — and that’s the point. The Ollama experiment on the Mac Mini already proved that hardware alone doesn’t determine the outcome. The GH200 sets a much higher ceiling. EPP routing determines how close you get to it.</p>

<p><strong>The through-line:</strong> hardware sets the ceiling. Serving architecture determines how close you get to it. This has been true at every scale in this series — a $0 Mac Mini, a $2.29/hr GH200, and everything in between. The engineers who understand this spend their optimisation budget on routing and request shaping first, and on hardware second. The engineers who don’t buy more GPUs and wonder why the numbers don’t improve proportionally.</p>

<hr />

<h2 id="what-comes-next">What Comes Next</h2>

<p>The prefill vs decode time panel in this post showed two lines on the same chart — prefill varying with cache state, decode flat throughout. On this single pod they share a GPU. A prefill spike for a long-prompt request steals compute from ongoing decode sequences. The system handles it at this concurrency. It wouldn’t at 10×.</p>

<p>The next post deploys the P/D disaggregation guide on a second GH200 instance — separate prefill and decode pods, NIXL sidecar for KV cache transfer between them. The setup, the results, and the honest account of what P/D disaggregation actually requires in terms of hardware are all in one post.</p>

<p>The question it answers: given the baseline established here — 81.1% cache hit rate, 260ms p50, 34ms P99 — does separating prefill and decode onto dedicated pods move those numbers, or have we already captured most of the available gain on a single aggregated pod? The answer is more nuanced than either “yes it’s better” or “no it’s not worth it” — and the hardware constraint that makes it nuanced is one the GH200’s architecture makes visible in a way discrete GPU systems don’t.</p>

<hr />

<p><em>Experiments run on Lambda Labs GH200 480GB, llm-d v0.4.0, Qwen3-0.6B, vllm/vllm-openai:latest. Platform engineer with 11+ years in distributed systems going deep on LLM serving infrastructure.</em></p>

<p><em><a href="https://github.com/kraghavan">GitHub</a> · <a href="https://linkedin.com/in/karthikaraghavan">LinkedIn</a></em></p>]]></content><author><name>Karthika Raghavan</name></author><category term="llm-infrastructure" /><category term="inference" /><category term="llm-d" /><category term="epp" /><category term="prefix-cache" /><category term="kubernetes" /><category term="vllm" /><category term="gpu" /><category term="gh200" /><category term="grace-hopper" /><category term="locust" /><category term="prometheus" /><category term="grafana" /><category term="benchmarks" /><category term="inference-optimization" /><summary type="html"><![CDATA[The stack is deployed. Now let's see what it actually does. EPP prefix cache routing, 81.1% KV cache hit rate, TTFT at 15ms p50, and what those numbers mean for teams building multi-tenant inference at scale.]]></summary></entry><entry><title type="html">Deploying llm-d on a Cloud GPU — The 10 Things Nobody Tells You</title><link href="https://kraghavan.github.io/llm-infrastructure/inference/2026/04/17/vllm-llm-d-nvidia-gh200-experiment.html" rel="alternate" type="text/html" title="Deploying llm-d on a Cloud GPU — The 10 Things Nobody Tells You" /><published>2026-04-17T00:00:00+00:00</published><updated>2026-04-17T00:00:00+00:00</updated><id>https://kraghavan.github.io/llm-infrastructure/inference/2026/04/17/vllm-llm-d-nvidia-gh200-experiment</id><content type="html" xml:base="https://kraghavan.github.io/llm-infrastructure/inference/2026/04/17/vllm-llm-d-nvidia-gh200-experiment.html"><![CDATA[<p>Let me set expectations before we start.</p>

<p>I had done <a href="/llm-infrastructure/inference/2026/04/16/vllm-ollama-apple-silicon-experiment2.html">the Mac Mini experiments</a>. I had real benchmark numbers. I understood the theory. I knew what prefill and decode were, I could explain PagedAttention at a whiteboard, and I had wired up Prometheus and Grafana from scratch without breaking anything important.</p>

<p>Then I tried to deploy llm-d on a cloud GPU and spent the better part of a weekend staring at <code class="language-plaintext highlighter-rouge">permission denied</code>, <code class="language-plaintext highlighter-rouge">No such file or directory</code>, <code class="language-plaintext highlighter-rouge">image pull failed</code>, and other messages that are technically informative and emotionally frustrating. I’ll be candid — I walked away from the terminal twice while the instance meter kept running. GPU rental has a way of focusing the mind.</p>

<p>This post is the deployment war story. Every broken thing, in roughly the order it broke. I’m not sugarcoating the experience or offering a polished “here are the steps” — that promise already exists in the official docs, which assume a level of environmental cooperation and prior experience that I did not have. What the docs don’t tell you is what I’m writing here.</p>

<p>If you’re trying to deploy llm-d yourself, this post will save you several hours and possibly your sanity. If you’re reading this for entertainment, welcome — the GH200 and I had quite a journey.</p>

<hr />

<h2 id="what-is-llm-d-one-diagram-then-we-move-on">What Is llm-d? (One Diagram, Then We Move On)</h2>

<p>llm-d is a Kubernetes-native inference scheduling layer that sits on top of vLLM. It doesn’t replace vLLM — vLLM still runs inside each pod doing exactly what it always does. What llm-d adds is an <strong>EPP (Endpoint Picker Plugin)</strong> — a custom Kubernetes scheduler that routes each incoming request to the <em>right</em> vLLM pod based on KV cache state, queue depth, and prefix cache hit probability.</p>

<p>The pitch: instead of dumb round-robin load balancing, llm-d routes your request to the decode pod that already has your system prompt cached. Lower TTFT, better GPU utilization, independently scalable prefill and decode pools.</p>

<p>The official architecture diagram shows the contrast clearly:</p>

<figure style="max-width:900px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/llm-d-Vs-NGINX.png" alt="llm-d vs legacy NGINX routing — showing prefix-cache-aware routing, P/D disaggregation, tiered KV cache, and SLO-aware autoscaling" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    Left: legacy round-robin — no prefix cache awareness, no prefill/decode split, generic QPS autoscaling.
    Right: llm-d architecture — EPP prefix-cache-aware routing, dedicated prefill and decode pools connected via RDMA,
    tiered KV cache, SLO-aware autoscaling of each pool independently.
    Source: <a href="https://llm-d.ai/docs/architecture">llm-d.ai/docs/architecture</a>
  </figcaption>
</figure>

<p>That’s what you’re deploying. Now let’s talk about what it takes to actually run it.</p>

<hr />

<h2 id="the-hardware">The Hardware</h2>

<p><strong>Lambda Labs GH200</strong>, single instance, Grace Hopper Superchip.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GPU:          NVIDIA GH200 480GB (unified CPU+GPU memory)
Architecture: ARM64 (aarch64)
OS:           Ubuntu 22.04 LTS
Storage:      1.4TB local SSD
</code></pre></div></div>

<p>The ARM64 part is important — it comes up multiple times in this post in ways that will make you want to scream. Lambda’s GH200 instances come pre-installed with K3s, GPU drivers, and a collection of tools that seem helpful until they quietly conflict with everything you’re trying to do.</p>

<figure style="max-width:900px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/llm-d-on-GH200-lambda-labs.png" alt="Architecture of the llm-d deployment on a single GH200 node: Istio ingress gateway, EPP endpoint picker with scoring plugins, vLLM inference engine, and observability stack" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
  An architectural overview of the llm-d deployment on a single GH200 node: the complete request lifecycle, from the Istio ingress gateway through the EPP (Endpoint Picker) with its scoring plugins, down to the vLLM inference engine and the observability stack.
  </figcaption>
</figure>

<hr />

<h2 id="before-you-even-start-getting-the-instance">Before You Even Start: Getting the Instance</h2>

<p>There is a step zero that nobody writes about: actually getting the GPU.</p>

<p>Cloud GPU availability — especially for high-end hardware like the GH200 — is genuinely constrained. Lambda Labs operates on a first-come, first-served basis for on-demand instances. When you log in and open the Launch Instance dialog, you will see a list of available GPU types. Some will be available. Some will say “Out of capacity.” The GH200 in particular comes and goes.</p>

<figure style="max-width:800px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/lambda-labs.png" alt="Lambda Labs Launch Instance dialog showing available GPU types — GH200 at $2.29/hr highlighted, B200 instances showing Out of capacity" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    Lambda Labs instance selector. The 1x GH200 (96GB), pairing an ARM64 Grace CPU with a Hopper-class GPU, at $2.29/hr was the instance used for these experiments.
    Note the B200 instances showing "Out of capacity" — high-end GPU availability is genuinely constrained and changes throughout the day.
    The GH200 is listed as 96GB here but the actual unified memory pool is 480GB — the 96GB refers to the HBM portion.
  </figcaption>
</figure>

<p><strong>Getting set up on Lambda Labs:</strong></p>

<p>Creating an account takes about 5 minutes. Go to <a href="https://lambda.ai">lambda.ai</a>, sign up, add a payment method, and add your SSH public key under SSH Keys in the dashboard. That last step is easy to forget and will stop you from connecting to any instance you launch.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Your SSH public key — paste this into Lambda Labs SSH Keys dashboard</span>
<span class="nb">cat</span> ~/.ssh/id_&lt;my-key&gt;.pub
</code></pre></div></div>

<p>Once your account is ready, the practical strategy for getting a GH200 on-demand is:</p>

<ul>
  <li><strong>Check availability in the morning</strong> — instances turn over as other users terminate their sessions overnight</li>
  <li><strong>Have your SSH key already added</strong> — you want to be able to launch immediately when a slot opens</li>
  <li><strong>Don’t launch and walk away</strong> — you’re paying per hour, so have your setup commands ready to run</li>
  <li><strong>Set a budget alert</strong> — Lambda Labs doesn’t do this automatically; track your usage manually</li>
</ul>

<p><strong>What these experiments actually cost:</strong></p>

<p>I ran two separate sessions, on two separate GH200 instances. Being honest about the numbers:</p>

<table>
  <thead>
    <tr>
      <th>Session</th>
      <th>What ran</th>
      <th>Cost</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Instance <code class="language-plaintext highlighter-rouge">192.222.50.71</code></td>
      <td>Experiments 1–4 (inference-scheduling, EPP routing, Locust load tests)</td>
      <td><strong>$17.47</strong></td>
    </tr>
    <tr>
      <td>Instance <code class="language-plaintext highlighter-rouge">192.222.57.186</code></td>
      <td>Experiments 5–6 (P/D disaggregation, prefill + decode pods, sustained Locust run)</td>
      <td><strong>$18.31</strong></td>
    </tr>
    <tr>
      <td><strong>Total</strong></td>
      <td> </td>
      <td><strong>$35.78</strong></td>
    </tr>
  </tbody>
</table>

<p>The second session was slightly more expensive because the P/D disaggregation setup took longer to get right — more iteration time on a running instance. The first session was faster in wall-clock time but I was also slower at debugging, which explains the comparable cost.</p>

<p>For context: $35.78 for two full days of hands-on GPU infrastructure experiments is genuinely reasonable. A single A100 hour on AWS is $3.50+. Lambda Labs on-demand pricing is competitive precisely because availability isn’t guaranteed — you trade reliability for cost.</p>

<p><strong>One important note for LLMOps researchers:</strong> I’m running these experiments to gain production-like experience, which means I’m deliberate about GPU spend. On-demand instances that you terminate when done are the right strategy here — avoid reserved instances or always-on setups until you have a specific recurring workload that justifies the commitment.</p>

<hr />

<h2 id="gotcha-1-the-kubeconfig-belongs-to-root">Gotcha 1: The Kubeconfig Belongs to Root</h2>

<p>First thing you do when you SSH into a new Kubernetes node: check the cluster is running.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl get namespaces
</code></pre></div></div>

<p>What you get instead:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>WARN[0000] Unable to read /etc/rancher/k3s/k3s.yaml, please start server
with --write-kubeconfig-mode or --write-kubeconfig-group to modify kube
config permissions
error: error loading config file "/etc/rancher/k3s/k3s.yaml":
open /etc/rancher/k3s/k3s.yaml: permission denied
</code></pre></div></div>

<p>K3s installs its kubeconfig at <code class="language-plaintext highlighter-rouge">/etc/rancher/k3s/k3s.yaml</code> and owns it as root. Your Ubuntu user is not root. Nobody tells you this in the getting-started guide because it seems like a detail, and it is — right until it silently breaks every single <code class="language-plaintext highlighter-rouge">kubectl</code> and <code class="language-plaintext highlighter-rouge">helm</code> command you run for the next two hours.</p>

<p><strong>The fix:</strong> copy the config to your home directory and set the environment variable.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir</span> <span class="nt">-p</span> <span class="nv">$HOME</span>/.kube
<span class="nb">sudo cp</span> /etc/rancher/k3s/k3s.yaml <span class="nv">$HOME</span>/.kube/config
<span class="nb">sudo chown </span>ubuntu:ubuntu <span class="nv">$HOME</span>/.kube/config
<span class="nb">export </span><span class="nv">KUBECONFIG</span><span class="o">=</span><span class="nv">$HOME</span>/.kube/config

<span class="c"># Make it permanent — critical for survival across sessions</span>
<span class="nb">echo</span> <span class="s1">'export KUBECONFIG=$HOME/.kube/config'</span> <span class="o">&gt;&gt;</span> ~/.bashrc
<span class="nb">source</span> ~/.bashrc
</code></pre></div></div>

<p>Now <code class="language-plaintext highlighter-rouge">kubectl get nodes</code> works. You feel a small surge of optimism. Cherish it.</p>

<hr />

<h2 id="gotcha-2-lambda-ships-snap-helm-and-it-is-not-your-friend">Gotcha 2: Lambda Ships Snap Helm and It Is Not Your Friend</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>which helm
<span class="c"># /snap/bin/helm</span>

<span class="nb">ls</span> <span class="nt">-lla</span> /snap/bin/helm
<span class="c"># lrwxrwxrwx 1 root root 13 → /usr/bin/snap</span>
</code></pre></div></div>

<p>Lambda Stack pre-installs Helm via snap. Snap packages run in a sandbox with PATH and permission constraints that quietly break plugin installations and kubeconfig resolution. The llm-d deployment requires specific Helm plugins (<code class="language-plaintext highlighter-rouge">helm-diff</code>) and specific versions. The snap Helm and plugin system do not get along cleanly on ARM64.</p>

<p><strong>The fix:</strong> remove snap Helm and install the official arm64 binary.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Remove snap helm first</span>
<span class="nb">sudo </span>snap remove helm

<span class="c"># Install official arm64 binary</span>
wget https://get.helm.sh/helm-v3.19.0-linux-arm64.tar.gz
<span class="nb">tar </span>xzf helm-v3.19.0-linux-arm64.tar.gz
<span class="nb">sudo mv </span>linux-arm64/helm /usr/local/bin/helm
<span class="nb">rm</span> <span class="nt">-rf</span> linux-arm64 helm-v3.19.0-linux-arm64.tar.gz

<span class="c"># Verify</span>
helm version
<span class="c"># version.BuildInfo{Version:"v3.19.0"...}</span>

<span class="c"># Export and persist</span>
<span class="nb">export </span><span class="nv">HELM_BIN</span><span class="o">=</span>/usr/local/bin/helm
<span class="nb">echo</span> <span class="s1">'export HELM_BIN=/usr/local/bin/helm'</span> <span class="o">&gt;&gt;</span> ~/.bashrc
<span class="nb">source</span> ~/.bashrc
</code></pre></div></div>

<p>Notice I added <code class="language-plaintext highlighter-rouge">HELM_BIN</code> to <code class="language-plaintext highlighter-rouge">.bashrc</code>. This is foreshadowing.</p>

<hr />

<h2 id="gotcha-3-environment-variables-die-when-you-close-the-terminal">Gotcha 3: Environment Variables Die When You Close the Terminal</h2>

<p>This one is deceptively simple and caused a disproportionate amount of grief.</p>

<p>llm-d’s <code class="language-plaintext highlighter-rouge">helmfile</code> commands use <code class="language-plaintext highlighter-rouge">HELM_BIN</code> to find the Helm binary, and take the namespace via <code class="language-plaintext highlighter-rouge">-n $NAMESPACE</code>. Both are environment variables. Both need to exist in every session.</p>

<p>SSH sessions do not carry your previous session’s exported variables. Every time you reconnect, those variables are gone — and the error messages when they’re missing are spectacularly unhelpful:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># What happens when NAMESPACE is empty:</span>
helmfile apply <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>
<span class="c"># flag needs an argument: 'n' in -n</span>
</code></pre></div></div>

<p>That message doesn’t say “your environment variable is empty.” It sends you hunting through Helm documentation for a flag you’ve never seen, before you eventually notice the problem.</p>

<p><strong>The fix:</strong> put everything in <code class="language-plaintext highlighter-rouge">~/.bashrc</code> and verify at the start of every session.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo</span> <span class="s1">'export KUBECONFIG=$HOME/.kube/config'</span> <span class="o">&gt;&gt;</span> ~/.bashrc
<span class="nb">echo</span> <span class="s1">'export HELM_BIN=/usr/local/bin/helm'</span> <span class="o">&gt;&gt;</span> ~/.bashrc
<span class="nb">echo</span> <span class="s1">'export NAMESPACE=llm-d'</span> <span class="o">&gt;&gt;</span> ~/.bashrc
<span class="nb">source</span> ~/.bashrc

<span class="c"># First command of every session:</span>
<span class="nb">echo</span> <span class="nv">$KUBECONFIG</span> <span class="nv">$HELM_BIN</span> <span class="nv">$NAMESPACE</span>
<span class="c"># /home/ubuntu/.kube/config /usr/local/bin/helm llm-d</span>
</code></pre></div></div>

<p>If those three don’t print correctly, nothing downstream will work.</p>
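A small guard at the top of any deployment script turns the cryptic “flag needs an argument” failure into an immediate, named error, using bash’s `${VAR:?message}` expansion:

```shell
# ${VAR:?message} aborts with a clear error when VAR is unset or empty --
# far better than letting an empty $NAMESPACE reach helmfile.
check_env() {
  : "${KUBECONFIG:?export KUBECONFIG=\$HOME/.kube/config first}"
  : "${HELM_BIN:?export HELM_BIN=/usr/local/bin/helm first}"
  : "${NAMESPACE:?export NAMESPACE=llm-d first}"
}
# Illustrative values; on the instance these come from ~/.bashrc
export KUBECONFIG="$HOME/.kube/config" HELM_BIN=/usr/local/bin/helm NAMESPACE=llm-d
check_env && echo "environment OK"
```

Call `check_env` before any `helmfile` or `kubectl` command and the empty-variable class of failure disappears.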

<hr />

<h2 id="gotcha-4-the-default-valuesyaml-will-crash-your-gpu">Gotcha 4: The Default values.yaml Will Crash Your GPU</h2>

<p>When you clone the llm-d repo and open the inference-scheduling guide’s <code class="language-plaintext highlighter-rouge">values.yaml</code>, the default looks like this:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">modelArtifacts</span><span class="pi">:</span>
  <span class="na">uri</span><span class="pi">:</span> <span class="s2">"</span><span class="s">hf://Qwen/Qwen3-32B/tensor"</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s2">"</span><span class="s">Qwen/Qwen3-32B"</span>
  <span class="na">size</span><span class="pi">:</span> <span class="s">80Gi</span>
</code></pre></div></div>

<p>Qwen3-32B. Eighty gigabytes. On a single GPU.</p>

<p>Here is why this matters. GPU memory is not a bottomless pool — it gets divided between everything vLLM needs to run:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────────────────────────────────────────────────────┐
│              GH200 — 480GB Unified Memory               │
├──────────────────────┬──────────────────────────────────┤
│  Model Weights       │  Qwen3-32B FP16 ≈ 65GB           │
│  (static, loaded     │  Qwen3-0.6B 4-bit ≈ 0.4GB        │
│   once at startup)   │                                  │
├──────────────────────┼──────────────────────────────────┤
│  KV Cache            │  Grows per token, per request    │
│  (dynamic, grows     │  Fills whatever is left over     │
│   with context)      │                                  │
├──────────────────────┼──────────────────────────────────┤
│  CUDA Graphs,        │  vLLM pre-allocates ~10–15%      │
│  Activations,        │  for warmup and graph capture    │
│  Overhead            │                                  │
└──────────────────────┴──────────────────────────────────┘
</code></pre></div></div>

<figure style="max-width:800px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/qwen-inside-GH200.png" alt="GPU memory utilization breakdown on a 480GB GH200 node: Qwen3-32B weights at 65GB versus Qwen3-0.6B at 0.4GB, with the remaining headroom available for KV cache" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    GPU memory utilization breakdown on a 480GB node: Qwen3-32B vs. Qwen3-0.6B. This comparison highlights the memory pressure caused by large model weights (65GB) versus the massive KV cache headroom unlocked by using a highly optimized, smaller model.
  </figcaption>
</figure>

<p>With Qwen3-32B, model weights alone consume 65GB. vLLM then pre-allocates KV cache for the maximum sequence length on top. Under concurrent load you are pushing the GPU extremely hard before a single user request arrives — and you’ve also got 20+ minutes of model download and warmup before you can even test anything.</p>

<p>With Qwen3-0.6B (~400MB), the model loads in under 2 minutes and 99% of the GPU is available for KV cache and experiments. That’s the version to start with.</p>
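The per-token KV cache cost is easy to estimate with back-of-envelope arithmetic. The layer, head, and dimension numbers below are illustrative stand-ins, not exact Qwen3 configs (the real values are in each model’s config.json):

```shell
# Back-of-envelope KV cache cost per token. Model-shape numbers are
# illustrative stand-ins; fp16 = 2 bytes per element.
layers=28
kv_heads=8
head_dim=128
bytes_per_param=2
# 2 tensors (K and V) per layer, per token:
per_token=$(( 2 * layers * kv_heads * head_dim * bytes_per_param ))
echo "KV cache per token: $per_token bytes (~$(( per_token / 1024 )) KiB)"
echo "one 4k-token context: $(( per_token * 4096 / 1024 / 1024 )) MiB"
```

Multiply that per-context figure by the concurrency you expect and you see why the weights are only the down payment: the KV cache is the recurring cost.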

<p><strong>The fix:</strong> replace the values file entirely. The full working <code class="language-plaintext highlighter-rouge">values.yaml</code> is below — note two critical changes from the default: model is Qwen3-0.6B, and the image is changed (explained in the next gotcha).</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># ms-inference-scheduling/values.yaml — working version</span>
<span class="na">multinode</span><span class="pi">:</span> <span class="no">false</span>

<span class="na">modelArtifacts</span><span class="pi">:</span>
  <span class="na">uri</span><span class="pi">:</span> <span class="s2">"</span><span class="s">hf://Qwen/Qwen3-0.6B"</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s2">"</span><span class="s">Qwen/Qwen3-0.6B"</span>
  <span class="na">size</span><span class="pi">:</span> <span class="s">2Gi</span>
  <span class="na">authSecretName</span><span class="pi">:</span> <span class="s2">"</span><span class="s">llm-d-hf-token"</span>
  <span class="na">labels</span><span class="pi">:</span>
    <span class="na">llm-d.ai/inference-serving</span><span class="pi">:</span> <span class="s2">"</span><span class="s">true"</span>
    <span class="na">llm-d.ai/guide</span><span class="pi">:</span> <span class="s2">"</span><span class="s">inference-scheduling"</span>
    <span class="na">llm-d.ai/accelerator-variant</span><span class="pi">:</span> <span class="s2">"</span><span class="s">gpu"</span>
    <span class="na">llm-d.ai/accelerator-vendor</span><span class="pi">:</span> <span class="s2">"</span><span class="s">nvidia"</span>
    <span class="na">llm-d.ai/model</span><span class="pi">:</span> <span class="s2">"</span><span class="s">Qwen3-0.6B"</span>

<span class="na">routing</span><span class="pi">:</span>
  <span class="na">proxy</span><span class="pi">:</span>
    <span class="na">enabled</span><span class="pi">:</span> <span class="no">false</span>
    <span class="na">targetPort</span><span class="pi">:</span> <span class="m">8000</span>

<span class="na">accelerator</span><span class="pi">:</span>
  <span class="na">type</span><span class="pi">:</span> <span class="s">nvidia</span>

<span class="na">decode</span><span class="pi">:</span>
  <span class="na">create</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">parallelism</span><span class="pi">:</span>
    <span class="na">tensor</span><span class="pi">:</span> <span class="m">1</span>
    <span class="na">data</span><span class="pi">:</span> <span class="m">1</span>
  <span class="na">replicas</span><span class="pi">:</span> <span class="m">1</span>
  <span class="na">monitoring</span><span class="pi">:</span>
    <span class="na">podmonitor</span><span class="pi">:</span>
      <span class="na">enabled</span><span class="pi">:</span> <span class="no">true</span>
      <span class="na">portName</span><span class="pi">:</span> <span class="s2">"</span><span class="s">vllm"</span>
      <span class="na">path</span><span class="pi">:</span> <span class="s2">"</span><span class="s">/metrics"</span>
      <span class="na">interval</span><span class="pi">:</span> <span class="s2">"</span><span class="s">30s"</span>
  <span class="na">containers</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s2">"</span><span class="s">vllm"</span>
      <span class="na">image</span><span class="pi">:</span> <span class="s">vllm/vllm-openai:latest</span>   <span class="c1"># ← critical change, see next gotcha</span>
      <span class="na">modelCommand</span><span class="pi">:</span> <span class="s">vllmServe</span>
      <span class="na">args</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="s2">"</span><span class="s">--disable-uvicorn-access-log"</span>
        <span class="pi">-</span> <span class="s2">"</span><span class="s">--gpu-memory-utilization=0.90"</span>
      <span class="na">ports</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="na">containerPort</span><span class="pi">:</span> <span class="m">8000</span>
          <span class="na">name</span><span class="pi">:</span> <span class="s">vllm</span>
          <span class="na">protocol</span><span class="pi">:</span> <span class="s">TCP</span>
      <span class="na">resources</span><span class="pi">:</span>
        <span class="na">limits</span><span class="pi">:</span>
          <span class="na">cpu</span><span class="pi">:</span> <span class="s1">'</span><span class="s">16'</span>
          <span class="na">memory</span><span class="pi">:</span> <span class="s">60Gi</span>
        <span class="na">requests</span><span class="pi">:</span>
          <span class="na">cpu</span><span class="pi">:</span> <span class="s1">'</span><span class="s">8'</span>
          <span class="na">memory</span><span class="pi">:</span> <span class="s">30Gi</span>
      <span class="na">mountModelVolume</span><span class="pi">:</span> <span class="no">true</span>
      <span class="na">volumeMounts</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">metrics-volume</span>
          <span class="na">mountPath</span><span class="pi">:</span> <span class="s">/.config</span>
        <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">shm</span>
          <span class="na">mountPath</span><span class="pi">:</span> <span class="s">/dev/shm</span>
        <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">torch-compile-cache</span>
          <span class="na">mountPath</span><span class="pi">:</span> <span class="s">/.cache</span>
      <span class="na">startupProbe</span><span class="pi">:</span>
        <span class="na">httpGet</span><span class="pi">:</span>
          <span class="na">path</span><span class="pi">:</span> <span class="s">/v1/models</span>
          <span class="na">port</span><span class="pi">:</span> <span class="s">vllm</span>
        <span class="na">initialDelaySeconds</span><span class="pi">:</span> <span class="m">15</span>
        <span class="na">periodSeconds</span><span class="pi">:</span> <span class="m">30</span>
        <span class="na">timeoutSeconds</span><span class="pi">:</span> <span class="m">5</span>
        <span class="na">failureThreshold</span><span class="pi">:</span> <span class="m">120</span>
      <span class="na">livenessProbe</span><span class="pi">:</span>
        <span class="na">httpGet</span><span class="pi">:</span>
          <span class="na">path</span><span class="pi">:</span> <span class="s">/health</span>
          <span class="na">port</span><span class="pi">:</span> <span class="s">vllm</span>
        <span class="na">periodSeconds</span><span class="pi">:</span> <span class="m">10</span>
        <span class="na">timeoutSeconds</span><span class="pi">:</span> <span class="m">5</span>
        <span class="na">failureThreshold</span><span class="pi">:</span> <span class="m">3</span>
      <span class="na">readinessProbe</span><span class="pi">:</span>
        <span class="na">httpGet</span><span class="pi">:</span>
          <span class="na">path</span><span class="pi">:</span> <span class="s">/v1/models</span>
          <span class="na">port</span><span class="pi">:</span> <span class="s">vllm</span>
        <span class="na">periodSeconds</span><span class="pi">:</span> <span class="m">5</span>
        <span class="na">timeoutSeconds</span><span class="pi">:</span> <span class="m">2</span>
        <span class="na">failureThreshold</span><span class="pi">:</span> <span class="m">3</span>
  <span class="na">volumes</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">metrics-volume</span>
      <span class="na">emptyDir</span><span class="pi">:</span> <span class="pi">{}</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">torch-compile-cache</span>
      <span class="na">emptyDir</span><span class="pi">:</span> <span class="pi">{}</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">shm</span>
      <span class="na">emptyDir</span><span class="pi">:</span>
        <span class="na">medium</span><span class="pi">:</span> <span class="s">Memory</span>
        <span class="na">sizeLimit</span><span class="pi">:</span> <span class="s">20Gi</span>

<span class="na">prefill</span><span class="pi">:</span>
  <span class="na">create</span><span class="pi">:</span> <span class="no">false</span>
</code></pre></div></div>

<p>Start small. Prove the stack works. Scale the model up later.</p>

<hr />

<h2 id="gotcha-5-llm-d-cuda-is-x86-only">Gotcha 5: <code class="language-plaintext highlighter-rouge">llm-d-cuda</code> Is x86-Only</h2>

<p>The default values.yaml uses <code class="language-plaintext highlighter-rouge">ghcr.io/llm-d/llm-d-cuda:v0.6.0</code> as the vLLM container image. This image was built for x86_64 (amd64). The GH200 is ARM64 (aarch64).</p>

<p>When Kubernetes tries to run an x86 image on an ARM64 node, it doesn’t fail with “wrong architecture.” It fails with a Triton compiler error deep inside the container startup, several minutes after the pod appears to be Running:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Triton compilation failed: RuntimeError: ...
CUDA error: device-side assert triggered
</code></pre></div></div>

<p>The pod crashes. CrashLoopBackOff. <code class="language-plaintext highlighter-rouge">kubectl describe pod</code> shows the image pulled successfully. The logs look like GPU memory issues. You spend time adjusting <code class="language-plaintext highlighter-rouge">--gpu-memory-utilization</code> and redeploying — none of which helps, because the problem is binary architecture, not memory configuration.</p>

<p><strong>The fix:</strong> <code class="language-plaintext highlighter-rouge">vllm/vllm-openai:latest</code> is multi-arch and includes a proper ARM64 build.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Change this:</span>
<span class="na">image</span><span class="pi">:</span> <span class="s">ghcr.io/llm-d/llm-d-cuda:v0.6.0</span>

<span class="c1"># To this:</span>
<span class="na">image</span><span class="pi">:</span> <span class="s">vllm/vllm-openai:latest</span>
</code></pre></div></div>

<p>One line. Saves two hours.</p>
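<p>You can catch this class of mismatch up front by comparing the node architecture with what the image manifest advertises. One wrinkle: <code class="language-plaintext highlighter-rouge">uname -m</code> says <code class="language-plaintext highlighter-rouge">aarch64</code>, while OCI manifests say <code class="language-plaintext highlighter-rouge">arm64</code>. A small helper (mine, not part of llm-d) bridges the naming:</p>

```shell
# Translate kernel architecture names (from `uname -m`) into the OCI
# labels used in image manifests, so node and image can be compared.
to_oci_arch() {
  case "$1" in
    x86_64)  echo amd64 ;;
    aarch64) echo arm64 ;;
    *)       echo "$1"  ;;
  esac
}

to_oci_arch "$(uname -m)"
# Compare against the image's advertised platforms, e.g.:
#   docker manifest inspect vllm/vllm-openai:latest | grep '"architecture"'
# If your node's value is missing from that list, the image cannot run here.
```

<p>If the architectures the manifest reports don’t include the value this prints on your node, the pod will crash no matter how you tune memory flags.</p>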

<hr />

<h2 id="gotcha-6-gateway-api-crds-dont-come-with-the-cluster">Gotcha 6: Gateway API CRDs Don’t Come With the Cluster</h2>

<p>llm-d uses the Kubernetes Gateway API — <code class="language-plaintext highlighter-rouge">HTTPRoute</code> and <code class="language-plaintext highlighter-rouge">Gateway</code> custom resources. These are not part of standard Kubernetes and do not come pre-installed with K3s.</p>

<p>When <code class="language-plaintext highlighter-rouge">helmfile apply</code> runs without these CRDs, it fails with unknown resource type errors. If you’re not familiar with the Gateway API, you’ll look at your Istio installation instead of at the missing CRDs.</p>

<p><strong>The fix:</strong> install both sets of CRDs before anything else.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Gateway API CRDs</span>
kubectl apply <span class="nt">-f</span> https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.1/standard-install.yaml

<span class="c"># llm-d InferencePool CRDs</span>
kubectl apply <span class="nt">-f</span> https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v0.3.0/manifests.yaml

<span class="c"># Verify both</span>
kubectl get crd | <span class="nb">grep</span> <span class="nt">-E</span> <span class="s2">"gateway|inference"</span>
</code></pre></div></div>

<p>Ten seconds to apply. Saves a debugging session.</p>
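<p>To turn that into a fail-fast preflight, diff the CRDs you need against the cluster's list before running <code class="language-plaintext highlighter-rouge">helmfile</code>. The helper below is my own sketch; the CRD names are the ones these two manifests install:</p>

```shell
# Print every required CRD that is absent from the supplied
# `kubectl get crd -o name` output; prints nothing when all are present.
missing_crds() {
  local have="$1"; shift
  local crd
  for crd in "$@"; do
    printf '%s\n' "$have" | grep -q "/${crd}\$" || echo "$crd"
  done
}

# Usage against a live cluster:
#   missing_crds "$(kubectl get crd -o name)" \
#     gateways.gateway.networking.k8s.io \
#     httproutes.gateway.networking.k8s.io \
#     inferencepools.inference.networking.x-k8s.io
```

<p>Run it at the top of your deploy script and bail out with a readable message instead of letting helmfile die on unknown resource types.</p>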

<hr />

<h2 id="gotcha-7-use-helm-template--kubectl-apply--k8s-debugging-101">Gotcha 7: Use <code class="language-plaintext highlighter-rouge">helm template | kubectl apply</code> — K8s Debugging 101</h2>

<p><code class="language-plaintext highlighter-rouge">helmfile apply</code> is the documented deployment path. It’s also the one that, on K3s ARM64, would silently deploy only part of the release — some resources applied, others skipped, exit code 0.</p>

<p>This is actually a universal Helm debugging pattern worth internalising regardless of llm-d: <strong>render the chart to YAML first, then apply</strong>. Helmfile adds an abstraction layer that can obscure what’s actually being sent to the API server. <code class="language-plaintext highlighter-rouge">helm template</code> removes that layer and gives you full visibility.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Step 1: Render to YAML and inspect — no cluster changes</span>
<span class="nv">$HELM_BIN</span> template ms-inference-scheduling <span class="se">\</span>
  llm-d-modelservice/llm-d-modelservice <span class="se">\</span>
  <span class="nt">--namespace</span> llm-d <span class="se">\</span>
  <span class="nt">--values</span> ms-inference-scheduling/values.yaml <span class="se">\</span>
  | less

<span class="c"># Step 2: Client-side dry-run — catches malformed YAML and missing CRDs, no cluster changes</span>
<span class="nv">$HELM_BIN</span> template ms-inference-scheduling <span class="se">\</span>
  llm-d-modelservice/llm-d-modelservice <span class="se">\</span>
  <span class="nt">--namespace</span> llm-d <span class="se">\</span>
  <span class="nt">--values</span> ms-inference-scheduling/values.yaml <span class="se">\</span>
  | kubectl apply <span class="nt">-n</span> llm-d <span class="nt">-f</span> - <span class="nt">--dry-run</span><span class="o">=</span>client

<span class="c"># Step 3: Apply for real</span>
<span class="nv">$HELM_BIN</span> template ms-inference-scheduling <span class="se">\</span>
  llm-d-modelservice/llm-d-modelservice <span class="se">\</span>
  <span class="nt">--namespace</span> llm-d <span class="se">\</span>
  <span class="nt">--values</span> ms-inference-scheduling/values.yaml <span class="se">\</span>
  | kubectl apply <span class="nt">-n</span> llm-d <span class="nt">-f</span> -
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">--dry-run=client</code> step is the one most people skip and most people regret skipping. It checks your manifests locally and resolves every resource type against the cluster’s discovery API, so missing CRDs surface immediately as “no matches for kind” errors, and it shows exactly what would be created or updated before anything changes in the cluster. If you also want the server’s full schema validation and admission webhooks, go one step further with <code class="language-plaintext highlighter-rouge">--dry-run=server</code>. Whenever a Helm deployment behaves unexpectedly, this three-step render-validate-apply pattern is where to start debugging.</p>

<hr />

<h2 id="gotcha-8-the-httproute-is-not-applied-by-helmfile">Gotcha 8: The HTTPRoute Is Not Applied by helmfile</h2>

<p>After all pods are running, you port-forward the gateway and send a test request. You get nothing back.</p>

<p>The reason: the HTTPRoute — the rule that tells the gateway to forward traffic to the EPP — is not deployed by helmfile. It lives in a separate YAML file and must be applied manually.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Check first:</span>
kubectl get httproute <span class="nt">-n</span> llm-d
<span class="c"># No resources found in llm-d namespace.</span>

<span class="c"># Apply:</span>
kubectl apply <span class="nt">-f</span> ~/llm-d/llm-d/guides/inference-scheduling/httproute.yaml <span class="nt">-n</span> llm-d

<span class="c"># Verify:</span>
kubectl get httproute <span class="nt">-n</span> llm-d
<span class="c"># NAME                         AGE</span>
<span class="c"># llm-d-inference-scheduling   6s</span>
</code></pre></div></div>

<p>Without this, the gateway has no routing rules. Every request returns empty or times out. No errors appear in pod logs because — from the gateway’s perspective — nothing is wrong. There is just no route configured.</p>
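<p>For orientation, the route has roughly this shape: a <code class="language-plaintext highlighter-rouge">parentRef</code> pointing at the gateway and a <code class="language-plaintext highlighter-rouge">backendRef</code> pointing at the EPP’s InferencePool. The names below are illustrative; apply the guide’s <code class="language-plaintext highlighter-rouge">httproute.yaml</code> as-is rather than hand-writing this:</p>

```yaml
# Illustrative sketch only; apply the httproute.yaml shipped with the guide.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-d-inference-scheduling
spec:
  parentRefs:
    - name: infra-inference-scheduling-inference-gateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: gaie-inference-scheduling
```

<p>The unusual part compared to a vanilla HTTPRoute is the backend: it targets an <code class="language-plaintext highlighter-rouge">InferencePool</code> rather than a Service, which is what hands routing decisions to the EPP.</p>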

<hr />

<h2 id="gotcha-9-immutable-selector-labels-mean-you-cant-upgrade-in-place">Gotcha 9: Immutable Selector Labels Mean You Can’t Upgrade In-Place</h2>

<p>If you change the model name in <code class="language-plaintext highlighter-rouge">values.yaml</code> after deploying, <code class="language-plaintext highlighter-rouge">helm upgrade</code> will fail:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Error: UPGRADE FAILED: cannot patch "ms-inference-scheduling-llm-d-modelservice-decode"
with kind Deployment: Deployment.apps is invalid:
spec.selector: Invalid value: ... field is immutable
</code></pre></div></div>

<p>Kubernetes Deployment selector labels are immutable. Changing the model name — which is part of the label selector — requires deleting the Deployment first.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl delete deployment <span class="nt">-n</span> llm-d <span class="se">\</span>
  ms-inference-scheduling-llm-d-modelservice-decode 2&gt;/dev/null <span class="o">||</span> <span class="nb">true</span>

<span class="nv">$HELM_BIN</span> template ms-inference-scheduling <span class="se">\</span>
  llm-d-modelservice/llm-d-modelservice <span class="se">\</span>
  <span class="nt">--namespace</span> llm-d <span class="se">\</span>
  <span class="nt">--values</span> ms-inference-scheduling/values.yaml <span class="se">\</span>
  | kubectl apply <span class="nt">-n</span> llm-d <span class="nt">-f</span> -
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">|| true</code> prevents failure if the deployment doesn’t exist yet — useful for idempotent scripts.</p>

<hr />

<h2 id="gotcha-10-port-forward-processes-die-silently-and-dont-tell-you">Gotcha 10: Port-Forward Processes Die Silently and Don’t Tell You</h2>

<p>Access to the llm-d gateway from your Mac goes through an SSH tunnel plus <code class="language-plaintext highlighter-rouge">kubectl port-forward</code>. When the port-forward dies — session timeout, network hiccup — the TCP port on your Mac stays bound. Next tunnel attempt:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Unable to listen on port 8080:
[unable to create listener: Error listen tcp4 127.0.0.1:8080: bind: address already in use]
</code></pre></div></div>

<p><strong>The fix:</strong> kill the zombie first, then verify the fresh tunnel is alive before benchmarking.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Kill whatever holds port 8080</span>
lsof <span class="nt">-ti</span> :8080 | xargs <span class="nb">kill</span> <span class="nt">-9</span> 2&gt;/dev/null <span class="o">||</span> <span class="nb">true</span>

<span class="c"># Fresh tunnel</span>
ssh <span class="nt">-L</span> 8080:localhost:8080 ubuntu@&lt;LAMBDA_IP&gt; <span class="se">\</span>
  <span class="s2">"KUBECONFIG=/home/ubuntu/.kube/config kubectl port-forward </span><span class="se">\</span><span class="s2">
   -n llm-d svc/infra-inference-scheduling-inference-gateway-istio 8080:80"</span>

<span class="c"># Verify before doing anything else</span>
curl <span class="nt">-s</span> http://localhost:8080/v1/models | jq .data[0].id
<span class="c"># "Qwen/Qwen3-0.6B"  ← tunnel alive</span>
<span class="c"># (nothing / timeout) ← restart the tunnel</span>
</code></pre></div></div>

<p>Running Locust against a dead port-forward gives you 0 completed requests and makes you think the model is broken. Always verify first.</p>
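<p>To make that verification unskippable, I poll the endpoint with a timeout before every benchmark run. The helper is my own, not part of llm-d:</p>

```shell
# Poll a URL until it answers or the attempt budget runs out.
# Returns 0 as soon as the endpoint responds, 1 on timeout.
wait_for_url() {
  local url="$1" tries="${2:-30}" i
  for i in $(seq "$tries"); do
    curl -sf --max-time 2 "$url" >/dev/null && return 0
    sleep 1
  done
  return 1
}

wait_for_url http://localhost:8080/v1/models 3 \
  || echo "tunnel dead; restart the port-forward before benchmarking"
```

<p>Gate the Locust invocation on this returning 0 and the zero-requests failure mode disappears.</p>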

<hr />

<h2 id="what-running-looks-like">What Running Looks Like</h2>

<p>After all of the above — correct kubeconfig, real Helm binary, environment variables in <code class="language-plaintext highlighter-rouge">.bashrc</code>, small model, right image, Gateway API CRDs installed, HTTPRoute applied, Deployment deleted and recreated when needed — this is what a healthy deployment looks like:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl get pods <span class="nt">-n</span> llm-d

NAME                                                              READY   STATUS    RESTARTS   AGE
gaie-inference-scheduling-epp-584f797cc8-4gvw8                    1/1     Running   0          68m
infra-inference-scheduling-inference-gateway-istio-7c5546dr7kd2   1/1     Running   0          68m
ms-inference-scheduling-llm-d-modelservice-decode-57678587gqzlt   1/1     Running   0          10m
</code></pre></div></div>

<p>Three pods, all <code class="language-plaintext highlighter-rouge">1/1 Running</code>. The decode pod may restart once while the model downloads — that’s normal. Zero restarts on EPP and gateway.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Gateway is programmed and routing</span>
kubectl get gateway <span class="nt">-n</span> llm-d
<span class="c"># infra-inference-scheduling-inference-gateway   istio   True   68m</span>

<span class="c"># A request through the gateway returns a response</span>
curl <span class="nt">-s</span> http://localhost:8080/v1/chat/completions <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"Content-Type: application/json"</span> <span class="se">\</span>
  <span class="nt">-d</span> <span class="s1">'{"model":"Qwen/Qwen3-0.6B",
       "messages":[{"role":"user","content":"What is KV cache?"}],
       "max_tokens":50}'</span> | jq .choices[0].message.content
<span class="c"># "&lt;think&gt;\nOkay, so I need to figure out what KV cache is..."</span>
</code></pre></div></div>

<p>That response, after everything above, feels like a small engineering miracle.</p>

<hr />

<h2 id="the-observability-stack">The Observability Stack</h2>

<p>The kube-prometheus-stack installs Grafana with 33 dashboards. After a fresh deploy, confirm everything is healthy before running experiments:</p>

<figure style="max-width:900px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/grafana-overview-llmd.png" alt="Grafana Overview — 0 alerts firing, 33 dashboards loaded, Grafana v12.4.2 healthy" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    Grafana Overview after successful llm-d deployment: 0 alerts firing, 33 dashboards loaded, API server responding.
    The two dashboards that matter for experiments: <strong>llm-d Performance Dashboard</strong> (TTFT, ITL, KV cache hit rate)
    and <strong>llm-d vLLM Overview</strong> (E2E latency, scheduler state, prefill vs decode time split).
  </figcaption>
</figure>

<p>If the llm-d dashboards show “No data”, check that the PodMonitor was created:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl get podmonitor <span class="nt">-n</span> llm-d
<span class="c"># If missing, ensure values.yaml has monitoring.podmonitor.enabled: true</span>
</code></pre></div></div>

<hr />

<h2 id="the-survival-checklist">The Survival Checklist</h2>

<p>First commands of every session — before touching anything else:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># 1. Environment variables present</span>
<span class="nb">echo</span> <span class="nv">$KUBECONFIG</span> <span class="nv">$HELM_BIN</span> <span class="nv">$NAMESPACE</span>
<span class="c"># /home/ubuntu/.kube/config /usr/local/bin/helm llm-d</span>

<span class="c"># 2. All pods running</span>
kubectl get pods <span class="nt">-n</span> llm-d
<span class="c"># All 3 pods: 1/1 Running</span>

<span class="c"># 3. HTTPRoute exists</span>
kubectl get httproute <span class="nt">-n</span> llm-d

<span class="c"># 4. Port-forward alive</span>
curl <span class="nt">-s</span> http://localhost:8080/v1/models | jq .data[0].id
<span class="c"># "Qwen/Qwen3-0.6B"</span>

<span class="c"># 5. Grafana accessible</span>
curl <span class="nt">-s</span> http://localhost:3000/api/health | jq .database
<span class="c"># "ok"</span>
</code></pre></div></div>

<p>If any of these fail, fix it before proceeding.</p>
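<p>Step 1 is also easy to script. A small guard (a hypothetical helper, not part of llm-d) that refuses to continue while any required variable is unset:</p>

```shell
# Report every variable name that is unset or empty; exit status 1 if any.
require_env() {
  local status=0 v val
  for v in "$@"; do
    eval "val=\${$v:-}"
    if [ -z "$val" ]; then
      echo "unset: $v" >&2
      status=1
    fi
  done
  return $status
}

require_env KUBECONFIG HELM_BIN NAMESPACE \
  || echo "fix the environment before touching the cluster"
```

<p>Drop it at the top of every session script; it turns the silent wrong-kubeconfig failure mode into a loud one.</p>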

<hr />

<h2 id="the-full-deployment-sequence">The Full Deployment Sequence</h2>

<p>For anyone who wants the complete working sequence without the narrative:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># ── Fix environment ─────────────────────────────────────────────────</span>
<span class="nb">mkdir</span> <span class="nt">-p</span> <span class="nv">$HOME</span>/.kube
<span class="nb">sudo cp</span> /etc/rancher/k3s/k3s.yaml <span class="nv">$HOME</span>/.kube/config
<span class="nb">sudo chown </span>ubuntu:ubuntu <span class="nv">$HOME</span>/.kube/config
<span class="nb">echo</span> <span class="s1">'export KUBECONFIG=$HOME/.kube/config'</span> <span class="o">&gt;&gt;</span> ~/.bashrc
<span class="nb">echo</span> <span class="s1">'export HELM_BIN=/usr/local/bin/helm'</span> <span class="o">&gt;&gt;</span> ~/.bashrc
<span class="nb">echo</span> <span class="s1">'export NAMESPACE=llm-d'</span> <span class="o">&gt;&gt;</span> ~/.bashrc
<span class="nb">source</span> ~/.bashrc

<span class="c"># ── Install real Helm (ARM64) ───────────────────────────────────────</span>
<span class="nb">sudo </span>snap remove helm
wget https://get.helm.sh/helm-v3.19.0-linux-arm64.tar.gz
<span class="nb">tar </span>xzf helm-v3.19.0-linux-arm64.tar.gz
<span class="nb">sudo mv </span>linux-arm64/helm /usr/local/bin/helm

<span class="c"># ── Install CRDs ────────────────────────────────────────────────────</span>
kubectl apply <span class="nt">-f</span> https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.1/standard-install.yaml
kubectl apply <span class="nt">-f</span> https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v0.3.0/manifests.yaml

<span class="c"># ── Repos and HuggingFace secret ────────────────────────────────────</span>
git clone https://github.com/llm-d/llm-d.git
<span class="nv">$HELM_BIN</span> repo add llm-d-modelservice <span class="se">\</span>
  https://llm-d-incubation.github.io/llm-d-modelservice/
<span class="nv">$HELM_BIN</span> repo update

kubectl create namespace <span class="nv">$NAMESPACE</span>
kubectl create secret generic llm-d-hf-token <span class="se">\</span>
  <span class="nt">--from-literal</span><span class="o">=</span><span class="s2">"HF_TOKEN=</span><span class="k">${</span><span class="nv">HF_TOKEN</span><span class="k">}</span><span class="s2">"</span> <span class="se">\</span>
  <span class="nt">--namespace</span> <span class="s2">"</span><span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span><span class="s2">"</span> <span class="se">\</span>
  <span class="nt">--dry-run</span><span class="o">=</span>client <span class="nt">-o</span> yaml | kubectl apply <span class="nt">-f</span> -

<span class="c"># ── Deploy infra (EPP + gateway) ────────────────────────────────────</span>
<span class="nb">cd</span> ~/llm-d/llm-d/guides/inference-scheduling
<span class="nv">HELM_BIN</span><span class="o">=</span><span class="nv">$HELM_BIN</span> helmfile apply <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span>

<span class="c"># ── Deploy modelservice (reliable pattern) ──────────────────────────</span>
<span class="nv">$HELM_BIN</span> template ms-inference-scheduling <span class="se">\</span>
  llm-d-modelservice/llm-d-modelservice <span class="se">\</span>
  <span class="nt">--namespace</span> llm-d <span class="se">\</span>
  <span class="nt">--values</span> ms-inference-scheduling/values.yaml <span class="se">\</span>
  | kubectl apply <span class="nt">-n</span> llm-d <span class="nt">-f</span> -

<span class="c"># ── HTTPRoute (always manual) ───────────────────────────────────────</span>
kubectl apply <span class="nt">-f</span> httproute.yaml <span class="nt">-n</span> llm-d

<span class="c"># ── Verify ──────────────────────────────────────────────────────────</span>
kubectl get pods <span class="nt">-n</span> llm-d
curl <span class="nt">-s</span> http://localhost:8080/v1/models | jq <span class="nb">.</span>
</code></pre></div></div>

<p>Every step is here because I needed it.</p>

<hr />

<h2 id="what-this-unlocks">What This Unlocks</h2>

<p>With the stack running, the EPP is making routing decisions on every request — consulting prefix cache, queue depth, and KV cache utilization scorers on each incoming call. You just can’t see it yet with a single decode pod and no load.</p>

<p>The next post in this series covers what happens when you actually put the system under load — EPP prefix cache routing in action, KV cache hit rate climbing to <strong>81.1%</strong> in Grafana, TTFT stabilising at <strong>15ms p50</strong> under sustained concurrent traffic, and the Locust results from a system that’s routing intelligently rather than guessing.</p>

<p>The deployment pain was worth it. The numbers make that clear.</p>

<hr />

<p><em>Deployed on Lambda Labs GH200 480GB, K3s, llm-d v0.4.0, Qwen3-0.6B, vllm/vllm-openai:latest. Scripts will be made available soon via a GitHub repository. Platform engineer with 11+ years in distributed systems going deep on LLM serving infrastructure.</em></p>

<p><em><a href="https://github.com/kraghavan">GitHub</a> · <a href="https://linkedin.com/in/karthikaraghavan">LinkedIn</a></em></p>]]></content><author><name>Karthika Raghavan</name></author><category term="llm-infrastructure" /><category term="inference" /><category term="llm-d" /><category term="kubernetes" /><category term="k3s" /><category term="helm" /><category term="vllm" /><category term="gpu" /><category term="lambda-labs" /><category term="gh200" /><category term="arm64" /><category term="deployment" /><category term="sre" /><summary type="html"><![CDATA[I deployed llm-d on a Lambda Labs GH200. Nothing worked first try. Here is the honest account of what broke, why, and how to fix it — so you don't spend your GPU budget finding out the hard way.]]></summary></entry><entry><title type="html">Treating the M4 Mac Mini Like a Production Inference Server (It Tried)</title><link href="https://kraghavan.github.io/llm-infrastructure/inference/2026/04/16/vllm-ollama-apple-silicon-experiment2.html" rel="alternate" type="text/html" title="Treating the M4 Mac Mini Like a Production Inference Server (It Tried)" /><published>2026-04-16T00:00:00+00:00</published><updated>2026-04-16T00:00:00+00:00</updated><id>https://kraghavan.github.io/llm-infrastructure/inference/2026/04/16/vllm-ollama-apple-silicon-experiment2</id><content type="html" xml:base="https://kraghavan.github.io/llm-infrastructure/inference/2026/04/16/vllm-ollama-apple-silicon-experiment2.html"><![CDATA[<p>I spent a week running inference experiments on my M4 Mac Mini — not to build a product, but to understand what actually happens inside an LLM serving stack. TTFT. TPOT. KV cache. Continuous batching. These are words I had read in papers and blog posts. This week I measured them. Here is what I found.</p>

<p>This post is the hands-on companion to <a href="/llm-infrastructure/inference/2026/04/14/re-introduction-to-inference.html">Post 1</a>. Same mental model — but now with real hardware, real Grafana dashboards, and numbers you can reproduce yourself.</p>

<p><strong>Hardware:</strong> M4 Mac Mini, 16GB unified memory<br />
<strong>Model:</strong> <code class="language-plaintext highlighter-rouge">mlx-community/Qwen3-0.6B-4bit</code> (~400MB, 4-bit quantized)<br />
<strong>Inference engine:</strong> vllm-metal 0.13.0 (official vllm-project Apple Silicon plugin)<br />
<strong>Observability:</strong> Prometheus + Grafana via Docker Compose<br />
<strong>Load testing:</strong> vegeta + Locust<br />
<strong>Gateway experiment:</strong> kind cluster + nginx reverse proxy</p>

<figure style="max-width:800px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/setup-architecture.png" alt="Local inference setup — vllm-metal serving Qwen3-0.6B-4bit, Prometheus and Grafana via Docker Compose, Locust and vegeta for load testing, nginx on kind as a K8s gateway experiment" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    The full local setup: vllm-metal serving Qwen3-0.6B-4bit on M4 Mac Mini unified memory, 
    Prometheus scraping <code>/metrics</code> every 5s, Grafana dashboards rendering live, 
    Locust and vegeta generating load. The kind cluster with nginx is the K8s gateway 
    experiment — simulating the Envoy position in a production llm-d deployment.
  </figcaption>
</figure>

<hr />

<h2 id="why-apple-silicon-for-this-experiment">Why Apple Silicon for This Experiment?</h2>

<p>The M4 Mac Mini has 16GB of unified memory — shared between CPU and GPU. This is both a strength and a constraint for inference.</p>

<p><strong>Strength:</strong> Unified memory means the model weights aren’t copied between CPU RAM and a discrete GPU’s VRAM. Zero-copy tensor operations. For a small quantized model, this is genuinely fast.</p>

<p><strong>Constraint:</strong> 16GB is 16GB. The model, the KV cache, the OS, and every other process share the same pool. Under concurrent load, you will find the ceiling.</p>

<p>The more interesting reason: I wanted to validate that the mental model from Post 1 — prefill is compute-bound, decode is memory-bound, KV cache is the critical resource — holds on real hardware, not just in theory. It does. With caveats.</p>

<hr />

<h2 id="the-setup">The Setup</h2>

<h3 id="why-vllm-metal-not-ollama">Why vllm-metal, Not Ollama?</h3>

<p>Before any benchmarks: I used both, and the choice matters more than I expected.</p>

<p><strong>vllm-metal</strong> is the official vllm-project Apple Silicon plugin. It runs Metal GPU kernels via MLX, exposes a full OpenAI-compatible API, and critically — emits Prometheus metrics out of the box. Same codebase as cloud vLLM, same API surface, same metric names. What you learn here transfers directly to a GPU cluster.</p>

<p><strong>Ollama</strong> is simpler to install and great for single-user local use. But it doesn’t expose Prometheus metrics natively, and — as the benchmark section will show — it doesn’t implement continuous batching. Under concurrent load, that difference is not subtle.</p>

<p>One gotcha worth flagging: <code class="language-plaintext highlighter-rouge">vllm-mlx</code> (different from vllm-metal) is a third-party wrapper that was broken with <code class="language-plaintext highlighter-rouge">mlx-lm&gt;=0.31.0</code> as of April 2026. Use <code class="language-plaintext highlighter-rouge">vllm-metal</code> — the official plugin — and avoid that detour.</p>

<p><strong>A note on scope:</strong> this post covers vLLM and Ollama only. TGI, TensorRT-LLM, ExLlamaV2, and SGLang are all on the list — but running meaningful benchmarks against each requires dedicated GPU time, and GPU time costs money I’m not spending during a job search. I’ll get to them when the experiments justify the cost. For now, the vLLM vs Ollama comparison is grounded in real numbers on real hardware, and that’s the comparison worth making.</p>

<h3 id="observability-stack">Observability Stack</h3>

<p>Prometheus and Grafana running via Docker Compose, with <code class="language-plaintext highlighter-rouge">host.docker.internal</code> resolving to the Mac’s loopback so the containers can scrape vLLM’s metrics endpoint:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># docker-compose.yml</span>
<span class="na">services</span><span class="pi">:</span>
  <span class="na">prometheus</span><span class="pi">:</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">prom/prometheus:latest</span>
    <span class="na">extra_hosts</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s2">"</span><span class="s">host.docker.internal:host-gateway"</span>
    <span class="na">ports</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s2">"</span><span class="s">9090:9090"</span>
    <span class="na">volumes</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">./prometheus.yml:/etc/prometheus/prometheus.yml</span>

  <span class="na">grafana</span><span class="pi">:</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">grafana/grafana:latest</span>
    <span class="na">ports</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s2">"</span><span class="s">3000:3000"</span>
    <span class="na">environment</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">GF_AUTH_ANONYMOUS_ENABLED=true</span>
      <span class="pi">-</span> <span class="s">GF_AUTH_ANONYMOUS_ORG_ROLE=Admin</span>
</code></pre></div></div>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># prometheus.yml</span>
<span class="na">global</span><span class="pi">:</span>
  <span class="na">scrape_interval</span><span class="pi">:</span> <span class="s">5s</span>

<span class="na">scrape_configs</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">job_name</span><span class="pi">:</span> <span class="s">vllm</span>
    <span class="na">static_configs</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">targets</span><span class="pi">:</span>
          <span class="pi">-</span> <span class="s1">'</span><span class="s">host.docker.internal:8000'</span>
</code></pre></div></div>

<p><strong>Metric name gotcha:</strong> the official vLLM Grafana dashboard references <code class="language-plaintext highlighter-rouge">gpu_cache_usage_perc</code> but vllm-metal exposes <code class="language-plaintext highlighter-rouge">vllm:kv_cache_usage_perc</code>. If your KV cache panel shows “No data”, confirm the correct metric name directly:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl localhost:8000/metrics | <span class="nb">grep </span>cache
</code></pre></div></div>
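<p>A quick way to make that check scriptable: scan the raw <code>/metrics</code> text for whichever of the two names this build exposes. A minimal sketch (the candidate list is just the two names above, and <code>find_cache_metric</code> is a hypothetical helper, not vLLM API):</p>

```python
# Candidate KV-cache gauge names: vllm-metal's actual name first,
# then the stale name the official dashboard references.
CANDIDATES = ("vllm:kv_cache_usage_perc", "gpu_cache_usage_perc")

def find_cache_metric(metrics_text):
    """Return the first candidate present in a Prometheus /metrics dump."""
    for name in CANDIDATES:
        for line in metrics_text.splitlines():
            # startswith also matches labeled series like name{engine="0"}
            if line.startswith(name):
                return name
    return None
```

<p>Point it at the output of <code>curl localhost:8000/metrics</code> and wire the returned name into the Grafana panel query.</p>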

<h3 id="the-kind-cluster-and-nginx-gateway">The kind Cluster and nginx Gateway</h3>

<p><strong>Why this matters:</strong> In production with llm-d, requests flow through an Envoy gateway pod before reaching vLLM pods. I wanted to replicate that topology locally — understand the gateway layer before Week 2 introduced it with real routing logic.</p>

<p>On Apple Silicon, Metal GPU cannot be passed into Docker containers. So vllm-metal has to run natively on the Mac host. The resulting topology is deliberately artificial:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>── LOCAL SETUP (Apple Silicon constraint) ───────────────────────────

curl localhost:9000
     │
     ▼  host port 9000 → kind NodePort 30000
nginx pod :80        ← simulates Envoy gateway position in llm-d
     │
     ▼  proxy_pass http://172.19.0.1:8000
vllm-metal           ← running natively on Mac host (Metal GPU)

── PRODUCTION (Week 2 — real GPU) ──────────────────────────────────

curl gateway:80
     │
     ▼
Envoy gateway pod    ← EPP does KV-cache-aware routing here
     │
     ▼
vLLM pod             ← FastAPI server + GPU inside the pod
</code></pre></div></div>

<p><strong>Apple Silicon gotcha:</strong> <code class="language-plaintext highlighter-rouge">host.docker.internal</code> inside kind on M4 Mac resolves to IPv6, but vllm-metal only binds to IPv4. The nginx <code class="language-plaintext highlighter-rouge">proxy_pass</code> fails silently — you’ll see <code class="language-plaintext highlighter-rouge">connect() failed (101: Network unreachable)</code> in the logs. Fix: get the actual bridge gateway IP:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">HOST_IP</span><span class="o">=</span><span class="si">$(</span>docker inspect vllm-lab-control-plane <span class="se">\</span>
  <span class="nt">--format</span> <span class="s1">'{{range .NetworkSettings.Networks}}{{.Gateway}}{{end}}'</span><span class="si">)</span>
<span class="nb">echo</span> <span class="nv">$HOST_IP</span>
<span class="c"># 172.19.0.1  ← use this in proxy_pass, not host.docker.internal</span>
</code></pre></div></div>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># kind-config.yaml</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">Cluster</span>
<span class="na">apiVersion</span><span class="pi">:</span> <span class="s">kind.x-k8s.io/v1alpha4</span>
<span class="na">nodes</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">role</span><span class="pi">:</span> <span class="s">control-plane</span>
    <span class="na">extraPortMappings</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">containerPort</span><span class="pi">:</span> <span class="m">30000</span>
        <span class="na">hostPort</span><span class="pi">:</span> <span class="m">9000</span>
</code></pre></div></div>

<figure style="max-width:720px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/LLM-Inference-Network-Topology.png" alt="Local vs production inference topology — nginx on kind vs Envoy in llm-d" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">Left: local Apple Silicon setup — nginx in kind proxying to vllm-metal on the Mac host. Right: production llm-d topology — Envoy gateway routing to vLLM pods with direct GPU access. The gateway position is identical; only the backend location differs.</figcaption>
</figure>

<hr />

<h2 id="experiment-1-baseline--ttft-and-tpot-directly-measured">Experiment 1: Baseline — TTFT and TPOT Directly Measured</h2>

<p>First question: what does a single uncontested request actually cost on this hardware?</p>

<p>I wrote a streaming latency script that timestamps the first chunk arrival (TTFT) then tracks per-token intervals (TPOT):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">measure_streaming</span><span class="p">(</span><span class="n">prompt</span><span class="p">,</span> <span class="n">max_tokens</span><span class="o">=</span><span class="mi">100</span><span class="p">):</span>
    <span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">perf_counter</span><span class="p">()</span>
    <span class="n">first_token_time</span> <span class="o">=</span> <span class="bp">None</span>
    <span class="n">token_times</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="k">with</span> <span class="n">client</span><span class="p">.</span><span class="n">stream</span><span class="p">(</span><span class="s">"POST"</span><span class="p">,</span> <span class="s">"/v1/chat/completions"</span><span class="p">,</span> <span class="n">json</span><span class="o">=</span><span class="n">payload</span><span class="p">)</span> <span class="k">as</span> <span class="n">resp</span><span class="p">:</span>
        <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">resp</span><span class="p">.</span><span class="n">iter_lines</span><span class="p">():</span>
            <span class="k">if</span> <span class="n">content</span> <span class="p">:</span><span class="o">=</span> <span class="n">parse_token</span><span class="p">(</span><span class="n">line</span><span class="p">):</span>
                <span class="n">now</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">perf_counter</span><span class="p">()</span>
                <span class="k">if</span> <span class="n">first_token_time</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
                    <span class="n">first_token_time</span> <span class="o">=</span> <span class="n">now</span>  <span class="c1"># ← TTFT captured here
</span>                <span class="n">token_times</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">now</span><span class="p">)</span>

    <span class="c1"># TPOT = average gap between consecutive tokens
</span>    <span class="n">intervals</span> <span class="o">=</span> <span class="p">[</span><span class="n">token_times</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="n">token_times</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
                 <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">token_times</span><span class="p">))]</span>
    <span class="n">avg_tpot</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">intervals</span><span class="p">)</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">intervals</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>Results:</strong></p>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Short prompt (~20 tokens)</th>
      <th>Long prompt (~500 tokens)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>TTFT</strong></td>
      <td><strong>341ms</strong></td>
      <td><strong>487ms</strong> (+42%)</td>
    </tr>
    <tr>
      <td><strong>TPOT</strong></td>
      <td><strong>3.4ms</strong></td>
      <td><strong>4.2ms</strong> (barely changed)</td>
    </tr>
    <tr>
      <td>Throughput</td>
      <td>145.8 tok/s</td>
      <td>110.2 tok/s</td>
    </tr>
  </tbody>
</table>

<p>The asymmetry is the point. TTFT increased 42% when the prompt got 25× longer — more tokens to process in parallel during prefill. But TPOT barely moved (3.4ms → 4.2ms) because decode generates one token at a time regardless of prompt length. Once the model starts generating, the input is already in the KV cache and irrelevant to decode speed.</p>

<blockquote>
  <p>TPOT is determined by model size and hardware bandwidth. TTFT is determined by how long your user made their prompt.</p>
</blockquote>
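<p>The asymmetry composes into a simple end-to-end estimate: total latency is roughly TTFT plus one TPOT gap for each output token after the first. A back-of-envelope sketch plugging in the measured numbers:</p>

```python
def total_latency_ms(ttft_ms, tpot_ms, n_output_tokens):
    # First token lands at TTFT; each later token adds one TPOT interval.
    return ttft_ms + (n_output_tokens - 1) * tpot_ms

short = total_latency_ms(341, 3.4, 100)  # short prompt: ~678 ms total
long_ = total_latency_ms(487, 4.2, 100)  # long prompt:  ~903 ms total
```

<p>For 100 output tokens, the 25× longer prompt costs only ~225ms more end to end, almost all of it in TTFT, which matches the throughput column in the table.</p>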

<hr />

<h2 id="experiment-2-prefix-cache--51-ttft-reduction-for-free">Experiment 2: Prefix Cache — 51% TTFT Reduction for Free</h2>

<p>KV cache stores computed key-value tensors for each token so they don’t need recomputing on subsequent steps. Prefix caching extends this across requests: if two requests share the same system prompt, the second reuses the KV cache blocks from the first.</p>
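<p>One mechanical detail worth knowing: vLLM manages the KV cache in fixed-size blocks, so only whole shared blocks are reused and a partial block at the boundary is recomputed. A sketch of that accounting (the 16-token block size is an assumption here; check your engine's configuration):</p>

```python
BLOCK_SIZE = 16  # assumed block size; configurable in vLLM

def reusable_blocks(cached_ids, new_ids):
    """Whole KV blocks a new request can reuse from a cached sequence."""
    shared = 0
    for a, b in zip(cached_ids, new_ids):
        if a != b:
            break
        shared += 1
    return shared // BLOCK_SIZE  # the trailing partial block is recomputed

# 200 identical prefix tokens: 12 full blocks skipped during prefill
```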

<p>I tested this with a long shared system prompt (~50 repetitions, creating ~200 tokens of identical prefix):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">SYSTEM_PROMPT</span> <span class="o">=</span> <span class="s">"You are a helpful assistant. "</span> <span class="o">*</span> <span class="mi">50</span>  <span class="c1"># long shared prefix
</span>
<span class="n">chat</span><span class="p">(</span><span class="s">"What is 2+2?"</span><span class="p">,</span>   <span class="s">"First request (cold cache)"</span><span class="p">)</span>
<span class="n">chat</span><span class="p">(</span><span class="s">"What is 3+3?"</span><span class="p">,</span>   <span class="s">"Second request (warm cache)"</span><span class="p">)</span>
<span class="n">chat</span><span class="p">(</span><span class="s">"What is 4+4?"</span><span class="p">,</span>   <span class="s">"Third request (warmer cache)"</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>Results:</strong></p>

<table>
  <thead>
    <tr>
      <th>Request</th>
      <th>TTFT</th>
      <th>Notes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>First (cold cache)</td>
      <td><strong>753ms</strong></td>
      <td>Full prefill — all tokens computed</td>
    </tr>
    <tr>
      <td>Second (warm cache)</td>
      <td><strong>367ms</strong></td>
      <td>51% reduction — prefix blocks reused</td>
    </tr>
    <tr>
      <td>Third (warmer cache)</td>
      <td><strong>364ms</strong></td>
      <td>Marginal further improvement</td>
    </tr>
    <tr>
      <td>Session cache hit ratio</td>
      <td><strong>17.7%</strong></td>
      <td>Across the full test session</td>
    </tr>
  </tbody>
</table>

<p>51% TTFT reduction. Zero model changes. No infrastructure changes. Just sending the same system prompt byte-for-byte across requests.</p>

<p>The operational implication: your production system prompt should be at the front of every request, identical every time. Anything that mutates it per-request — timestamp injection, per-user personalization in the system prompt, A/B testing different prompts — kills the cache hit rate and silently taxes every user’s TTFT.</p>
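<p>The failure mode is easy to demonstrate: the reusable prefix is measured from token zero, so a per-request value prepended to the system prompt destroys it, while the same value appended after it costs nothing. A small illustration (the prompt strings are made up for the example):</p>

```python
def shared_prefix_len(a, b):
    """Length of the identical leading run of two prompt strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

SYSTEM = "You are a helpful assistant. " * 50

# Variable content appended after the system prompt keeps it cacheable...
good = shared_prefix_len(SYSTEM + "[ts=1001] What is 2+2?",
                         SYSTEM + "[ts=1002] What is 3+3?")
# ...while a timestamp prepended to it zeroes the reusable prefix.
bad = shared_prefix_len("[ts=1001] " + SYSTEM,
                        "[ts=1002] " + SYSTEM)
```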

<p>This is also the per-request proof of what llm-d’s EPP prefix-cache scorer does at cluster scale: route requests to the decode pod that already holds relevant KV cache blocks. What I measured locally as a 51% reduction is what the EPP maximises across dozens of pods.</p>

<hr />

<h2 id="experiment-3-kv-cache-pressure-under-concurrent-load">Experiment 3: KV Cache Pressure Under Concurrent Load</h2>

<p>I fired 4 long concurrent requests simultaneously — each asking for 500 output tokens — to observe how continuous batching handled them:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># kv_pressure.py
</span><span class="kn">import</span> <span class="nn">concurrent.futures</span>
<span class="kn">import</span> <span class="nn">time</span>

<span class="kn">import</span> <span class="nn">httpx</span>

<span class="n">MODEL</span> <span class="o">=</span> <span class="s">"mlx-community/Qwen3-0.6B-4bit"</span>

<span class="k">def</span> <span class="nf">send_long_request</span><span class="p">(</span><span class="n">i</span><span class="p">):</span>
    <span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">perf_counter</span><span class="p">()</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">httpx</span><span class="p">.</span><span class="n">post</span><span class="p">(</span><span class="s">"http://localhost:8000/v1/chat/completions"</span><span class="p">,</span>
        <span class="n">json</span><span class="o">=</span><span class="p">{</span><span class="s">"model"</span><span class="p">:</span> <span class="n">MODEL</span><span class="p">,</span>
              <span class="s">"messages"</span><span class="p">:</span> <span class="p">[{</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"user"</span><span class="p">,</span>
                           <span class="s">"content"</span><span class="p">:</span> <span class="sa">f</span><span class="s">"Write a very long essay about distributed systems topic </span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">. Be extremely verbose and detailed."</span><span class="p">}],</span>
              <span class="s">"max_tokens"</span><span class="p">:</span> <span class="mi">500</span><span class="p">},</span>
        <span class="n">timeout</span><span class="o">=</span><span class="mi">120</span><span class="p">)</span>
    <span class="n">elapsed</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">perf_counter</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span>
    <span class="n">tokens</span> <span class="o">=</span> <span class="n">r</span><span class="p">.</span><span class="n">json</span><span class="p">()[</span><span class="s">"usage"</span><span class="p">][</span><span class="s">"completion_tokens"</span><span class="p">]</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Request </span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">: </span><span class="si">{</span><span class="n">tokens</span><span class="si">}</span><span class="s"> tokens in </span><span class="si">{</span><span class="n">elapsed</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="si">}</span><span class="s">s (</span><span class="si">{</span><span class="n">tokens</span><span class="o">/</span><span class="n">elapsed</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="si">}</span><span class="s"> tok/s)"</span><span class="p">)</span>

<span class="k">with</span> <span class="n">concurrent</span><span class="p">.</span><span class="n">futures</span><span class="p">.</span><span class="n">ThreadPoolExecutor</span><span class="p">(</span><span class="n">max_workers</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span> <span class="k">as</span> <span class="n">ex</span><span class="p">:</span>
    <span class="n">futures</span> <span class="o">=</span> <span class="p">[</span><span class="n">ex</span><span class="p">.</span><span class="n">submit</span><span class="p">(</span><span class="n">send_long_request</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">4</span><span class="p">)]</span>
</code></pre></div></div>

<p><strong>Results:</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Request 0: 500 tokens in 24.9s (20.1 tok/s)
Request 1: 500 tokens in 24.9s (20.1 tok/s)
Request 2: 500 tokens in 24.9s (20.1 tok/s)
Request 3: 500 tokens in 24.9s (20.1 tok/s)
</code></pre></div></div>

<p>All four completed simultaneously — continuous batching working correctly. In a naive sequential server, 4 requests of this size would take roughly 4× the single-request time. Instead, vLLM batched decode steps across all four active requests in a single GPU pass per iteration.</p>
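<p>The arithmetic behind that "roughly 4×": a sequential server's wall time scales with request count, while continuous batching pays one decode step per token position for the whole batch. A sketch (it ignores the mild growth of step time with batch size):</p>

```python
def sequential_wall_s(n_requests, tokens_each, step_s):
    # Requests run back to back; each holds the GPU exclusively.
    return n_requests * tokens_each * step_s

def batched_wall_s(n_requests, tokens_each, step_s):
    # Every decode step advances all n_requests sequences at once.
    return tokens_each * step_s

# From the measured batch (500 tokens in 24.9 s), the effective step
# time is 24.9/500 s; four sequential requests would need ~99.6 s.
step = 24.9 / 500
```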

<p>During this test, <code class="language-plaintext highlighter-rouge">vllm:kv_cache_usage_perc</code> climbed to 4% in Grafana and returned to baseline when all requests completed. On a 0.6B model with these short prompts, plenty of headroom. The same pattern on a larger model pushes KV cache toward the 85% danger threshold where evictions begin and latency spikes.</p>
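<p>To see why a larger model changes the picture, estimate the KV cache cost per token: two tensors (K and V) per layer, sized by KV heads × head dimension × dtype width. A sketch with approximate Qwen3-0.6B geometry (28 layers, 8 KV heads, head_dim 128 are assumptions here; read the real values from the model's <code>config.json</code>):</p>

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    # K and V tensors, one pair per layer, per token (fp16 by default).
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Assumed Qwen3-0.6B geometry: 28 layers, 8 KV heads (GQA), head_dim 128.
per_token = kv_bytes_per_token(28, 8, 128)   # 114,688 bytes, ~112 KiB/token
tokens_per_gib = (1024 ** 3) // per_token    # ~9,300 tokens per GiB of cache
```

<p>Scale the same formula to a 70B-class model and the per-token cost grows by an order of magnitude, which is why cache utilisation that idles at 4% here races toward the eviction threshold there.</p>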

<hr />

<h2 id="experiment-4-locust-mixed-traffic--where-the-m4-hits-its-ceiling">Experiment 4: Locust Mixed Traffic — Where the M4 Hits Its Ceiling</h2>

<p>Real traffic isn’t uniform. I configured Locust to simulate a 3:1 mix of short chatbot-style prompts and long document-summarization prompts:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">InferenceUser</span><span class="p">(</span><span class="n">HttpUser</span><span class="p">):</span>
    <span class="n">wait_time</span> <span class="o">=</span> <span class="n">between</span><span class="p">(</span><span class="mf">0.5</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>

    <span class="o">@</span><span class="n">task</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">short_request</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>   <span class="c1"># 75% of traffic — "What is 2+2?" etc</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">client</span><span class="p">.</span><span class="n">post</span><span class="p">(</span><span class="s">"/v1/chat/completions"</span><span class="p">,</span> <span class="n">json</span><span class="o">=</span><span class="p">{</span>
            <span class="s">"model"</span><span class="p">:</span> <span class="n">MODEL</span><span class="p">,</span>
            <span class="s">"messages"</span><span class="p">:</span> <span class="p">[{</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"user"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="s">"What is 2+2?"</span><span class="p">}],</span>
            <span class="s">"max_tokens"</span><span class="p">:</span> <span class="mi">50</span><span class="p">})</span>

    <span class="o">@</span><span class="n">task</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">long_request</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>    <span class="c1"># 25% of traffic — "Explain transformer attention..."</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">client</span><span class="p">.</span><span class="n">post</span><span class="p">(</span><span class="s">"/v1/chat/completions"</span><span class="p">,</span> <span class="n">json</span><span class="o">=</span><span class="p">{</span>
            <span class="s">"model"</span><span class="p">:</span> <span class="n">MODEL</span><span class="p">,</span>
            <span class="s">"messages"</span><span class="p">:</span> <span class="p">[{</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"user"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="s">"Explain transformer attention in depth."</span><span class="p">}],</span>
            <span class="s">"max_tokens"</span><span class="p">:</span> <span class="mi">200</span><span class="p">})</span>
</code></pre></div></div>

<p><strong>Results at 5 concurrent users:</strong></p>

<table>
  <thead>
    <tr>
      <th>Task</th>
      <th>avg</th>
      <th>min</th>
      <th>p50</th>
      <th>max</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>short_prompt</td>
      <td>9,471ms</td>
      <td>710ms</td>
      <td><strong>1,900ms</strong></td>
      <td>23,219ms</td>
    </tr>
    <tr>
      <td>long_prompt</td>
      <td>16,830ms</td>
      <td>3,152ms</td>
      <td><strong>31,000ms</strong></td>
      <td>30,508ms</td>
    </tr>
  </tbody>
</table>

<figure style="max-width:900px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/locust-test-2.png" alt="Grafana during Locust mixed traffic — E2E latency p50/p95/p99, queue time, inter-token latency, token generation rate" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    Top-left: Token Generation — spike when the Locust test starts, trailing off as the M4 saturates under mixed load.
    Top-right: Request Generation Length heatmap — two clear clusters (short bottom, long top) confirming the 3:1 traffic mix is exercised.
    Middle-left: E2E Request Latency p50/p95/p99 — P99 reaches ~1 minute for long prompts. P50 looks acceptable but hides the long-tail pain.
    Middle-right: Queue Time — spikes to 0.25s during the burst, showing requests queuing before prefill even starts.
    Bottom-left: Inter Token Latency — stays flat at 5–10ms throughout. Decode is not the bottleneck. It is prefill queuing that is killing TTFT.
    Bottom-right: Max Generation Token in Sequence Group — peaks at ~100 tokens, showing the batch composition during the test.
  </figcaption>
</figure>

<p>The long prompt p50 of 31 seconds is not a bug — it’s the fundamental prefill/decode competition at scale. Long prompts trigger expensive prefill operations that block decode steps for all concurrent requests. Short-prompt users feel it as TTFT spikes. Long-prompt users wait in a growing queue.</p>

<p>Notice that Inter Token Latency stays flat throughout. <strong>Once a request gets GPU time for decode, it’s fine. The problem is always getting to the front of the queue.</strong> This is exactly the problem P/D disaggregation solves — dedicate separate GPUs to prefill so long prompts never preempt decode.</p>
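<p>A toy model makes the queue effect concrete: on a shared engine, every prefill queued ahead of you lands in your TTFT in full, before your own prefill even starts (the millisecond figures here are illustrative, not measured):</p>

```python
def ttft_ms(queued_prefill_ms, own_prefill_ms):
    # On one shared engine you absorb every prefill queued ahead of you
    # before paying your own; decode speed never enters the picture.
    return sum(queued_prefill_ms) + own_prefill_ms

stuck = ttft_ms([600, 600], 20)  # short prompt behind two long prefills: 1220 ms
alone = ttft_ms([], 20)          # prefill handled by a dedicated pool: 20 ms
```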

<hr />

<h2 id="experiment-5-vllm-vs-ollama--the-head-to-head">Experiment 5: vLLM vs Ollama — The Head-to-Head</h2>

<p>I benchmarked vllm-metal against Ollama using vegeta at a sustained 3 req/s over 30 seconds. The primary comparison is Ollama qwen2.5:0.5b vs vllm-metal Qwen3-0.6B-4bit — same model family, same approximate parameter count, different serving engines. Mistral 7B is included as a reference only.</p>

<figure style="max-width:720px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/LLM-Latency-Comparison.png" alt="Bar chart — Ollama vs vLLM-metal latency at p50, p90, p95. vLLM 2.15x faster at p50." style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">vLLM-metal vs Ollama at sustained 3 req/s. Same hardware, comparable model sizes. The gap is continuous batching, not raw compute.</figcaption>
</figure>

<table>
  <thead>
    <tr>
      <th>Engine</th>
      <th>Model</th>
      <th>p50</th>
      <th>p95</th>
      <th>Min</th>
      <th>Success</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Ollama</td>
      <td>mistral 7B <em>(reference — 10× larger)</em></td>
      <td>15,979ms</td>
      <td>26,896ms</td>
      <td>5,157ms</td>
      <td>100%</td>
    </tr>
    <tr>
      <td>Ollama</td>
      <td>qwen2.5:0.5b <em>(primary comparison)</em></td>
      <td>14,062ms</td>
      <td>20,012ms</td>
      <td>2,569ms</td>
      <td>100%</td>
    </tr>
    <tr>
      <td>Ollama</td>
      <td>qwen3.5 (large)</td>
      <td>timeout</td>
      <td>timeout</td>
      <td>13,039ms</td>
      <td><strong>2.2%</strong></td>
    </tr>
    <tr>
      <td><strong>vllm-metal</strong></td>
      <td><strong>Qwen3-0.6B-4bit</strong> <em>(primary comparison)</em></td>
      <td><strong>6,543ms</strong></td>
      <td><strong>10,952ms</strong></td>
      <td><strong>974ms</strong></td>
      <td><strong>100%</strong></td>
    </tr>
  </tbody>
</table>

<p><strong>vLLM at p50: 2.15× faster. At p95: 1.83× faster.</strong></p>

<p>The minimum latency of 974ms — sub-second — is the clearest signal: when the server isn’t saturated, continuous batching delivers a first response before Ollama has even started processing. Ollama’s 2,569ms minimum reflects its sequential model — each request waits for the previous one to complete before the GPU is available.</p>

<p>The qwen3.5 failure (2.2% success) is instructive. A larger model at 3 req/s causes Ollama’s queue to back up until clients time out. vllm-metal handles the same rate at 100% success because it batches multiple requests into each GPU forward pass rather than serialising them.</p>
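<p>The collapse is predictable from queueing arithmetic: a sequential server that needs <em>s</em> seconds per request drains 1/s req/s, and any higher arrival rate grows the backlog without bound. A sketch using Ollama's ~14s p50 from the table as the service time:</p>

```python
def backlog(arrival_rate_rps, service_s, duration_s):
    # A sequential server completes duration/service requests; every
    # other arrival piles up in the queue (and eventually times out).
    completed = duration_s / service_s
    return arrival_rate_rps * duration_s - completed

# 3 req/s for 30 s at ~14 s/request: 90 arrivals, ~2 completions.
```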

<figure style="max-width:900px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/locust-test-1.png" alt="Grafana — scheduler state (num running vs num waiting), token throughput, TTFT over 2 minutes, cache utilisation" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    Top-left: <code>gpu_cache_usage_perc</code> shows "No data" — this is the wrong metric name for vllm-metal. Use <code>vllm:kv_cache_usage_perc</code> instead.
    Top-right: Token Throughput — spikes to ~80 tok/s during the load test, confirming concurrent batching is active.
    Middle-left: Requests Waiting vs Running — spikes at 14:45 are the concurrent requests being admitted. Queue depth briefly reaches 5 before draining.
    Middle-right: Scheduler State — Num Running peaks at 4–5 simultaneously, all requests making progress at once.
    Bottom-left: TTFT over 2m — climbs during load, recovers as the queue clears. The correlation between queue depth and TTFT is direct.
    Bottom-right: Cache Utilisation — peaks at ~4% during load, returns to baseline after. Plenty of KV cache headroom on a 0.6B model.
  </figcaption>
</figure>

<hr />

<h2 id="experiment-6-the-nginx-k8s-gateway--what-the-proxy-layer-actually-costs">Experiment 6: The nginx K8s Gateway — What the Proxy Layer Actually Costs</h2>

<p>With the kind cluster running and nginx proxying to vllm-metal, I ran inference requests through the full gateway path. I used <code class="language-plaintext highlighter-rouge">max_tokens=20</code> — tiny inference — so the measured latency is mostly proxy traversal overhead:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vegeta attack <span class="nt">-rate</span><span class="o">=</span>3/s <span class="nt">-duration</span><span class="o">=</span>15s <span class="se">\</span>
  <span class="nt">-targets</span><span class="o">=</span>&lt;<span class="o">(</span><span class="nb">echo</span> <span class="s2">"POST http://localhost:9000/v1/chat/completions
Content-Type: application/json
@/tmp/vllm_gateway.json"</span><span class="o">)</span> | vegeta report
</code></pre></div></div>

<p><strong>Results (45 requests, 3 req/s, max_tokens=20):</strong></p>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
      <th>Notes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>p50 latency</td>
      <td><strong>232ms</strong></td>
      <td>localhost:9000 → kind → nginx → vllm-metal</td>
    </tr>
    <tr>
      <td>p95 latency</td>
      <td><strong>277ms</strong></td>
      <td> </td>
    </tr>
    <tr>
      <td>Min latency</td>
      <td><strong>163ms</strong></td>
      <td>Mostly proxy traversal + tiny inference</td>
    </tr>
    <tr>
      <td>Success rate</td>
      <td><strong>100%</strong></td>
      <td>45/45 — gateway adds no failures</td>
    </tr>
  </tbody>
</table>

<p>The 163ms minimum reflects proxy traversal cost, not inference time — <code class="language-plaintext highlighter-rouge">max_tokens=20</code> on a 0.6B model generates tokens in tens of milliseconds. The meaningful result is 100% success at sustained rate. The gateway adds overhead but is not a bottleneck and does not drop requests.</p>

<p>In production, Envoy adds single-digit milliseconds — it’s purpose-built for high-throughput proxying. The nginx simulation here adds more, but the structural lesson holds: a gateway layer in front of vLLM does not meaningfully affect inference latency. What matters in llm-d is the EPP’s routing intelligence — prefix cache scoring, queue depth scoring, KV utilisation scoring — not the proxy overhead itself.</p>

<hr />

<h2 id="what-grafana-showed--the-full-picture">What Grafana Showed — The Full Picture</h2>

<p>The dashboard below captures the most informative view — TTFT latency percentiles, the prefill vs decode time split, finish reasons, and prompt length distribution:</p>

<figure style="max-width:900px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/locust-test-3.png" alt="Grafana full overview — TTFT latency p50/p95/p99, Prefill and Decode Time separated, Finish Reason, Request Prompt Length" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">
    Top-left: Time To First Token Latency — P99 spikes to ~7s under load while P50 stays under 2s. The gap between P50 and P99 is the prefill queue effect.
    Top-right: Request Prompt Length heatmap — two clusters confirm short vs long prompt traffic mix is being exercised correctly.
    Bottom-left: Finish Reason — "length" dominates, meaning requests hit max_tokens as expected. No unexpected aborts or errors.
    Bottom-right (highlighted): Requests Prefill and Decode Time — green is prefill, yellow is decode. Prefill varies with prompt length; decode stays flat. This is the visual proof of the two-phase separation. In a P/D disaggregated deployment, these two lines come from separate pod pools.
  </figcaption>
</figure>

<p>Three takeaways from Grafana that weren’t obvious before running the experiments:</p>

<p><strong>TPOT is a red herring under load.</strong> Inter Token Latency stayed flat throughout every experiment — even when E2E latency climbed to 31 seconds for long prompts. The per-token decode speed is stable. The problem is always pre-decode queuing.</p>

<p><strong>The prefill/decode time split is visible and separable.</strong> Prefill varies with prompt length, decode stays constant. In a disaggregated setup each line would come from a different pool of pods, independently scalable.</p>

<p><strong>KV cache utilisation is your early warning system.</strong> Peak of 4% during these experiments — plenty of headroom for a 0.6B model. On a larger model or busier system, the moment this crosses 85% is your fire alarm.</p>

<hr />

<h2 id="the-connection-to-week-2">The Connection to Week 2</h2>

<p>Everything measured this week points to the same structural problem: in aggregated serving, prefill and decode compete for the same resources. Long prompts delay short ones. High-concurrency workloads cause cascading TTFT degradation that no amount of hardware scaling can fully fix — because the problem is architectural, not computational.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>── AGGREGATED SERVING (what Week 1 showed) ──────────────────────────

Request A (long prompt)  → [  prefill 487ms  ][  decode decode decode ...  ]
Request B (short prompt) → [WAITING...       ][  prefill 341ms  ][ decode  ]
Request C (short prompt) → [WAITING.....................][prefill ][ decode  ]

── P/D DISAGGREGATION (what Week 2 will fix) ────────────────────────

Prefill pool: [ A-prefill ][ B-prefill ][ C-prefill ]  ← compute-bound
Decode pool:  [ A-decode  ][ B-decode  ][ C-decode  ]  ← memory-bandwidth-bound

Decode pool never blocks on prefill. TTFT stays consistent under load.
</code></pre></div></div>

<p>The numbers from this week — TTFT 341ms vs 487ms, prefix cache 51% reduction, 31-second long-prompt p50 under load — are the baseline. Week 2 is the comparison.</p>

<hr />

<h2 id="the-complete-benchmark-reference">The Complete Benchmark Reference</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Hardware: M4 Mac Mini, 16GB unified memory
Model:    mlx-community/Qwen3-0.6B-4bit
Engine:   vllm-metal 0.13.0

─── Single request baseline ──────────────────────────────
  TTFT (short prompt, ~20 tokens):  341ms
  TTFT (long prompt, ~500 tokens):  487ms  (+42%)
  TPOT (short):                     3.4ms  (≈ 294 tok/s)
  TPOT (long):                      4.2ms  (barely changed)
  Throughput (single request):      145.8 tok/s

─── Prefix cache ─────────────────────────────────────────
  Cold TTFT (first request):        753ms
  Warm TTFT (same prefix):          367ms  (51% reduction)
  Session cache hit ratio:          17.7%

─── KV pressure (4 concurrent × 500 tokens) ─────────────
  Wall time (all 4 concurrent):     24.9s
  Per-request throughput:           20.1 tok/s
  Result: continuous batching confirmed

─── Mixed load (5 users, 3:1 short/long via Locust) ─────
  short_prompt: avg=9,471ms  min=710ms  p50=1,900ms  max=23,219ms
  long_prompt:  avg=16,830ms min=3,152ms p50=31,000ms max=30,508ms

─── Ollama vs vLLM (3 req/s, 30s, vegeta) ───────────────
  Ollama / mistral 7B p50:         15,979ms  (reference — 10× larger)
  Ollama / qwen2.5:0.5b p50:       14,062ms  (primary comparison)
  Ollama / qwen3.5 (large):        timeout   (2.2% success)
  vLLM  / Qwen3-0.6B-4bit p50:     6,543ms
  vLLM advantage at p50:            2.15× faster
  vLLM advantage at p95:            1.83× faster

─── nginx K8s gateway (max_tokens=20, 45 requests) ──────
  p50:  232ms  |  p95: 277ms  |  min: 163ms  |  Success: 100%
  Note: mostly proxy overhead — max_tokens=20 is tiny inference
</code></pre></div></div>

<hr />

<h2 id="what-this-points-to">What This Points To</h2>

<p>The Mac Mini experiments answered the questions they were designed for. TTFT/TPOT/KV cache behave exactly as the theory predicts. Continuous batching is real and measurable. The gateway layer adds overhead but doesn’t drop requests. And Ollama’s lack of continuous batching is not a footnote — it’s the difference between a useful serving system and one that falls over at 3 requests per second with a moderately sized model.</p>

<p>What the Mac Mini can’t answer: what happens when you separate prefill and decode onto dedicated hardware? What does EPP prefix-cache-aware routing look like in practice? And eventually — how do TGI, TensorRT-LLM, and SGLang compare under the same load test conditions? Those experiments need cloud GPU budget earmarked for specific Week 2 and Week 3 labs. When they happen, they’ll get their own posts with real numbers.</p>

<hr />

<p><strong>Next up:</strong> Post 3 covers llm-d deployment on a Lambda Labs GH200 — the ten things nobody tells you before you try to run a Helm-based inference scheduler on K3s, including the NIXL/RDMA failure that explains why single-GPU P/D disaggregation doesn’t work the way you’d hope.</p>

<hr />

<p><em>All experiments run on M4 Mac Mini 16GB, vllm-metal 0.13.0, Qwen3-0.6B-4bit. Scripts will be made available soon via a GitHub repository. I’m a platform engineer with 11+ years in distributed systems currently going deep on LLM serving. I write what I actually measured, including the parts that hit walls.</em></p>

<p><em><a href="https://github.com/kraghavan">GitHub</a> · <a href="https://linkedin.com/in/karthikaraghavan">LinkedIn</a></em></p>]]></content><author><name>Karthika Raghavan</name></author><category term="llm-infrastructure" /><category term="inference" /><category term="vllm" /><category term="ollama" /><category term="apple-silicon" /><category term="m4" /><category term="prometheus" /><category term="grafana" /><category term="nginx" /><category term="kubernetes" /><category term="benchmarks" /><category term="sre" /><summary type="html"><![CDATA[I treated an M4 Mac Mini as a production-like inference environment — wired up Prometheus, Grafana, a kind cluster with nginx, and ran real load tests. Here's what the numbers actually showed.]]></summary></entry><entry><title type="html">What Is LLM Inference, Really? A Deep Technical Walkthrough</title><link href="https://kraghavan.github.io/llm-infrastructure/inference/2026/04/14/re-introduction-to-inference.html" rel="alternate" type="text/html" title="What Is LLM Inference, Really? A Deep Technical Walkthrough" /><published>2026-04-14T00:00:00+00:00</published><updated>2026-04-14T00:00:00+00:00</updated><id>https://kraghavan.github.io/llm-infrastructure/inference/2026/04/14/re-introduction-to-inference</id><content type="html" xml:base="https://kraghavan.github.io/llm-infrastructure/inference/2026/04/14/re-introduction-to-inference.html"><![CDATA[<p>Let me be honest with you. When I started working on LLM infrastructure, I had eleven years of distributed systems experience. I knew Kafka, Kubernetes, Prometheus. I could debug a partition rebalance in my sleep.</p>

<p>And yet the first time someone asked me <em>what actually happens during inference</em>, I said something like “the model reads the prompt and generates tokens.” Which is technically true the same way “a database reads your query and returns rows” is technically true — accurate, useless, and deeply embarrassing for someone drawing a principal engineer’s salary.</p>

<p>This post is what I wish I’d had on day one. We’re going to walk through the entire inference pipeline — from the moment your request arrives to the moment you see text on screen — with real examples, honest explanations of where the performance goes, and enough detail that you can actually reason about production problems.</p>

<p>No “and then the transformer does its thing.” No skipped steps. Strap in.</p>

<hr />

<figure style="max-width:480px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/gemini-generated-llm-inference-pipeline.png" alt="LLM Inference Pipeline" style="width:100%;" />
</figure>

<hr />

<h2 id="1-what-is-inference">1. What Is Inference?</h2>

<p><strong>Training</strong> is where you take a massive dataset, run it through a model millions of times, and slowly adjust billions of numerical weights until the model gets good at predicting the next word. Training is done once (or occasionally). It costs millions of dollars in GPU-hours and requires a team of researchers.</p>

<p><strong>Inference</strong> is what happens afterward, every time someone uses the model. It’s the model <em>using</em> those learned weights to respond to new input. No weights change. No learning happens. It’s pure forward-pass computation.</p>

<p>Think of it like this: training is baking the bread. Inference is slicing it and serving it to customers. The bread (weights) is done. The kitchen (inference engine) just has to plate it fast enough that the queue doesn’t back up to the street.</p>

<p>The inference engine is the runtime that takes the frozen model weights and executes them against an input. The same weights can run on Ollama, vLLM, TensorRT-LLM, or TGI — and get meaningfully different performance from each. The weights don’t change. The execution strategy does.</p>

<p>This distinction matters operationally: <strong>inference is not a solved problem</strong>. Serving a model efficiently at scale is a full engineering discipline.</p>

<hr />

<h2 id="2-the-artifact-whats-actually-in-that-10gb-download">2. The Artifact: What’s Actually in That 10GB Download?</h2>

<p>When you run <code class="language-plaintext highlighter-rouge">ollama pull mistral</code> or grab a model from HuggingFace, you aren’t just downloading a “program.” You’re downloading a massive, frozen brain in a box. If you’ve ever wondered why a model that “just chats” takes up 10GB of your SSD, it’s because it is packed with billions of tiny numerical “preferences” the model learned during its training phase.</p>

<p>Think of the <strong>GGUF</strong> (or <strong>Safetensors</strong>) file as a giant Ikea flat-pack box. To build the working model, you need two things: the <strong>Instruction Manual</strong> and the <strong>Hardware</strong>.</p>

<p>What’s inside a 7B parameter model file:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GGUF file structure (simplified):
├── Header
│   ├── Model architecture (LlamaForCausalLM)
│   ├── Vocabulary (32000 tokens + their embeddings)
│   ├── Context length (4096, 8192, etc.)
│   └── Hyperparameters (n_layers, n_heads, etc.)
│
└── Weight tensors:
    ├── token_embeddings        [32000 × 4096]   ← the embedding matrix
    ├── layer.0.attention.q     [4096 × 4096]    ← Query projection weights
    ├── layer.0.attention.k     [4096 × 4096]    ← Key projection weights
    ├── layer.0.attention.v     [4096 × 4096]    ← Value projection weights
    ├── layer.0.attention.out   [4096 × 4096]    ← Output projection
    ├── layer.0.ffn.up          [4096 × 11008]   ← Feed-forward up
    ├── layer.0.ffn.down        [11008 × 4096]   ← Feed-forward down
    ├── ... × 32 layers
    └── output_norm + lm_head   [32000 × 4096]   ← Final projection to logits
</code></pre></div></div>
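<p>To make the “header as manual” idea concrete, here’s a minimal sketch of parsing just the fixed-size fields at the front of a GGUF file. The layout (magic, version, tensor count, metadata count) follows the GGUF spec; <code>parse_gguf_header</code> is a name I made up, and a real parser would go on to read the metadata key/value section after these fields:</p>

```python
import struct

def parse_gguf_header(data: bytes) -> dict:
    # Fixed-size prefix of a GGUF file, little-endian:
    # 4-byte magic "GGUF", uint32 version, uint64 tensor count,
    # uint64 metadata key/value count. Real parsers keep reading the
    # metadata section (architecture, vocab, hyperparameters) after this.
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}

# Synthetic header so the sketch runs without a 4GB download:
header = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(parse_gguf_header(header))
```

<p>Everything after those 24 bytes is metadata and then the weight tensors themselves, which is why the header is kilobytes and the file is gigabytes.</p>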

<h3 id="the-manual-the-header">The “Manual” (The Header)</h3>
<p>This is the first few kilobytes of the file. It tells the inference engine (like Ollama or vLLM) how to put the brain together. It includes:</p>
<ul>
  <li><strong>The Architecture</strong>: Identifies the model type (e.g., <code class="language-plaintext highlighter-rouge">LlamaForCausalLM</code>) so the engine knows which math rules to apply.</li>
  <li><strong>The Vocabulary</strong>: A dictionary of roughly 32,000 to 128,000 “tokens” (the syllables the model speaks).</li>
  <li><strong>The Hyperparameters</strong>: Crucial settings like the number of layers (32 or 80) and the context length (how much it can remember).</li>
</ul>

<h3 id="the-hardware-the-tensors">The “Hardware” (The Tensors)</h3>
<p>The rest of that file is just rows and rows of numbers called <strong>Weights</strong>. Every inference request is essentially looking up values from these matrices and multiplying them together—32 times over.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">The “Part”</th>
      <th style="text-align: left">What it actually does in plain English</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><strong><code class="language-plaintext highlighter-rouge">token_embeddings</code></strong></td>
      <td style="text-align: left"><strong>The Translator.</strong> Turns human text into the model’s internal number-language.</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong><code class="language-plaintext highlighter-rouge">attention.q, k, v</code></strong></td>
      <td style="text-align: left"><strong>The Highlighters.</strong> Helps the model decide which part of your sentence is important.</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong><code class="language-plaintext highlighter-rouge">ffn.up</code> &amp; <code class="language-plaintext highlighter-rouge">ffn.down</code></strong></td>
      <td style="text-align: left"><strong>The Reasoning Muscles.</strong> Does the heavy lifting of processing and transforming information.</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong><code class="language-plaintext highlighter-rouge">lm_head</code></strong></td>
      <td style="text-align: left"><strong>The Microphone.</strong> Turns the final internal math back into a word you can read.</td>
    </tr>
  </tbody>
</table>

<h3 id="quantization-shrinking-the-brain">Quantization: Shrinking the Brain</h3>
<p>You might notice some files are 15GB while others are 4GB for the same model. This is <strong>Quantization</strong>—the art of compression. We turn high-precision 16-bit floats into lower-precision integers (like 4-bit).</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Precision</th>
      <th style="text-align: left">Bits per weight</th>
      <th style="text-align: left">7B Model Size</th>
      <th style="text-align: left">The SRE Reality</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><strong>FP16</strong></td>
      <td style="text-align: left">16</td>
      <td style="text-align: left">~14GB</td>
      <td style="text-align: left">Requires an A100. Pristine quality.</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>INT8</strong></td>
      <td style="text-align: left">8</td>
      <td style="text-align: left">~7GB</td>
      <td style="text-align: left">Fits on a high-end gaming GPU. Minimal loss.</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>INT4 (Q4_K_M)</strong></td>
      <td style="text-align: left">4</td>
      <td style="text-align: left">~4GB</td>
      <td style="text-align: left"><strong>The Sweet Spot.</strong> Fits on a MacBook. Faster throughput.</td>
    </tr>
  </tbody>
</table>
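<p>The “7B Model Size” column is just arithmetic: parameter count × bits per weight, ignoring the small overhead of headers and norm layers. A quick back-of-envelope check:</p>

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    # Approximate size of the weight tensors alone, in decimal gigabytes
    return n_params * bits_per_weight / 8 / 1e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {model_size_gb(7e9, bits):.1f} GB")
# FP16: 14.0 GB
# INT8: 7.0 GB
# INT4: 3.5 GB
```

<p>Real Q4_K_M files land a bit above the pure 4-bit number because some tensors are kept at higher precision to protect quality.</p>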

<p><strong>Why SREs love INT4:</strong> Lower precision = smaller tensors = faster memory transfers. Because decoding is memory-bound, an INT4 model often delivers <strong>20-40% better TPOT</strong> (time per output token, which translates to more tokens per second) than the “full” version: the memory bus isn’t screaming as loud.</p>

<p><strong>The takeaway:</strong> You aren’t executing code; you are loading a massive, math-heavy lookup table. GGUF is your single-file “box,” and quantization is how you fit that box into a smaller truck (your GPU).</p>

<hr />

<h2 id="3-the-three-phases-a-map-before-the-territory">3. The Three Phases: A Map Before the Territory</h2>

<p>Every inference request goes through three broad phases. They are not equally expensive, not equally parallelizable, and not equally friendly to your p99 latency.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌────────────────┬────────────────────────────────┬────────────────┐
│  TOKENIZATION  │  PREFILL (Prompt Processing)   │  DECODE LOOP   │
│     (CPU)      │        (GPU, compute)          │  (GPU, memory) │
│                │                                │                │
│  Text → IDs    │  Embed → Position → Attention  │  current →     │
│                │                                │  next token    │
└────────────────┴────────────────────────────────┴────────────────┘
     Fast          Scales with prompt length          Slow
</code></pre></div></div>

<ul>
  <li><strong>Tokenization</strong>: Split the text into token IDs the model understands. CPU-bound. Fast.</li>
  <li><strong>Prefill</strong>: Process the entire prompt through the model. GPU compute-bound. Scales with prompt length.</li>
  <li><strong>Decode</strong>: Generate output tokens one at a time. GPU memory-bound. Runs in a loop until done.</li>
</ul>

<p>Each phase has its own bottleneck. Let’s go through them one by one.</p>

<hr />

<h2 id="4-tokenization-chopping-text-into-numbers">4. Tokenization: Chopping Text Into Numbers</h2>

<p>Before a single GPU operation happens, your text has to be converted into a format the model can work with: a sequence of integers called token IDs.</p>

<p>A token is not a character, and it’s not a word. It’s a chunk of text that appears frequently enough in the training corpus to deserve its own ID. There are several ways to build this vocabulary — WordPiece (used by BERT), Unigram (used by SentencePiece), and others — but the dominant approach in modern LLMs is Byte Pair Encoding (BPE): a compression algorithm that iteratively merges the most common pairs of characters or subwords into single tokens until it reaches a target vocabulary size.</p>

<p>The result is a vocabulary of roughly 32,000–128,000 tokens, each with a corresponding integer ID. The model never sees your text — it sees a list of numbers.</p>

<figure style="max-width:720px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/encoding-bpe.png" alt="Tokenization with BPE" style="width:100%;" />
</figure>

<p>Take our example prompt: <code class="language-plaintext highlighter-rouge">"The cat sat"</code></p>

<p>After tokenization, this becomes something like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"The"  → 1026
" cat" → 5992
" sat" → 3290
</code></pre></div></div>
<p>Token IDs: <code class="language-plaintext highlighter-rouge">[1026, 5992, 3290]</code></p>

<p>Note the space before “cat” and “sat” — it’s part of the token. Tokenizers care about whitespace because it affects meaning and frequency.</p>
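<p>The merge loop at the heart of BPE is small enough to sketch in a few lines. This toy version trains on a three-word corpus; real tokenizers do the same thing over billions of words with tens of thousands of merges, and operate on bytes rather than characters:</p>

```python
from collections import Counter

def most_frequent_pair(words):
    # Count adjacent symbol pairs across the whole (weighted) corpus
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    # Replace every occurrence of `pair` with a single merged symbol
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word starting as individual characters
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
for _ in range(2):   # two merge steps
    words = merge_pair(words, most_frequent_pair(words))
print(words)
```

<p>After two merges the corpus represents “low” as a single symbol. Frequent sequences earn their own token IDs; rare ones stay split into subwords.</p>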

<h3 id="is-tokenization-cpu-bound">Is Tokenization CPU-Bound?</h3>

<p>Yes. The tokenizer is usually written in Rust (HuggingFace’s <code class="language-plaintext highlighter-rouge">tokenizers</code> crate) or C++ for exactly this reason. For most requests it’s fast enough to be invisible — microseconds for a short prompt.</p>

<p>Where it bites you: <strong>very long documents</strong> fed to batch processing jobs. A 100,000-token context requires processing 100,000 token lookups. It’s still fast relative to GPU work, but it’s the one step in the pipeline running on CPU that you can’t just throw more GPU at.</p>

<p><strong>How it’s improved:</strong> Parallelizing tokenization across CPU cores for batch workloads. Or — and this is the real fix — <strong>not re-tokenizing the same content repeatedly</strong>. If you have a shared system prompt you send to every request, tokenizing it once and caching the result is free latency.</p>
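<p>A minimal sketch of that caching fix, assuming the tokenizer call is deterministic and its result hashable. The toy <code>cached_token_ids</code> below stands in for a real <code>tokenizer.encode</code> call; the hashing trick is just a placeholder for actual vocabulary lookup:</p>

```python
from functools import lru_cache

SYSTEM_PROMPT = "You are a helpful assistant."

@lru_cache(maxsize=128)
def cached_token_ids(text: str) -> tuple[int, ...]:
    # Stand-in for the real tokenizer call (e.g. tokenizer.encode(text)).
    # Returning a tuple keeps the cached value hashable and immutable.
    return tuple(hash(word) % 32000 for word in text.split())

prefix = cached_token_ids(SYSTEM_PROMPT)          # tokenized once...
assert cached_token_ids(SYSTEM_PROMPT) is prefix  # ...then served from cache
```

<p>For a shared system prompt hit on every request, this turns repeated tokenization into a dictionary lookup.</p>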

<hr />

<h2 id="5-prefill-the-model-reads-your-prompt">5. Prefill: The Model Reads Your Prompt</h2>

<p>Now we have token IDs. The model needs to turn those IDs into something it can reason about. This is <strong>prefill</strong> — the model processing the entire prompt in one shot.</p>

<p>Prefill has two sub-steps that are easy to conflate: <strong>embedding lookup</strong> and the actual <strong>transformer forward pass</strong>. Let’s take them in order.</p>

<h3 id="the-embedding-matrix">The Embedding Matrix</h3>

<p>Every token ID maps to a high-dimensional vector of floating-point numbers called an <strong>embedding</strong>. These vectors live in the model’s embedding matrix — a giant lookup table with one row per vocabulary token and one column per embedding dimension.</p>

<p>For a model with a 32,000-token vocabulary and 4,096 embedding dimensions, this matrix has shape <code class="language-plaintext highlighter-rouge">[32000, 4096]</code>. At 16-bit float precision, that’s about 256MB just for the embedding layer.</p>

<p>Our example <code class="language-plaintext highlighter-rouge">[1026, 5992, 3290]</code> becomes:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Token ID 1026 → embedding row 1026 → [0.12, -0.43, 0.81, ..., 0.07]  (4096 values)
Token ID 5992 → embedding row 5992 → [-0.34, 0.91, 0.12, ..., -0.22] (4096 values)
Token ID 3290 → embedding row 3290 → [0.67, 0.05, -0.88, ..., 0.44]  (4096 values)
</code></pre></div></div>

<p>I’m simplifying to 8 dimensions here so this fits on a page. In reality it’s 4,096 or 8,192 dimensions depending on the model.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Simplified (3D instead of 4096D), just to show the shape:

"The"  → [0.12, -0.43,  0.81]
" cat" → [-0.34,  0.91,  0.12]
" sat" → [0.67,  0.05, -0.88]

Shape: [3 tokens × 3 dims] = a matrix of floats
</code></pre></div></div>

<p>These vectors aren’t random. They’re the result of training — the model has learned that “cat” and “dog” live close together in this space, and “cat” and “quantum mechanics” are far apart. The geometry encodes semantic meaning.</p>

<figure style="max-width:720px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/embedding-lookup-process.png" alt="Embedding Lookup Process" style="width:100%;" />
</figure>
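<p>Both ideas fit in a few lines of plain Python: the lookup is literally list indexing, and “close together in this space” is measured with cosine similarity. The matrix here is random, so don’t expect meaningful neighbours; in a trained model these rows are where the semantics live:</p>

```python
import math, random

random.seed(0)
VOCAB, DIM = 32000, 8    # 8 dims instead of 4096 so it fits on screen

# The embedding matrix: one learned row of floats per vocabulary token
embeddings = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]

def embed(token_ids):
    # Prefill's first step is literally row lookup, no computation
    return [embeddings[t] for t in token_ids]

def cosine(a, b):
    # How semantic "closeness" between two embedding vectors is measured
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

seq = embed([1026, 5992, 3290])    # "The", " cat", " sat"
print(len(seq), len(seq[0]))       # 3 8: three tokens, eight dims each
```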

<h3 id="how-do-the-model-weights-help-here">How Do the Model Weights Help Here?</h3>

<p>The embedding matrix IS part of the model weights. The 10GB (or 40GB, or 70GB) file you download — the GGUF or safetensors file — contains all the weight matrices the model learned during training. The embedding lookup is literally indexing into one of those weight matrices by row number.</p>

<p>When you run inference, you’re not computing anything creative. You’re doing matrix math against frozen numbers that were tuned over millions of training iterations.</p>

<hr />

<h2 id="6-positional-embeddings-teaching-the-model-about-order">6. Positional Embeddings: Teaching the Model About Order</h2>

<p>Here’s a problem: the embedding lookup is a table lookup. It doesn’t care that “cat” is token 2 and “sat” is token 3. Two requests with the same tokens in different orders would produce identical embeddings.</p>

<p>But order matters enormously. “The cat sat on the dog” and “The dog sat on the cat” have the same tokens and very different meanings.</p>

<p><strong>Positional embeddings</strong> solve this by adding a position-aware vector to each token’s embedding. The model learns that “token at position 1” feels different from “token at position 5,” even if the token ID is the same.</p>

<h3 id="how-is-it-calculated">How Is It Calculated?</h3>

<p>There are two main approaches:</p>

<p><strong>Sinusoidal (original Transformers paper):</strong> Compute a fixed sine/cosine wave pattern based on position and dimension index. Deterministic, no learned parameters.</p>

<p><strong>RoPE (Rotary Position Embedding):</strong> Used by Llama, Qwen, Mistral, and most modern models. Instead of adding a vector, it <em>rotates</em> the query and key vectors by an angle proportional to position. The result: the dot product between two token representations naturally captures their relative distance. Elegant, and generalizes better to longer contexts than the training data.</p>
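<p>A minimal sketch of the rotation itself, assuming the standard RoPE angle schedule where dimension pair <code>i</code> is rotated by <code>pos · base^(-2i/d)</code>. This is the idea, not a drop-in from any library. Two properties worth checking: position 0 is the identity, and rotation never changes a vector’s length, so only its direction encodes position:</p>

```python
import math

def rope(vec, pos, base=10000.0):
    # Rotate consecutive dimension pairs; pair i gets angle pos * base^(-2i/d).
    # Rotation preserves length, so only the direction encodes position.
    d = len(vec)
    out = list(vec)
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = vec[i], vec[i + 1]
        out[i] = x1 * c - x2 * s
        out[i + 1] = x1 * s + x2 * c
    return out

vec = [1.0, 0.0, 0.5, -0.5]
print(rope(vec, 0) == vec)   # True: position 0 is the identity rotation
```

<p>The payoff is the relative property: the dot product between a query rotated to position <code>m</code> and a key rotated to position <code>n</code> depends only on <code>m − n</code>, which is exactly what attention needs.</p>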

<p>Continuing our example. After adding positional information:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"The"  at position 0: [0.12, -0.43, 0.81] + pos(0) → [0.15, -0.40, 0.84]
" cat" at position 1: [-0.34, 0.91, 0.12] + pos(1) → [-0.31, 0.88, 0.09]
" sat" at position 2: [0.67, 0.05, -0.88] + pos(2) → [0.65, 0.03, -0.86]
</code></pre></div></div>

<p>The position vectors are small adjustments. Their real value is that when attention is computed later, the model can tell whether two tokens are adjacent or 200 positions apart.</p>

<figure style="max-width:720px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/positional-embeddings-diagram.png" alt="Positional Embeddings Diagram" style="width:100%;" />
</figure>

<h3 id="cpu-bottleneck-in-prefill">CPU Bottleneck in Prefill?</h3>

<p>Embedding lookup and positional encoding are fast operations. The real CPU bottleneck in prefill is less about these steps and more about <strong>data movement</strong>: loading the right weight tensors from CPU RAM to GPU VRAM before the transformer forward pass can begin.</p>

<p>For very large models that don’t fully fit in VRAM, the CPU-GPU transfer becomes the bottleneck — you’re constantly paging weight blocks in. This is why model quantization matters: a 4-bit quantized model uses less VRAM, fits entirely on GPU, and eliminates this transfer overhead. More on that in a moment.</p>

<hr />

<h2 id="7-the-transformer-layers-where-the-real-work-happens">7. The Transformer Layers: Where the Real Work Happens</h2>

<p>After embedding + positional encoding, we have a matrix of shape <code class="language-plaintext highlighter-rouge">[sequence_length × embedding_dim]</code>. This matrix now passes through N transformer layers — 32 layers for Llama-3.1-8B, 80 layers for Llama-3.1-70B.</p>

<p>Each layer applies:</p>
<ol>
  <li><strong>Self-attention</strong>: every token looks at every other token and decides what’s relevant</li>
  <li><strong>Feed-forward network (FFN)</strong>: each token’s representation is independently transformed</li>
</ol>

<p>This is where the model’s “reasoning” happens — and where most of the GPU compute goes during prefill. All tokens are processed in parallel within a layer, making prefill compute-bound. More tokens = more compute = higher TTFT.</p>

<p>We’ll cover the attention mechanism in detail in section 9. First, let’s see what comes out.</p>

<hr />

<h2 id="8-decoding-one-token-at-a-time-forever">8. Decoding: One Token at a Time, Forever</h2>

<p>After prefill, the model produces its first output token. Then it produces another. Then another. Each token depends on all previous tokens. This is the <strong>decode loop</strong>.</p>

<p>Here’s what makes decode fundamentally different from prefill: <strong>you can’t parallelize it</strong>. Token N can’t be computed until token N-1 exists. It’s inherently sequential.</p>

<p>Let’s walk through two steps with our example. Our prompt was “The cat sat” and let’s say the model is going to output “on the mat.”</p>

<h3 id="decode-step-1-predicting-on">Decode Step 1: Predicting “on”</h3>

<p>After prefill, we have KV cache entries for “The”, “ cat”, “ sat” (we’ll explain KV cache shortly). Now:</p>

<ol>
  <li>The model takes the last token’s representation (“ sat”) and runs it through the transformer layers</li>
  <li>At each layer, attention is computed between “ sat” and all previous tokens via the KV cache</li>
  <li>The final layer outputs a vector of size <code class="language-plaintext highlighter-rouge">[vocabulary_size]</code> — one score per possible next token. This is called the <strong>logits</strong> vector.</li>
  <li>The logits are converted to probabilities via softmax</li>
  <li>A token is sampled from this distribution (more below)</li>
  <li>Result: token ID for “ on” → <code class="language-plaintext highlighter-rouge">" on"</code> is emitted as the first output token</li>
</ol>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>KV cache: ["The", " cat", " sat"]
Current:  " sat" (last input token)
Attention: " sat" attends to "The", " cat", " sat"
Output probs:  [0.001, 0.003, ..., 0.45 (" on"), ..., 0.002]  ← post-softmax
Sample: " on" ✓
</code></pre></div></div>

<h3 id="decode-step-2-predicting-the">Decode Step 2: Predicting “the”</h3>

<p>Now “ on” has been generated. We add it to context:</p>

<ol>
  <li>Embed token “ on” → one new embedding vector (just <em>one</em> token, not the whole sequence)</li>
  <li>Add positional embedding for position 4</li>
  <li>Run through transformer layers, attending to KV cache for [“The”, “ cat”, “ sat”, “ on”]</li>
  <li>Output logits → sample → “ the”</li>
</ol>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>KV cache: ["The", " cat", " sat", " on"]  ← one entry added
Current:  " on" (new last token)
Attention: " on" attends to all four previous tokens
Output probs:  [..., 0.67 (" the"), ...]  ← post-softmax
Sample: " the" ✓
</code></pre></div></div>

<p>And so it continues: “ mat” → “.” → <code class="language-plaintext highlighter-rouge">&lt;end&gt;</code> token → stop.</p>
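<p>The whole loop fits in a few lines once you squint past the model internals. Here <code>next_token</code> and the <code>CONTINUATION</code> table are stand-ins for the real forward pass plus sampling. The point is the shape of the loop: one token in, one KV entry appended, one token out, until the end marker:</p>

```python
# Toy autoregressive loop. CONTINUATION stands in for the entire model:
# real engines compute logits and sample; the shape of the loop is the point.
PROMPT = ["The", " cat", " sat"]
CONTINUATION = {" sat": " on", " on": " the", " the": " mat", " mat": ".", ".": "<end>"}

def next_token(last_token, kv_cache):
    kv_cache.append(last_token)       # exactly one new KV entry per step
    return CONTINUATION[last_token]   # stand-in for forward pass + sampling

kv_cache = PROMPT[:-1]                # prefill already cached the earlier tokens
output, token = [], PROMPT[-1]
while True:
    token = next_token(token, kv_cache)
    if token == "<end>":
        break
    output.append(token)

print("".join(output))   # " on the mat."
print(len(kv_cache))     # 7: every prompt and generated token is now cached
```

<p>Note what never happens here: the prompt is never reprocessed. Only the newest token runs through the model; everything earlier is served from the cache.</p>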

<figure style="max-width:720px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/key-value-cache-growth.png" alt="Key Value Cache Growth" style="width:100%;" />
</figure>

<h3 id="the-sampling-step-where-creativity-lives">The Sampling Step (Where Creativity Lives)</h3>

<p>The logits give you a probability distribution. How you sample from it is where temperature, top-k, and top-p come in:</p>

<ul>
  <li><strong>Greedy (temperature=0)</strong>: always pick the highest probability token. Deterministic. Boring for creative tasks, good for code.</li>
  <li><strong>Temperature &gt; 1</strong>: flatten the distribution. More randomness, more surprising outputs, more hallucinations.</li>
  <li><strong>Temperature &lt; 1</strong>: sharpen the distribution. More conservative, more predictable.</li>
  <li><strong>Top-k</strong>: only sample from the top K most probable tokens. Ignores the long tail.</li>
  <li><strong>Top-p (nucleus sampling)</strong>: only sample from the smallest set of tokens whose cumulative probability exceeds p. Adaptive — sometimes that’s 2 tokens, sometimes 50.</li>
</ul>

<p>This step is trivially cheap computationally but has enormous impact on output quality. As an SRE, you don’t usually tune this — but you will get bug reports when someone’s temperature=2.0 config makes the model output Shakespeare from a JSON API endpoint.</p>
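<p>All of these strategies are a few lines of work over the same probability vector. Here’s a sketch combining temperature and top-p; the function name and toy logits are mine, not lifted from any particular engine:</p>

```python
import math, random

def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    if temperature == 0:
        # Greedy: argmax, fully deterministic
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]   # <1 sharpens, >1 flattens
    m = max(scaled)                              # subtract max for stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]            # softmax
    # Nucleus: smallest set of tokens whose cumulative probability >= top_p
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

random.seed(0)
logits = [1.0, 3.0, 0.5, 2.0]          # toy 4-token vocabulary
print(sample(logits, temperature=0))   # 1: index of the highest logit
```

<p>With a tight top-p, a dominant token gets picked every time; loosen it and the long tail starts leaking through, which is exactly the creativity/hallucination dial.</p>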

<hr />

<h2 id="9-why-memory-is-the-decode-bottleneck">9. Why Memory Is the Decode Bottleneck</h2>

<p>Here’s the thing about decode that makes it hard to optimize: on every single decode step, the model needs to run attention against <strong>all previous tokens</strong>. Not a summary of them. All of them. Via the KV cache.</p>

<p>For a 1000-token conversation, every decode step reads 1000 rows of KV tensors from GPU memory. For a 32-layer model, that’s 32 reads of a large tensor. For a 7B model, the KV entry for a single token at a single layer is tens of kilobytes.</p>
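<p>The arithmetic is worth doing once. A back-of-envelope sketch, assuming Llama-2-7B-like dimensions (32 layers, 32 KV heads, head dimension 128, fp16); these dimensions are assumptions for illustration, not read from any specific checkpoint:</p>

```python
# Back-of-envelope KV cache sizing for a 7B-class model (assumed dims).
layers, kv_heads, head_dim, bytes_per_val = 32, 32, 128, 2

kv_per_token_per_layer = 2 * kv_heads * head_dim * bytes_per_val   # K and V
kv_per_token = layers * kv_per_token_per_layer

print(kv_per_token_per_layer // 1024)   # 16  -> ~16 KiB per token per layer
print(kv_per_token // 1024)             # 512 -> ~0.5 MiB per token, all layers

# A 1000-token conversation therefore re-reads roughly half a gigabyte of
# KV tensors from HBM on *every* decode step.
print(f"{1000 * kv_per_token / 1e9:.2f} GB per decode step")
```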

<p>The GPU’s compute cores can execute these operations fast. But they’re waiting on HBM (High Bandwidth Memory) to deliver the data. The memory bus saturates before the compute does.</p>

<p><strong>The GPU is memory-bound during decode, not compute-bound.</strong></p>

<p>This is why adding more CUDA cores doesn’t help decode performance as much as adding more memory bandwidth. It’s why H100s are faster than A100s for serving despite similar compute specs — the memory bandwidth jump matters more for decode than the FLOP count.</p>

<p>A rough intuition: during prefill, GPU utilization is high and memory is barely stressed. During decode, GPU compute is mostly idle and the memory bus is screaming.</p>
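<p>You can put a rough ceiling on single-request decode speed with a roofline-style estimate: every decode step streams the full weight set (plus KV cache) through HBM, so tokens/sec cannot exceed bandwidth divided by bytes moved. The bandwidth figures below are approximate spec-sheet values used for illustration:</p>

```python
# Roofline-style decode ceiling: tokens/sec <= bandwidth / bytes_read_per_step.
# Ignores KV cache reads and kernel overheads, so real numbers are lower.
weights_gb = 14.0          # 7B params in fp16 ~ 14 GB

for name, bw_gbs in [("A100 (HBM2e, ~2.0 TB/s)", 2000),
                     ("H100 (HBM3,  ~3.3 TB/s)", 3350)]:
    ceiling = bw_gbs / weights_gb
    print(f"{name}: <= {ceiling:.0f} tokens/s per request (ideal)")
```

The FLOP budget at those speeds is nowhere near exhausted, which is the "compute mostly idle, memory bus screaming" picture in numbers.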

<hr />

<h2 id="10-kv-cache-the-most-important-data-structure-in-inference">10. KV Cache: The Most Important Data Structure in Inference</h2>

<p>We keep mentioning the KV cache. Let’s make it concrete.</p>

<p>During the attention step in each transformer layer, every token produces two vectors: a <strong>Key</strong> (K) and a <strong>Value</strong> (V). These are used in attention computation: other tokens use your Key to decide how much to attend to you, and then use your Value to extract information from you.</p>

<p>During decode, token N needs to compute attention against tokens 1 through N-1. If we recomputed K and V for all previous tokens on every step, we’d be doing O(N²) work per decode step. That’s catastrophic.</p>

<p>The KV cache solves this: after computing K and V for a token, we <strong>store</strong> them. On the next decode step, we only compute K and V for the new token and look up the rest from cache.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>After prefill of "The cat sat":
KV cache = {
  layer_0: { K: [k_The, k_cat, k_sat], V: [v_The, v_cat, v_sat] }
  layer_1: { K: [...], V: [...] }
  ...  (32 layers total)
}

After generating " on":
KV cache = {
  layer_0: { K: [k_The, k_cat, k_sat, k_on], V: [v_The, v_cat, v_sat, v_on] }
  ...
}
← one new entry appended per layer per decode step
</code></pre></div></div>

<p>The cache grows with every token generated. When the KV cache fills GPU memory:</p>
<ul>
  <li>New requests queue (they have nowhere to store their KV tensors)</li>
  <li>Long requests get partially evicted and have to recompute (latency spike)</li>
  <li>In the worst case: OOM crash</li>
</ul>

<p>KV cache occupancy is the single most important resource to monitor in a serving system. It determines how many concurrent requests you can serve and how long those requests can be. When <code class="language-plaintext highlighter-rouge">vllm:gpu_cache_usage_perc</code> starts approaching 0.9, you’re about to have a bad time.</p>
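<p>The capacity math is simple enough to do on a napkin. A sketch with illustrative, assumed numbers (7B fp16 model on an 80 GB GPU, ~0.5 MiB of KV per token as estimated earlier for a 32-layer model):</p>

```python
# How many tokens of KV cache fit in the memory left after weights + overhead?
# All inputs are illustrative assumptions, not measurements.
gpu_gb, weights_gb, overhead_gb = 80, 14, 6
kv_per_token_mib = 0.5

free_gb = gpu_gb - weights_gb - overhead_gb
max_tokens = int(free_gb * 1024 / kv_per_token_mib)
print(max_tokens)                      # ~120k tokens of KV capacity

# At ~2k tokens per conversation, that bounds concurrency at roughly:
print(max_tokens // 2000)
```

This is the number that silently caps your concurrency, and why "add more replicas" is often the answer long before the GPU's compute is busy.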

<h3 id="prefix-caching-free-speedups">Prefix Caching: Free Speedups</h3>

<p>If two requests share the same system prompt, their KV cache entries for that prefix are identical. Prefix caching stores those entries once and reuses them.</p>

<p>In practice on my M4 Mac Mini:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Without prefix cache (cold):  TTFT 753ms
With prefix cache (warm):     TTFT 367ms
Savings:                       51% reduction in TTFT
Zero model changes required.
</code></pre></div></div>

<p>This is why your production system prompt should sit at the beginning of every request, and why it should be identical byte-for-byte across requests. Drift in the system prompt = cache misses = higher TTFT = sad users.</p>

<hr />

<h2 id="11-attention-the-mechanism-that-makes-it-work">11. Attention: The Mechanism That Makes It Work</h2>

<p>Let’s go one level deeper into what happens at each transformer layer. Attention is the core operation. Everything else is bookkeeping.</p>

<h3 id="self-attention-in-plain-english">Self-Attention in Plain English</h3>

<p>For each token, attention computes: <em>“which other tokens should I be paying attention to, and by how much?”</em></p>

<p>It does this via three learned projections of each token’s embedding:</p>
<ul>
  <li><strong>Query (Q)</strong>: “what am I looking for?”</li>
  <li><strong>Key (K)</strong>: “what do I offer?”</li>
  <li><strong>Value (V)</strong>: “what information do I contain?”</li>
</ul>

<p>For each token, you compute the dot product of its Query with every other token’s Key. High dot product = high attention score = attend more to that token. The scores are normalized via softmax, then used to compute a weighted sum of all the Value vectors.</p>

<p>Example with “The cat sat”:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Token: " sat"
Q_sat · K_The  = 0.8  → attend heavily to "The"
Q_sat · K_cat  = 0.9  → attend most to " cat" (makes sense)
Q_sat · K_sat  = 0.3  → attend a little to itself

Attention weights after softmax: [0.35, 0.55, 0.10]

Output = 0.35 × V_The + 0.55 × V_cat + 0.10 × V_sat
</code></pre></div></div>

<p>The output for “ sat” is now a blend of information from all tokens, weighted by relevance. After 32 such layers, the model has a rich, contextualized representation of every token in the sequence.</p>
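<p>The whole worked example above is just softmax(Q K^T / sqrt(d)) V. A minimal single-head numpy sketch with random toy vectors (no learned projections, no causal mask):</p>

```python
import numpy as np

def attention(Q, K, V):
    """Single-head scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)          # one score per (query, key) pair
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)       # softmax -> attention weights
    return w @ V                           # weighted sum of Value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # 3 tokens ("The", " cat", " sat"), dim 8
K = rng.normal(size=(3, 8))
V = rng.normal(size=(3, 8))
out = attention(Q, K, V)
print(out.shape)              # (3, 8): one contextualized vector per token
```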

<figure style="max-width:720px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/attention-mechanism.png" alt="Attention Mechanism" style="width:100%;" />
</figure>

<h3 id="paged-attention-virtual-memory-for-kv-cache">Paged Attention: Virtual Memory for KV Cache</h3>

<p>Standard attention assumes the KV cache is one contiguous block of memory per request. This is wasteful: you have to pre-allocate the maximum possible sequence length upfront, and if generation stops short of that maximum, the unused slots sit reserved, wasted, until the request completes.</p>

<p><strong>PagedAttention</strong> (the key innovation in vLLM) borrows from OS virtual memory. Instead of one contiguous block, KV cache is stored in <strong>fixed-size pages</strong> (blocks) that can be non-contiguous in physical GPU memory. A page table maps logical token positions to physical blocks.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Without PagedAttention:
  Request A: [KKKKKKKKKK........] ← pre-allocated 20 slots, using 10, wasting 10
  Request B: [KKKKKK............] ← pre-allocated 20 slots, using 6, wasting 14

With PagedAttention:
  Block pool: [B1][B2][B3][B4][B5][B6][B7][B8]...
  Request A:  blocks B1, B3, B7 (non-contiguous, allocated on demand)
  Request B:  blocks B2, B4    (uses only what it needs)
</code></pre></div></div>

<p>The result: <strong>much higher memory utilization</strong>, more concurrent requests, less waste. Prefix cache blocks can be shared between requests with identical prefixes — only one copy of the system prompt’s KV entries needed, regardless of how many requests use it.</p>

<p>PagedAttention is why vLLM typically serves 2-4x more concurrent requests than a naive implementation on the same hardware.</p>
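<p>The page-table idea fits in a few lines. A toy allocator sketch (simplified; not vLLM's actual block manager, and a tiny block size of 4 just to keep the trace readable):</p>

```python
# Toy PagedAttention-style block table: logical token positions map to
# fixed-size physical blocks allocated on demand from a shared pool.
BLOCK = 4

class BlockTable:
    def __init__(self, pool):
        self.pool = pool          # free physical block ids, shared across requests
        self.blocks = []          # logical block index -> physical block id

    def append_token(self, pos):
        if pos // BLOCK >= len(self.blocks):   # crossed into a new block?
            self.blocks.append(self.pool.pop(0))

    def physical(self, pos):
        return (self.blocks[pos // BLOCK], pos % BLOCK)

pool = list(range(8))             # 8 physical blocks: B0..B7
a, b = BlockTable(pool), BlockTable(pool)
a.append_token(0)                 # A starts: claims physical block 0
b.append_token(0)                 # B starts: claims physical block 1
for pos in range(1, 6):
    a.append_token(pos)           # A grows past block 0: claims block 2
print(a.blocks, b.blocks)         # [0, 2] [1]: A's blocks are non-contiguous
print(a.physical(5))              # (2, 1): token 5 lives in block 2, slot 1
```

Nothing is pre-allocated: each request holds exactly the blocks it has filled, and a freed block goes straight back into the shared pool.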

<hr />

<h2 id="12-continuous-batching-the-throughput-unlock">12. Continuous Batching: The Throughput Unlock</h2>

<p>Early inference servers were naive: accept a batch of requests, run them all through the model together, return all responses. Simple.</p>

<p>The problem: requests finish at different times. Short requests had to wait for long requests to complete before the next batch could start. GPU utilization looked like a sawtooth wave.</p>

<p><strong>Continuous batching</strong> (also called iteration-level scheduling) fixes this. Instead of batching at the request level, the inference engine batches at the <strong>decode step</strong> level. Every iteration, it assembles the currently active tokens — some mid-generation, some just starting — into a single GPU operation.</p>

<p>When a request finishes, its slot is immediately available for a new request. When a new request arrives, it joins the active batch at the next iteration rather than waiting for the next batch boundary.</p>

<p>The result: GPU utilization stays high, latency for new requests is low, and throughput scales with the number of concurrent requests the KV cache can support — not with some fixed batch size parameter you tuned last Tuesday.</p>

<p>vLLM, TGI, and TensorRT-LLM all implement continuous batching. Ollama does not (as of early 2025). This is one of the primary reasons vLLM serves at 2x the throughput of Ollama under concurrent load.</p>
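<p>The scheduling loop itself is the key idea, and it is small. A toy iteration-level scheduler (a sketch of the concept, not any engine's real code; request ids and lengths are made up):</p>

```python
from collections import deque

# Every iteration the active set is rebuilt: finished requests leave
# immediately and queued ones join, instead of waiting for a batch to drain.
MAX_ACTIVE = 2
waiting = deque([("A", 3), ("B", 1), ("C", 2)])   # (request id, tokens to generate)
active, done, step = {}, [], 0

while waiting or active:
    while waiting and len(active) < MAX_ACTIVE:   # admit new work each iteration
        rid, need = waiting.popleft()
        active[rid] = need
    for rid in list(active):                      # one decode step for everyone
        active[rid] -= 1
        if active[rid] == 0:
            del active[rid]                       # slot frees *immediately*
            done.append((rid, step))
    step += 1

print(done)   # [('B', 0), ('A', 2), ('C', 2)]: C started the moment B finished
```

With request-level batching, C would have waited until both A and B were done; here it joins at step 1, one iteration after B's slot opens.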

<hr />

<h2 id="13-the-metrics-that-matter">13. The Metrics That Matter</h2>

<p>Now that you know what’s happening, the metrics become obvious rather than mysterious.</p>

<figure style="max-width:900px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/grafana-metrics.png" alt="Grafana Metrics Dashboard" style="width:100%;" />
  <figcaption style="font-size:0.85rem;color:#888;margin-top:0.5rem;">Real metrics from a running llm-d deployment: TTFT p50 at 15ms, ITL p50 at 5ms, KV cache prefix hit rate at 80.6% — exactly the numbers you should have on your wall during an incident.</figcaption>
</figure>

<h3 id="ttft--time-to-first-token">TTFT — Time to First Token</h3>

<p><strong>Definition:</strong> Wall clock time from request submission to first output token.</p>

<p><strong>What it captures:</strong> Prefill time + queuing time. If your GPU is busy, TTFT absorbs the wait.</p>

<p><strong>PromQL:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>histogram_quantile(0.95,
  sum(rate(vllm:time_to_first_token_seconds_bucket[2m])) by (le)
)
</code></pre></div></div>

<p><strong>SLO guidance:</strong> &lt; 500ms for interactive chat. &lt; 2s is tolerable. &gt; 5s and users think it’s broken.</p>

<p><strong>Root causes when high:</strong></p>
<ul>
  <li>Long prompts (expected — prefill scales with length)</li>
  <li>GPU under heavy load, requests queuing</li>
  <li>Insufficient prefill capacity (in P/D disaggregated setups)</li>
</ul>

<hr />

<h3 id="itl--inter-token-latency-aka-tpot">ITL — Inter-Token Latency (aka TPOT)</h3>

<p><strong>Definition:</strong> Average time between consecutive output tokens during decode. Inverse of token generation speed.</p>

<p><strong>What it captures:</strong> Decode throughput per active request. Memory bandwidth is the primary lever.</p>

<p><strong>PromQL:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>histogram_quantile(0.95,
  sum(rate(vllm:time_per_output_token_seconds_bucket[2m])) by (le)
)
</code></pre></div></div>

<p><strong>SLO guidance:</strong> &lt; 30ms is fast, streaming feels smooth. &gt; 100ms and you notice the typewriter effect.</p>

<p><strong>Root causes when high:</strong></p>
<ul>
  <li>Too many concurrent requests (KV cache reads competing)</li>
  <li>Large model + small GPU memory bandwidth</li>
  <li>KV cache approaching capacity</li>
</ul>
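<p>Both TTFT and ITL fall out directly from per-token arrival timestamps on a streamed response; the server-side <code class="language-plaintext highlighter-rouge">vllm:*</code> histograms aggregate the same quantities. A sketch with hypothetical timestamps:</p>

```python
# Deriving TTFT and ITL from one streamed response's token arrival times.
def ttft_and_itl(request_sent_s, token_arrival_s):
    ttft = token_arrival_s[0] - request_sent_s          # queue wait + prefill
    gaps = [b - a for a, b in zip(token_arrival_s, token_arrival_s[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0        # mean decode-step gap
    return ttft, itl

# Hypothetical timestamps (seconds): first token at +0.45s, then ~30ms apart.
sent = 10.00
arrivals = [10.45, 10.48, 10.51, 10.54, 10.57]
ttft, itl = ttft_and_itl(sent, arrivals)
print(f"TTFT {ttft*1000:.0f} ms, ITL {itl*1000:.0f} ms")
```

Measuring this client-side is a useful cross-check: if client TTFT is much worse than the server histogram, the gap lives in your network path or gateway, not in the engine.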

<hr />

<h3 id="kv-cache-hit-ratio">KV Cache Hit Ratio</h3>

<p><strong>Definition:</strong> Fraction of prompt tokens whose KV vectors were served from prefix cache vs recomputed.</p>

<p><strong>What it captures:</strong> Effectiveness of prefix caching. High hit ratio = lower TTFT for repeated system prompts.</p>

<p><strong>PromQL:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vllm:gpu_prefix_cache_hit_rate
</code></pre></div></div>

<p><strong>Target:</strong> &gt; 0.5 for most chat workloads with consistent system prompts. Near 0 means your prompts are fully unique (batch processing, no shared prefix).</p>

<hr />

<h3 id="kv-cache-usage">KV Cache Usage</h3>

<p><strong>Definition:</strong> Fraction of total KV cache capacity currently occupied.</p>

<p><strong>PromQL:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vllm:gpu_cache_usage_perc
</code></pre></div></div>

<p><strong>Alert threshold:</strong> &gt; 0.85. At 0.9+, vLLM starts evicting in-progress requests, causing recomputation and latency spikes. At 1.0, new requests queue entirely.</p>

<hr />

<h3 id="scaling-strategy-by-traffic-shape">Scaling Strategy by Traffic Shape</h3>

<p>This is where the SRE work actually lives:</p>

<table>
  <thead>
    <tr>
      <th>Traffic Pattern</th>
      <th>Symptom</th>
      <th>Action</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Many short prompts, many users</td>
      <td>TTFT fine, ITL rising, KV cache full</td>
      <td>Scale out (more replicas), or reduce <code class="language-plaintext highlighter-rouge">max_model_len</code></td>
    </tr>
    <tr>
      <td>Few long prompts, long outputs</td>
      <td>TTFT high, ITL high, KV cache full fast</td>
      <td>Larger GPU memory, or P/D disaggregation</td>
    </tr>
    <tr>
      <td>Bursty traffic, idle baseline</td>
      <td>P99 TTFT spikes, P50 fine</td>
      <td>Horizontal scaling + request queuing</td>
    </tr>
    <tr>
      <td>Consistent system prompt across requests</td>
      <td>High TTFT on cold start only</td>
      <td>Enable prefix caching (on by default in recent vLLM releases; opt-in on older ones)</td>
    </tr>
    <tr>
      <td>Mixed short and long context</td>
      <td>Unpredictable KV usage</td>
      <td>Set per-request <code class="language-plaintext highlighter-rouge">max_tokens</code> limits strictly</td>
    </tr>
  </tbody>
</table>

<p><strong>The strategic insight:</strong> short prompt, many concurrent users → <strong>decode is the bottleneck</strong>, optimize for memory bandwidth and parallelism across requests. Long context, few users → <strong>prefill is the bottleneck</strong> and KV cache pressure is the constraint; P/D disaggregation helps by giving prefill its own GPU.</p>

<hr />

<h2 id="14-where-does-the-time-actually-go">14. Where Does the Time Actually Go?</h2>

<p>After running these experiments on real hardware, here’s the honest answer:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Typical inference request (short prompt, moderate load):

Tokenization:          ~0.5ms   (CPU, negligible)
Embedding lookup:      ~1ms     (GPU memory read)
Prefill (32 layers):   ~40ms    (GPU compute, scales with prompt length)
First token decode:    ~3-5ms   (GPU memory read, KV cache write)
...each subsequent token: ~3-5ms

Total TTFT:            ~45ms under no load
Total TTFT at p95:     300-500ms under load (queuing dominates)
</code></pre></div></div>

<p><strong>Where load makes it worse:</strong></p>

<ol>
  <li><strong>Queuing before prefill starts</strong>: your request sits behind other long prefills. TTFT absorbs the entire queue wait.</li>
  <li><strong>KV cache contention during decode</strong>: more concurrent requests = more KV cache reads per step = higher ITL for everyone.</li>
  <li><strong>Memory fragmentation</strong>: without PagedAttention, wasted KV cache blocks reduce effective concurrency.</li>
</ol>

<p><strong>What’s predicted to improve this:</strong></p>

<ul>
  <li><strong>Speculative decoding</strong>: a small “draft” model generates 4-5 tokens speculatively; the large model verifies them in one forward pass. If accepted, 4 tokens for the price of ~1.5 forward passes. Reduces ITL dramatically under low concurrency, hurts at high concurrency (wasted draft compute).</li>
  <li><strong>P/D disaggregation</strong>: dedicated prefill GPUs handle prompt processing, dedicated decode GPUs handle generation. Eliminates the resource contention between phases. Requires fast interconnect (NVLink or RDMA) for KV transfer.</li>
  <li><strong>Flash Attention 3</strong>: kernel-level optimization that keeps attention computation in SRAM longer, reducing HBM reads. Already default in vLLM on H100.</li>
</ul>
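<p>The speculative decoding trade-off can be quantified. Under the common simplifying assumption that each draft token is accepted independently with probability alpha, the expected number of tokens produced per target-model verification pass with draft length k is (1 − alpha^(k+1)) / (1 − alpha); the geometric series reflects stopping at the first rejection, plus one token from the target model itself:</p>

```python
# Expected tokens per target-model forward pass in speculative decoding,
# under the i.i.d. acceptance assumption (a simplification; real acceptance
# rates vary by position and prompt).
def expected_tokens(alpha, k):
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.6, 0.8, 0.9):
    print(f"alpha={alpha}: {expected_tokens(alpha, k=4):.2f} tokens/verify pass")
```

At alpha = 0.8 and k = 4 you get about 3.4 tokens per verification pass, which is where the "4 tokens for ~1.5 forward passes" intuition comes from; at low acceptance rates the draft compute is mostly wasted.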

<hr />

<h2 id="15-the-inference-engines-worth-knowing">15. The Inference Engines Worth Knowing</h2>

<p>Not all inference engines are created equal, and the right tool depends on your constraints.</p>

<figure style="max-width:800px;margin:2rem auto;text-align:center;">
  <img src="/assets/images/llm-inference/llm-inference-engine-comparison.png" alt="LLM Inference Engines Comparison" style="width:100%;" />
</figure>

<h3 id="ollama">Ollama</h3>
<ul>
  <li><strong>Origin:</strong> Open source, community</li>
  <li><strong>Strengths:</strong> Dead simple to run, supports Apple Silicon natively, GGUF format</li>
  <li><strong>Weaknesses:</strong> No continuous batching (as of early 2025), worse throughput under concurrent load</li>
  <li><strong>When to use:</strong> Local development, single-user experiments, quick model testing</li>
  <li><strong>When not to use:</strong> Serving more than one user concurrently</li>
  <li><strong>GitHub:</strong> <a href="https://github.com/ollama/ollama">github.com/ollama/ollama</a></li>
</ul>

<h3 id="vllm">vLLM</h3>
<ul>
  <li><strong>Origin:</strong> UC Berkeley, now a major open-source project with significant industry contributors</li>
  <li><strong>Strengths:</strong> PagedAttention, continuous batching, Prometheus metrics out of the box, P/D disaggregation via llm-d</li>
  <li><strong>Weaknesses:</strong> More complex setup, CUDA-first (Apple Silicon support via vllm-metal is experimental)</li>
  <li><strong>When to use:</strong> Production serving, multi-user, research with real load</li>
  <li><strong>When not to use:</strong> You just want to run one model locally and don’t want to think about it</li>
  <li><strong>GitHub:</strong> <a href="https://github.com/vllm-project/vllm">github.com/vllm-project/vllm</a></li>
</ul>

<h3 id="tgi-text-generation-inference">TGI (Text Generation Inference)</h3>
<ul>
  <li><strong>Origin:</strong> HuggingFace</li>
  <li><strong>Strengths:</strong> First-class support for new HuggingFace models, FlashAttention, tensor parallelism</li>
  <li><strong>Weaknesses:</strong> Slower to adopt innovations than vLLM, somewhat opinionated config</li>
  <li><strong>When to use:</strong> You’re already in the HuggingFace ecosystem and want good defaults</li>
  <li><strong>GitHub:</strong> <a href="https://github.com/huggingface/text-generation-inference">github.com/huggingface/text-generation-inference</a></li>
</ul>

<h3 id="tensorrt-llm">TensorRT-LLM</h3>
<ul>
  <li><strong>Origin:</strong> NVIDIA</li>
  <li><strong>Strengths:</strong> Best possible performance on NVIDIA hardware, optimized kernels, inference graph compilation</li>
  <li><strong>Weaknesses:</strong> NVIDIA-only, complex setup, compiled engines are model+hardware-specific (can’t move them)</li>
  <li><strong>When to use:</strong> You have a fixed model, fixed NVIDIA hardware, and need maximum performance</li>
  <li><strong>When not to use:</strong> You want flexibility, you’re running experiments, or you don’t own your hardware</li>
  <li><strong>GitHub:</strong> <a href="https://github.com/NVIDIA/TensorRT-LLM">github.com/NVIDIA/TensorRT-LLM</a></li>
</ul>

<h3 id="the-meta--research-options">The META / Research Options</h3>
<ul>
  <li><strong>llama.cpp</strong> — <a href="https://github.com/ggerganov/llama.cpp">github.com/ggerganov/llama.cpp</a>: The CPU-first runtime. Runs quantized models on CPU, reasonably fast, the ancestor of Ollama.</li>
  <li><strong>ExLlamaV2</strong> — <a href="https://github.com/turboderp-org/exllamav2">github.com/turboderp-org/exllamav2</a>: Highly optimized for RTX GPUs specifically. EXL2 quantization format is more sophisticated than GPTQ or AWQ — per-layer bit allocation instead of uniform quantization.</li>
  <li><strong>MLC-LLM</strong> — <a href="https://github.com/mlc-ai/mlc-llm">github.com/mlc-ai/mlc-llm</a>: Cross-platform (compiles to CUDA, Metal, Vulkan). Good for deploying to diverse hardware.</li>
</ul>

<hr />

<h2 id="16-the-languages-behind-it-all">16. The Languages Behind It All</h2>

<p>If you open the source code of a modern inference engine, here’s what you’ll find:</p>

<p><strong>Python</strong>: The top layer. API server, request handling, scheduling logic, metric collection. vLLM’s scheduler and OpenAI-compatible API are Python. This is also where most bugs live.</p>

<p><strong>CUDA (C++)</strong>: The performance layer. Attention kernels, memory management, GPU operations. Flash Attention is CUDA. PagedAttention’s physical block management is CUDA. If Python is the restaurant, CUDA is the kitchen.</p>

<p><strong>Rust</strong>: The fast utilities layer. HuggingFace’s <code class="language-plaintext highlighter-rouge">tokenizers</code> library is Rust — because tokenizing millions of requests fast matters. NIXL (the KV cache transfer layer in llm-d) has Rust/C++ components. Growing presence in inference tooling.</p>

<p><strong>Go</strong>: The orchestration layer. Kubernetes operators, control plane tooling, health checks, routing logic. If you’re writing infrastructure <em>around</em> inference — operators, routers, schedulers — Go is where the work happens.</p>

<p><strong>C++ (non-CUDA)</strong>: llama.cpp is pure C++ with optional CUDA/Metal backends. TensorRT-LLM has heavy C++ in the engine.</p>

<p><strong>For SREs/DevOps:</strong> You live in Python (scripts, load tests), Go (operators, K8s controllers), and YAML (unfortunately). CUDA is worth being able to <em>read</em> — not write, just understand why a kernel fusion matters and what a grid/block size means. Rust fluency is a genuine differentiator if you want to contribute upstream.</p>

<hr />

<h2 id="17-what-cpu-vs-memory-intensive-means-summarized">17. What CPU vs. Memory Intensive Means, Summarized</h2>

<p>After all of the above, here’s the clean summary of which hardware resource each step stresses:</p>

<table>
  <thead>
    <tr>
      <th>Step</th>
      <th>Hardware</th>
      <th>Why</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Tokenization</td>
      <td>CPU</td>
      <td>Text processing, hash lookups, BPE merges</td>
    </tr>
    <tr>
      <td>Embedding lookup</td>
      <td>GPU memory</td>
      <td>Row lookup in large weight matrix</td>
    </tr>
    <tr>
      <td>Positional encoding</td>
      <td>GPU compute</td>
      <td>Fast arithmetic, small matrix</td>
    </tr>
    <tr>
      <td>Prefill (attention + FFN)</td>
      <td>GPU compute</td>
      <td>Matrix multiplications, all tokens in parallel</td>
    </tr>
    <tr>
      <td>Decode attention</td>
      <td>GPU memory bandwidth</td>
      <td>KV cache read per step, memory-bound</td>
    </tr>
    <tr>
      <td>Decode FFN</td>
      <td>GPU compute</td>
      <td>Weight matrix multiply per step</td>
    </tr>
    <tr>
      <td>KV cache management</td>
      <td>GPU memory</td>
      <td>Allocation, paging, eviction</td>
    </tr>
    <tr>
      <td>Sampling (logit → token)</td>
      <td>GPU compute</td>
      <td>Softmax + sample, fast</td>
    </tr>
  </tbody>
</table>

<p>The pattern: <strong>prefill is compute-bound, decode is memory-bound</strong>. This is the fundamental tension that all inference optimization — P/D disaggregation, speculative decoding, PagedAttention, quantization — is ultimately trying to resolve.</p>

<hr />

<h2 id="18-what-an-sre-actually-needs-to-know">18. What an SRE Actually Needs to Know</h2>

<p>You don’t need to write CUDA. You don’t need to derive the attention formula. But you do need a mental model that lets you answer these questions in production:</p>

<ul>
  <li><strong>Why is TTFT high?</strong> → Prefill bottleneck or queuing. Long prompts? GPU saturated?</li>
  <li><strong>Why is ITL degrading?</strong> → KV cache pressure. Too many concurrent requests. Memory bandwidth saturating.</li>
  <li><strong>Why did the GPU OOM?</strong> → KV cache exhausted. Too many long requests, no eviction headroom.</li>
  <li><strong>Why is throughput low?</strong> → No continuous batching. Poor concurrency config. Batch size too small.</li>
  <li><strong>Why does prefix caching not help?</strong> → System prompt is changing per-request. Fix the app layer.</li>
  <li><strong>Which GPU should I buy?</strong> → For inference: memory bandwidth matters more than FLOP count. H100 &gt; A100 for serving not because of compute but because of HBM3 bandwidth.</li>
  <li><strong>Why is quantization worth it?</strong> → A 4-bit model serves faster (decode is memory-bound, smaller tensors = faster reads) and fits on cheaper hardware. Quality loss is usually acceptable.</li>
</ul>

<p>The mental model in one sentence: <strong>LLM inference is split between a compute-hungry prefill phase and a memory-hungry decode phase, connected by a KV cache that is your most critical resource to manage.</strong></p>

<p>Everything else — PagedAttention, continuous batching, P/D disaggregation, speculative decoding, quantization — is an optimization layered on top of that fundamental structure.</p>

<hr />

<h2 id="the-summary-a-mental-model-for-production">The Summary: A Mental Model for Production</h2>

<p>If you’ve made it this far, you’ve realized that LLM inference isn’t magic; it’s a measurable, optimizable systems engineering challenge. The execution pipeline breaks down into three distinct phases with unique bottlenecks: <strong>Tokenization</strong> (CPU-bound), <strong>Prefill</strong> (GPU compute-bound), and <strong>Decode</strong> (GPU memory-bound).</p>

<p>To help you debug that 2:00 AM latency spike, here is the final synthesis of the mechanics we’ve covered:</p>

<h3 id="the-war-room-reference-table">The “War Room” Reference Table</h3>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Phase</th>
      <th style="text-align: left">Core Mechanism</th>
      <th style="text-align: left">Primary Bottleneck</th>
      <th style="text-align: left">SRE Metric to Watch</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><strong>Ingress</strong></td>
      <td style="text-align: left"><strong>Tokenization</strong></td>
      <td style="text-align: left">CPU / Latency</td>
      <td style="text-align: left">Tokenizer Latency</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Processing</strong></td>
      <td style="text-align: left"><strong>Prefill</strong></td>
      <td style="text-align: left">GPU Compute (FLOPs)</td>
      <td style="text-align: left"><strong>TTFT</strong> (Time to First Token)</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Generation</strong></td>
      <td style="text-align: left"><strong>Decode Loop</strong></td>
      <td style="text-align: left">Memory Bandwidth</td>
      <td style="text-align: left"><strong>ITL/TPOT</strong> (Inter-token Latency)</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>State</strong></td>
      <td style="text-align: left"><strong>KV Cache</strong></td>
      <td style="text-align: left">VRAM Capacity</td>
      <td style="text-align: left">Cache Usage % &amp; Hit Rate</td>
    </tr>
  </tbody>
</table>

<h3 id="the-final-principles-your-tldr">The Final Principles (Your TL;DR)</h3>

<ul>
  <li><strong>Weights are Static Data</strong>: The GGUF or Safetensors file is essentially a massive, math-heavy lookup table. You aren’t executing code; you are performing matrix math against frozen numbers.</li>
  <li><strong>Quantization is a Free Lunch</strong>: Lowering precision (e.g., to INT4) reduces tensor size, which directly loosens the memory bottleneck during decoding, often improving throughput by <strong>20-40%</strong>.</li>
  <li><strong>The KV Cache is Your Most Precious Resource</strong>: It prevents $O(N^2)$ recomputation by storing token state. Managing this via <strong>PagedAttention</strong> and <strong>Prefix Caching</strong> is what separates a toy demo from a production-grade service.</li>
  <li><strong>Attention is the Relationship Engine</strong>: It uses Queries, Keys, and Values to calculate which tokens matter to each other. It’s why the model understands context, but it’s also why memory pressure scales with your prompt length.</li>
  <li><strong>Continuous Batching is the Efficiency Unlock</strong>: By batching at the token level (each decode iteration) rather than the request level, you keep the GPU busy even when individual users have wildly different response lengths.</li>
</ul>

<p>The ultimate constraint in scaling model serving isn’t raw compute power, but memory bandwidth and the strict management of the KV Cache. By understanding the mechanical realities beneath the hood—like PagedAttention, continuous batching, and quantization—infrastructure engineers can move past guesswork and systematically optimize for the metrics that dictate user experience: <strong>TTFT</strong> and <strong>TPOT</strong>.</p>

<h2 id="conclusion-the-field-is-moving-the-fundamentals-arent">Conclusion: The Field Is Moving, The Fundamentals Aren’t</h2>

<p>The implementation details of LLM inference in 2025/2026 are changing fast enough to give you whiplash, but the underlying physics of the problem haven’t moved an inch. Prefill will always be compute-hungry. Decode will always be memory-hungry. Attention will always scale quadratically with context length unless someone breaks the math. These aren’t framework quirks; they are the mechanical realities of how transformers work.</p>

<p>Transitioning to LLMOps requires a fundamental shift in how we manage system state. We are no longer scaling stateless pods; we are actively managing distributed GPU memory. The engineering headroom to optimize this is enormous, and the landscape is shifting rapidly:</p>

<ul>
  <li>
    <p><strong>The model size curve is bending:</strong> Smaller, highly-optimized 7B models are now punching above the weight of older 70B giants, distributing the inference problem across a much wider variety of hardware.</p>
  </li>
  <li>
    <p><strong>The memory bottleneck is softening:</strong> Bleeding-edge KV cache compression techniques are reducing per-token memory footprint from kilobytes down to a handful of bytes, loosening the strict constraints of the decode phase.</p>
  </li>
  <li>
    <p><strong>The edge is becoming viable:</strong> As inference pushes to mobile NPUs and WebGPU, serving a 3B parameter model starts to look less like a Kubernetes workload and more like a firmware binary.</p>
  </li>
</ul>

<p>Yet, the operational reality remains unchanged. When a model is in production and serving real traffic, someone has to know why TTFT spiked at 2am, why the KV cache hit 95% utilization under Tuesday’s load, and why the p99 ITL is three times the p50.</p>

<p>Mastering the mechanics of inference separates a fragile AI prototype from a resilient production platform. The inference stack will keep evolving. The need for a rigorous mental model to debug it won’t.</p>

<hr />

<p><strong>Next up:</strong> In the next post, we’ll take these first principles and see how they dictate the architectural trade-offs behind the major inference engines: Ollama, vLLM, TGI, and TensorRT-LLM.</p>

<hr />

<p><em>I’m an infrastructure engineer with 11+ years in distributed systems (D-Wave, Enbala, MasterCard, Cisco), currently going deep on LLM serving and inference optimization. This series is grounded in hands-on experiments — Mac Mini to Lambda Labs GH200 to RunPod A100 clusters. I write what I actually learned, including the parts that didn’t work.</em></p>

<p><em>Find me on GitHub: <a href="https://github.com/kraghavan">kraghavan</a></em></p>

<p><em>Find me on Linkedin: <a href="https://linkedin.com/in/karthikaraghavan">Karthika Raghavan</a></em></p>]]></content><author><name>Karthika Raghavan</name></author><category term="llm-infrastructure" /><category term="inference" /><category term="vllm" /><category term="inference" /><category term="ttft" /><category term="tpot" /><category term="kv-cache" /><category term="attention" /><category term="tokenization" /><category term="sre" /><category term="transformers" /><summary type="html"><![CDATA[An Engineer's annotated tour through what actually happens when you hit send — from bytes to tokens to embeddings to attention to the word your model finally spits out. No skipped steps. No "and then magic happens."]]></summary></entry><entry><title type="html">Schema Travels Architecture</title><link href="https://kraghavan.github.io/2026/04/06/schema-travels-architecture.html" rel="alternate" type="text/html" title="Schema Travels Architecture" /><published>2026-04-06T00:00:00+00:00</published><updated>2026-04-06T00:00:00+00:00</updated><id>https://kraghavan.github.io/2026/04/06/schema-travels-architecture</id><content type="html" xml:base="https://kraghavan.github.io/2026/04/06/schema-travels-architecture.html"><![CDATA[<h1 id="translating-sql-to-nosql-architecture-deep-dive">Translating SQL to NoSQL: Architecture Deep-Dive</h1>

<p><em>Part 1 of 2: Design decisions, trade-offs, and algorithms behind schema-travels</em></p>

<hr />

<h2 id="why-i-built-this">Why I Built This</h2>

<p>Migrating from a relational SQL database to a NoSQL paradigm is notoriously difficult to automate. Every DBA has war stories: the naive migration that turned a 3-table JOIN into three round trips, or the “just flatten everything” approach that created 50GB documents.</p>

<p>The problem I chose: <strong>How do you automate a context-aware SQL-to-NoSQL schema migration without relying on raw, hallucination-prone LLM outputs?</strong></p>

<p>The key insight: algorithms are excellent at graph clustering; LLMs are not. But algorithms lack business context. So I built a dual-engine system where deterministic algorithms do the math, and an LLM acts as a reviewing Principal Architect—bounded, structured, and unable to hallucinate schemas into existence.</p>

<p>This post walks through every architectural decision in <a href="https://github.com/kraghavan/schema-travels">schema-travels</a>, including the trade-offs I considered and the bugs that taught me humility.</p>

<hr />

<h2 id="system-overview">System Overview</h2>

<h3 id="architecture-diagram-generated-by-notebooklm">Architecture Diagram (generated by NotebookLM)</h3>
<p><img src="/assets/images/schema-travels/schema-travels-architecture.png" alt="Schema Travels Architecture" />
<em>The Schema Travels architecture: multi-provider AI review with specialized MongoDB and DynamoDB migration flows</em></p>

<h3 id="components">Components</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌──────────────────────────────────────────────────────────────────────────────┐
│                              schema-travels                                  │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐  │
│  │                        Deterministic Engine                            │  │
│  │                                                                        │  │
│  │  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                 │  │
│  │  │  SQL Log    │    │  Access     │    │  Target     │                 │  │
│  │  │  Parser     │───▶│  Pattern    │───▶│  Schema     │                 │  │
│  │  │             │    │  Analyzer   │    │  Designer   │                 │  │
│  │  │ • PostgreSQL│    │             │    │             │                 │  │
│  │  │ • MySQL     │    │ • HotJoins  │    │ • MongoDB   │                 │  │
│  │  │ • 10K+ qps  │    │ • Mutations │    │ • DynamoDB  │                 │  │
│  │  └─────────────┘    │ • Co-access │    │ • Union-Find│                 │  │
│  │                     └─────────────┘    └──────┬──────┘                 │  │
│  └───────────────────────────────────────────────┼────────────────────────┘  │
│                                                  │                           │
│                                    Algorithmic Draft (JSON)                  │
│                                                  │                           │
│  ┌───────────────────────────────────────────────▼────────────────────────┐  │
│  │                         LLM Review Layer                               │  │
│  │                                                                        │  │
│  │  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                 │  │
│  │  │  Provider   │    │  Advisor    │    │  Pydantic   │                 │  │
│  │  │  Protocol   │───▶│  (Reviewer) │───▶│  Validator  │                 │  │
│  │  │             │    │             │    │             │                 │  │
│  │  │ • Claude    │    │ • Critique  │    │ • Bounded   │                 │  │
│  │  │ • GPT-4o    │    │ • Refine    │    │ • Typed     │                 │  │
│  │  │ • Gemini    │    │ • Explain   │    │ • No schema │                 │  │
│  │  │ • Ollama    │    │             │    │   invention │                 │  │
│  │  └─────────────┘    └─────────────┘    └──────┬──────┘                 │  │
│  └───────────────────────────────────────────────┼────────────────────────┘  │
│                                                  │                           │
│                                    Reviewed Design (JSON)                    │
│                                                  │                           │
│  ┌───────────────────────────────────────────────▼────────────────────────┐  │
│  │                         Output &amp; Caching                               │  │
│  │                                                                        │  │
│  │  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                 │  │
│  │  │  Cache      │    │  Terraform  │    │  NoSQL      │                 │  │
│  │  │  Manager    │    │  Generator  │    │  Workbench  │                 │  │
│  │  │             │    │             │    │  Export     │                 │  │
│  │  │ • Strict    │    │ • .tf files │    │             │                 │  │
│  │  │ • Relaxed   │    │ • GSI defs  │    │ • Import    │                 │  │
│  │  │ • ~3s hits  │    │ • Capacity  │    │   ready     │                 │  │
│  │  └─────────────┘    └─────────────┘    └─────────────┘                 │  │
│  └────────────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────────────┘
</code></pre></div></div>

<hr />

<h2 id="component-1-the-statistical-ground-truth">Component 1: The Statistical Ground Truth</h2>

<h3 id="the-problem">The Problem</h3>

<p>You can design a relational database in a vacuum using normal forms. You <em>cannot</em> design a NoSQL database without knowing the access patterns.</p>

<p>Feeding an LLM just the <code class="language-plaintext highlighter-rouge">CREATE TABLE</code> statements produces generic, unoptimized schemas. It’s like asking an architect to design a house without knowing if it’s for a family of four or a fraternity.</p>

<h3 id="my-solution-log-based-pattern-analysis">My Solution: Log-Based Pattern Analysis</h3>

<p>Before any AI is invoked, the pipeline ingests thousands of raw SQL queries (up to 10K in my benchmarks). Two analyzers extract the structural signals:</p>

<table>
  <thead>
    <tr>
      <th>Analyzer</th>
      <th>What It Measures</th>
      <th>Why It Matters</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>HotJoinAnalyzer</strong></td>
      <td>Co-access frequency between tables</td>
      <td>If <code class="language-plaintext highlighter-rouge">orders</code> and <code class="language-plaintext highlighter-rouge">order_items</code> are JOINed in 95% of queries, they belong together</td>
    </tr>
    <tr>
      <td><strong>MutationAnalyzer</strong></td>
      <td>Write patterns per table</td>
      <td>High-mutation tables need different caching strategies</td>
    </tr>
  </tbody>
</table>

<p>The output is a weighted co-access matrix—essentially a graph where edge weights represent how often two tables appear together in queries.</p>

<h3 id="design-decision-static-schema-vs-log-perusal">Design Decision: Static Schema vs. Log Perusal</h3>

<p><strong>Trade-off considered:</strong></p>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>Speed</th>
      <th>Accuracy</th>
      <th>Limitations</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Static schema only</td>
      <td>&lt;1s</td>
      <td>~60%</td>
      <td>Misses actual usage patterns</td>
    </tr>
    <tr>
      <td>Log perusal</td>
      <td>5-30s</td>
      <td>~90%</td>
      <td>Requires query logs</td>
    </tr>
  </tbody>
</table>

<p><strong>My choice:</strong> Log perusal is non-negotiable. A schema where <code class="language-plaintext highlighter-rouge">users</code> and <code class="language-plaintext highlighter-rouge">user_preferences</code> are separate tables tells you nothing about whether they’re accessed together. Query logs tell you everything.</p>

<p><strong>Lesson learned:</strong> Early versions didn’t weight by query frequency. A table JOINed once in a rare admin query got the same weight as one JOINed 10,000 times per hour. Adding frequency weighting dramatically improved recommendations.</p>
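<p>For illustration, here’s a minimal sketch of that frequency weighting. This is not the real analyzer: the <code class="language-plaintext highlighter-rouge">(tables, times_seen)</code> log shape and the normalization are my illustrative assumptions.</p>

```python
from collections import Counter
from itertools import combinations

def build_coaccess_matrix(query_log):
    """Weight each table pair by how often the pair appears across the log.

    query_log: iterable of (tables_in_query, times_seen) pairs -- an assumed
    shape, not the real parser's output.
    """
    weights = Counter()
    for tables, times_seen in query_log:
        for pair in combinations(sorted(set(tables)), 2):
            weights[pair] += times_seen  # frequency weighting, not mere presence
    total = sum(weights.values()) or 1
    return {pair: count / total for pair, count in weights.items()}

log = [
    (["orders", "order_items"], 10_000),  # hot JOIN, runs constantly
    (["users", "orders"], 6_000),
    (["products", "reviews"], 40),        # rare admin query
]
matrix = build_coaccess_matrix(log)
```

<p>Without the <code class="language-plaintext highlighter-rouge">times_seen</code> multiplier, the rare admin JOIN would score the same as the hot path, which is exactly the early-version bug described above.</p>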

<hr />

<h2 id="component-2-mongodb-document-designer">Component 2: MongoDB Document Designer</h2>

<h3 id="the-problem-1">The Problem</h3>

<p>MongoDB’s core design question: <strong>Embed or Reference?</strong></p>

<table>
  <thead>
    <tr>
      <th>Strategy</th>
      <th>When to Use</th>
      <th>Risk</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Embed</strong></td>
      <td>High co-access, bounded growth</td>
      <td>Document bloat (16MB limit)</td>
    </tr>
    <tr>
      <td><strong>Reference</strong></td>
      <td>Independent access, unbounded growth</td>
      <td>N+1 query patterns</td>
    </tr>
  </tbody>
</table>

<p>Getting this wrong is expensive. Embed a million reviews inside a product document and MongoDB will hate you. Reference user addresses that are always fetched with the user and you’ve recreated SQL’s JOIN problem.</p>

<h3 id="my-solution-confidence-weighted-decisions">My Solution: Confidence-Weighted Decisions</h3>

<p>The algorithm takes the co-access score from Component 1 and calculates a <strong>Confidence Score</strong> for each relationship:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Confidence = (co_access_weight × access_factor) + (bounded_growth × stability_factor)
</code></pre></div></div>

<ul>
  <li><strong>≥ 0.85</strong>: Strong embed recommendation (green in output)</li>
  <li><strong>0.70-0.85</strong>: Moderate confidence (yellow)</li>
  <li><strong>&lt; 0.70</strong>: Reference or needs human review (red)</li>
</ul>

<p>The color-coding isn’t cosmetic—it tells developers where to focus their review time.</p>
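<p>As a sketch, the formula and thresholds translate directly into code. The 0.7/0.3 factor split below is my illustrative assumption, not the tool’s actual weighting.</p>

```python
def embed_confidence(co_access_weight, bounded_growth,
                     access_factor=0.7, stability_factor=0.3):
    """Combine co-access and growth-stability signals into one embed score.

    The default factor values are illustrative assumptions.
    """
    return co_access_weight * access_factor + bounded_growth * stability_factor

def classify(score):
    if score >= 0.85:
        return "embed"      # green: strong embed recommendation
    if score >= 0.70:
        return "moderate"   # yellow: moderate confidence, review suggested
    return "reference"      # red: reference or needs human review

classify(embed_confidence(0.95, 1.0))  # high co-access, bounded growth -> embed
classify(embed_confidence(0.90, 0.0))  # unbounded growth drags it to reference
```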

<h3 id="design-decision-schema-only-vs-schema--queries">Design Decision: Schema Only vs. Schema + Queries</h3>

<p><strong>Trade-off considered:</strong></p>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>Output</th>
      <th>Developer Effort</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Schema only</td>
      <td>JSON structure</td>
      <td>Developer writes queries</td>
    </tr>
    <tr>
      <td>Schema + queries</td>
      <td>Structure + aggregation pipelines</td>
      <td>Copy-paste ready</td>
    </tr>
  </tbody>
</table>

<p><strong>My choice:</strong> Generate both. For MongoDB, <code class="language-plaintext highlighter-rouge">schema-travels</code> outputs the exact aggregation pipelines or <code class="language-plaintext highlighter-rouge">findOne</code> queries that replace the original SQL JOINs.</p>

<p><strong>Why this matters:</strong> A schema migration isn’t done when you have a new data model. It’s done when your application code works. Giving developers the target schema <em>and</em> the code to query it cuts migration time significantly.</p>
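<p>To make that concrete, here is the kind of output I mean, sketched by hand rather than copied from the tool. Collection and field names are assumptions; the query shapes follow standard MongoDB syntax.</p>

```python
def reviews_lookup_pipeline(product_id):
    """SQL 'products JOIN reviews ON reviews.product_id = products.id'
    as an aggregation pipeline, for the REFERENCE strategy."""
    return [
        {"$match": {"_id": product_id}},
        {"$lookup": {
            "from": "reviews",            # referenced collection
            "localField": "_id",
            "foreignField": "product_id",
            "as": "reviews",
        }},
    ]

def order_filter(order_id):
    """For the EMBED strategy there is no JOIN left at all: one findOne
    filter returns the order together with its embedded items array."""
    return {"_id": order_id}
```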

<hr />

<h2 id="component-3-dynamodb-paradigm-shift">Component 3: DynamoDB Paradigm Shift</h2>

<h3 id="the-problem-2">The Problem</h3>

<p>DynamoDB isn’t just “MongoDB but AWS.” It’s a fundamentally different paradigm:</p>

<table>
  <thead>
    <tr>
      <th>MongoDB</th>
      <th>DynamoDB</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Flexible queries</td>
      <td>Pre-defined access patterns</td>
    </tr>
    <tr>
      <td>Indexes added later</td>
      <td>GSIs designed upfront</td>
    </tr>
    <tr>
      <td>Nested documents</td>
      <td>Flat, wide-column design</td>
    </tr>
    <tr>
      <td>Query anything</td>
      <td>Query only what you planned for</td>
    </tr>
  </tbody>
</table>

<p>The infamous “Single-Table Design” pattern—where multiple entity types share one table with carefully crafted partition and sort keys—is powerful but alien to SQL developers.</p>

<h3 id="my-solution-union-find-access-clustering">My Solution: Union-Find Access Clustering</h3>

<p>This is where the computer science degree earns its keep. The <code class="language-plaintext highlighter-rouge">DynamoDBDesigner</code> uses a <strong>Union-Find (Disjoint Set Union)</strong> algorithm to cluster SQL tables based on their relationship weights.</p>

<p><strong>How it works:</strong></p>

<ol>
  <li>Each SQL table starts as its own cluster</li>
  <li>For each edge (co-access relationship) above a threshold weight:
    <ul>
      <li>Find the cluster roots of both tables</li>
      <li>Union them if they’re frequently accessed together</li>
    </ul>
  </li>
  <li>Result: clusters of tables that should share a DynamoDB table</li>
</ol>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Input: users, orders, order_items, products, reviews
       
Co-access weights:
  users ←→ orders:      0.92  (high)
  orders ←→ order_items: 0.95  (high)
  products ←→ reviews:   0.45  (low)
  
After Union-Find:
  Cluster 1: {users, orders, order_items}  → Single-table candidate
  Cluster 2: {products}                     → Separate table
  Cluster 3: {reviews}                      → Separate table
</code></pre></div></div>

<p>The algorithm then generates PK/SK patterns and GSI candidates for each cluster.</p>
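<p>The clustering step is small enough to sketch whole. This is a generic Union-Find with path compression run over the worked example above, assuming a 0.7 merge threshold; the real designer’s threshold and tie-breaking may differ.</p>

```python
def cluster_tables(tables, edges, threshold=0.7):
    """Merge tables whose co-access weight meets the threshold (Union-Find)."""
    parent = {t: t for t in tables}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path compression
            t = parent[t]
        return t

    for (a, b), weight in edges.items():
        if weight >= threshold:
            parent[find(a)] = find(b)  # union the two clusters

    clusters = {}
    for t in tables:
        clusters.setdefault(find(t), set()).add(t)
    return list(clusters.values())

tables = ["users", "orders", "order_items", "products", "reviews"]
edges = {("users", "orders"): 0.92,
         ("orders", "order_items"): 0.95,
         ("products", "reviews"): 0.45}
cluster_tables(tables, edges)
```

<p>With the 0.45 products–reviews edge below threshold, this reproduces the three clusters from the diagram: one single-table candidate and two standalone tables.</p>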

<h3 id="design-decision-llm-generation-vs-algorithmic-draft">Design Decision: LLM Generation vs. Algorithmic Draft</h3>

<p><strong>Trade-off considered:</strong></p>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>Consistency</th>
      <th>Quality</th>
      <th>Hallucination Risk</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>LLM generates schema</td>
      <td>Low</td>
      <td>Variable</td>
      <td>High</td>
    </tr>
    <tr>
      <td>Algorithm generates, LLM reviews</td>
      <td>High</td>
      <td>Consistent</td>
      <td>Low</td>
    </tr>
  </tbody>
</table>

<p><strong>My choice:</strong> Algorithms generate the draft; LLMs review it.</p>

<p>Union-Find is deterministic—same input always produces the same clusters. LLMs are not. Asking an LLM to “design a DynamoDB schema” is asking for creative writing. Asking it to “review this algorithmic draft and flag issues” is asking for structured critique.</p>

<p><strong>Lesson learned:</strong> Early versions let the LLM suggest entity renames during review. This broke the mapping back to the algorithmic draft, causing cryptic “Entity Not Found” errors. Now the validator explicitly forbids schema invention.</p>

<hr />

<h2 id="component-4-the-multi-provider-llm-layer">Component 4: The Multi-Provider LLM Layer</h2>

<h3 id="the-problem-3">The Problem</h3>

<p>I needed LLM review capabilities but didn’t want to be locked into one provider. Different providers have different strengths, costs, and availability.</p>

<h3 id="my-solution-protocol-oriented-provider-abstraction">My Solution: Protocol-Oriented Provider Abstraction</h3>

<p>The <code class="language-plaintext highlighter-rouge">LLMProvider</code> protocol defines what any provider must support:</p>

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th>Purpose</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">complete()</code></td>
      <td>Generate a response</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">supports_json_mode</code></td>
      <td>Whether native JSON mode is available</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">name</code>, <code class="language-plaintext highlighter-rouge">model</code></td>
      <td>Provider identification</td>
    </tr>
  </tbody>
</table>

<p>Any class implementing this protocol works with the Advisor:</p>

<table>
  <thead>
    <tr>
      <th>Provider</th>
      <th>Default Model</th>
      <th>Cost</th>
      <th>Best For</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Claude</strong></td>
      <td>claude-sonnet-4</td>
      <td>~$3/1M</td>
      <td>Complex reasoning</td>
    </tr>
    <tr>
      <td><strong>OpenAI</strong></td>
      <td>gpt-4o-mini</td>
      <td>~$0.15/1M</td>
      <td>Cost-effective structured output</td>
    </tr>
    <tr>
      <td><strong>Gemini</strong></td>
      <td>gemini-2.0-flash</td>
      <td>~$0.10/1M</td>
      <td>Speed</td>
    </tr>
    <tr>
      <td><strong>Ollama</strong></td>
      <td>llama3.1:8b</td>
      <td>Free</td>
      <td>Privacy, offline</td>
    </tr>
  </tbody>
</table>
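<p>In Python, the protocol table maps naturally onto <code class="language-plaintext highlighter-rouge">typing.Protocol</code>. This is a paraphrase of the shape with a stubbed provider, not the project’s actual classes.</p>

```python
from typing import Protocol

class LLMProvider(Protocol):
    name: str
    model: str
    supports_json_mode: bool

    def complete(self, prompt: str) -> str:
        """Generate a response for the given prompt."""
        ...

class StubOllamaProvider:
    """Toy implementation; a real one would call the local Ollama HTTP API."""
    name = "ollama"
    model = "llama3.1:8b"
    supports_json_mode = False

    def complete(self, prompt: str) -> str:
        return '{"critique": "stub response"}'

def run_review(provider: LLMProvider, draft_json: str) -> str:
    # The Advisor depends only on the protocol, never a concrete provider type.
    return provider.complete(f"Review this algorithmic draft: {draft_json}")

run_review(StubOllamaProvider(), '{"clusters": []}')
```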

<h3 id="design-decision-agentic-generation-vs-agentic-review">Design Decision: Agentic Generation vs. Agentic Review</h3>

<p><strong>Trade-off considered:</strong></p>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>LLM Role</th>
      <th>Control</th>
      <th>Consistency</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Agentic generation</td>
      <td>Creator</td>
      <td>Low</td>
      <td>Low</td>
    </tr>
    <tr>
      <td>Agentic review</td>
      <td>Critic</td>
      <td>High</td>
      <td>High</td>
    </tr>
  </tbody>
</table>

<p><strong>My choice:</strong> The LLM is a <em>reviewer</em>, not a <em>creator</em>.</p>

<p>The prompt never says “design a schema.” It says “here is the algorithmic draft—critique it.” The LLM can flag issues:</p>

<blockquote>
  <p>“The algorithm suggested Single-Table Design here, but this Partition Key has low cardinality and will cause a hot partition. Consider adding a write-sharding suffix.”</p>
</blockquote>

<p>But it cannot invent new tables or fundamentally restructure the design. Pydantic validators enforce this boundary.</p>
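<p>A minimal sketch of that boundary, assuming Pydantic v2 and a toy model shape (the project’s real models are richer):</p>

```python
from pydantic import BaseModel, field_validator

ALLOWED_ENTITIES = {"users", "orders", "order_items"}  # from the algorithmic draft

class EntityReview(BaseModel):
    entity: str
    critique: str

    @field_validator("entity")
    @classmethod
    def reject_invented_entities(cls, value: str) -> str:
        # The LLM may critique entities, but never invent or rename them.
        if value not in ALLOWED_ENTITIES:
            raise ValueError(f"unknown entity {value!r}: not in algorithmic draft")
        return value

EntityReview(entity="orders", critique="PK cardinality looks low here")
```

<p>An LLM response that renames <code class="language-plaintext highlighter-rouge">orders</code> to <code class="language-plaintext highlighter-rouge">CustomerOrders</code> now fails validation loudly instead of silently breaking the mapping back to the draft.</p>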

<hr />

<h2 id="component-5-caching--infrastructure-output">Component 5: Caching &amp; Infrastructure Output</h2>

<h3 id="the-problem-4">The Problem</h3>

<p>LLM calls are slow (2-30 seconds) and expensive. Re-analyzing the same schema during CI/CD is wasteful.</p>

<h3 id="my-solution-dual-mode-hashing">My Solution: Dual-Mode Hashing</h3>

<table>
  <thead>
    <tr>
      <th>Mode</th>
      <th>Hash Includes</th>
      <th>Use Case</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Relaxed</strong></td>
      <td>Schema structure, log patterns</td>
      <td>Development iteration</td>
    </tr>
    <tr>
      <td><strong>Strict</strong></td>
      <td>Full payload including metadata</td>
      <td>Production, testing</td>
    </tr>
  </tbody>
</table>

<p><strong>Relaxed mode</strong> hashes only the structural elements. Minor metadata changes hit the cache, dropping a 25-second pipeline to ~3 seconds.</p>

<p><strong>Strict mode</strong> requires cryptographic match of the entire input. Essential for reproducible E2E testing.</p>
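<p>The dual-mode split is easy to sketch. The payload fields and the SHA-256 choice here are illustrative assumptions, not the actual cache manager’s internals.</p>

```python
import hashlib
import json

def cache_key(schema, log_patterns, metadata, mode="relaxed"):
    """Relaxed hashes only structural inputs; strict hashes the full payload."""
    payload = {"schema": schema, "patterns": log_patterns}
    if mode == "strict":
        payload["metadata"] = metadata  # timestamps, run ids, target mode, ...
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

schema = {"users": ["id", "email"]}
patterns = {"hot_joins": [["users", "orders"]]}

a = cache_key(schema, patterns, {"run_id": 1})
b = cache_key(schema, patterns, {"run_id": 2})
# relaxed: a metadata-only change still hits the cache (a == b)
```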

<h3 id="closing-the-loop-infrastructure-as-code">Closing the Loop: Infrastructure as Code</h3>

<p>A schema design on paper is useless if it can’t be deployed. The pipeline terminates by generating actual IaC:</p>

<table>
  <thead>
    <tr>
      <th>Output</th>
      <th>Format</th>
      <th>Use</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Terraform</strong></td>
      <td><code class="language-plaintext highlighter-rouge">.tf</code></td>
      <td><code class="language-plaintext highlighter-rouge">terraform apply</code> ready</td>
    </tr>
    <tr>
      <td><strong>NoSQL Workbench</strong></td>
      <td><code class="language-plaintext highlighter-rouge">.json</code></td>
      <td>Visual modeling, data import</td>
    </tr>
  </tbody>
</table>

<p>The Terraform output includes table definitions, GSI configurations, and capacity settings. Not a template—actual runnable infrastructure.</p>

<p><strong>Lesson learned:</strong> My Terraform formatter had a subtle bug—it generated <code class="language-plaintext highlighter-rouge">$${table_name}</code> instead of <code class="language-plaintext highlighter-rouge">"${table_name}"</code> in some edge cases. The E2E test matrix caught this because it validated the <code class="language-plaintext highlighter-rouge">.tf</code> files could be parsed by Terraform’s HCL parser.</p>
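<p>A tiny check in the spirit of that E2E validation. The regex and naming are mine, and parsing the output with Terraform’s real HCL parser is strictly better when available.</p>

```python
import re

# HCL treats '$${...}' as escaped literal text, not a variable interpolation.
ESCAPED_INTERPOLATION = re.compile(r"\$\$\{[^}]*\}")

def find_escaped_interpolations(tf_text: str):
    """Flag '$${...}' sequences that would render literally instead of resolving."""
    return ESCAPED_INTERPOLATION.findall(tf_text)

good = 'resource "aws_dynamodb_table" "t" { name = "${var.table_name}" }'
bad = 'resource "aws_dynamodb_table" "t" { name = "$${table_name}" }'

find_escaped_interpolations(bad)  # catches the formatter bug described above
```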

<hr />

<h2 id="what-id-do-differently">What I’d Do Differently</h2>

<h3 id="1-dynamodb-query-translation-from-day-one">1. DynamoDB Query Translation from Day One</h3>

<p>The MongoDB module translates SQL JOINs to aggregation pipelines. The DynamoDB module only outputs schemas—no Boto3 query code. This asymmetry bothers me.</p>

<p>Generating <code class="language-plaintext highlighter-rouge">dynamodb.query(KeyConditionExpression=...)</code> calls alongside the schema would make the DynamoDB path as developer-friendly as MongoDB.</p>
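<p>Here’s roughly what that generated code could look like, using the low-level DynamoDB Query parameter shape. The PK/SK conventions and table name are my assumptions about a single-table layout, not the tool’s output.</p>

```python
def user_orders_query_params(table_name: str, user_id: str) -> dict:
    """'SELECT ... FROM orders JOIN order_items WHERE user_id = ?' becomes one
    Query over the USER#<id> item collection in a single-table design."""
    return {
        "TableName": table_name,
        "KeyConditionExpression": "PK = :pk AND begins_with(SK, :sk)",
        "ExpressionAttributeValues": {
            ":pk": {"S": f"USER#{user_id}"},
            ":sk": {"S": "ORDER#"},
        },
    }

params = user_orders_query_params("app_main", "42")
# pass as: boto3.client("dynamodb").query(**params)
```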

<h3 id="2-stricter-entity-name-validation">2. Stricter Entity Name Validation</h3>

<p>The LLM occasionally renames entities during review (“I’ll call this <code class="language-plaintext highlighter-rouge">CustomerOrders</code> instead of <code class="language-plaintext highlighter-rouge">user_orders</code>”). This breaks the mapping back to the algorithmic draft.</p>

<p>The fix was a Pydantic validator that rejects any entity name not in the original input. But I should have anticipated this—LLMs love to be “helpful” by renaming things.</p>

<h3 id="3-cache-key-design-up-front">3. Cache Key Design Up Front</h3>

<p>I added the <code class="language-plaintext highlighter-rouge">dynamodb_mode</code> parameter to cache keys late. This meant cached results from <code class="language-plaintext highlighter-rouge">auto</code> mode were incorrectly served for <code class="language-plaintext highlighter-rouge">single</code> mode requests.</p>

<p>Cache key design should happen during architecture, not debugging.</p>

<hr />

<h2 id="the-proof-what-actually-happened">The Proof: What Actually Happened</h2>

<p>Let me show you what this architecture produces in practice. I ran the E2E test matrix across 4 providers × 2 targets on a 42-table e-commerce schema with 100 synthetic queries:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────────┬──────────────┬──────────────┐
│  Provider   │   MongoDB    │   DynamoDB   │
├─────────────┼──────────────┼──────────────┤
│ claude      │ ✓ PASS       │ ✓ PASS       │
│ openai      │ ✓ PASS       │ ✓ PASS       │
│ gemini      │ ✓ PASS       │ ✓ PASS       │
│ ollama      │ ✓ PASS       │ ✓ PASS       │
└─────────────┴──────────────┴──────────────┘
Total: 8 tests | Passed: 8 | Failed: 0
</code></pre></div></div>

<p><strong>What the numbers tell us:</strong></p>

<table>
  <thead>
    <tr>
      <th>Provider</th>
      <th>DynamoDB Design</th>
      <th>Confidence</th>
      <th>GSIs Generated</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Claude</strong></td>
      <td>single_table</td>
      <td>85%</td>
      <td>4 (overloaded)</td>
    </tr>
    <tr>
      <td><strong>OpenAI</strong></td>
      <td>single_table</td>
      <td>77.5%</td>
      <td>5 (descriptive names)</td>
    </tr>
    <tr>
      <td><strong>Gemini</strong></td>
      <td>single_table</td>
      <td>75%</td>
      <td>5 (direct attributes)</td>
    </tr>
    <tr>
      <td><strong>Ollama</strong> (local)</td>
      <td>multi_table</td>
      <td>80%</td>
      <td>per-table</td>
    </tr>
  </tbody>
</table>

<p>Here’s the insight that makes this architecture work: <strong>the algorithmic clusters are identical across all providers</strong>. Products had 82 accesses. Users had 54. Orders had 32. The Union-Find algorithm doesn’t care which LLM you’re using—it produces the same deterministic foundation every time.</p>

<p>But the <em>interpretation</em> differs. Three cloud providers looked at the same clusters and said “single-table design with GSI overloading.” The local model (gemma3:4b running on my Mac Mini) said “multi-table is fine here.”</p>

<p>Just like every architecture review I’ve ever been in—except this one finished in 22 seconds and nobody rage-quit to “work from home.”</p>

<p>Who’s right? Honestly, both approaches are defensible for this workload. The point isn’t that one answer is correct—it’s that the AI is <em>reviewing</em> a solid algorithmic foundation, not hallucinating schemas from scratch.</p>

<p><strong>MongoDB showed similar patterns:</strong></p>

<p>The Claude review of MongoDB produced 12 recommendations with confidence-scored decisions:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">orders → order_items</code>: <strong>EMBED</strong> (95% confidence) — “Perfect co-access, always accessed together”</li>
  <li><code class="language-plaintext highlighter-rouge">products → reviews</code>: <strong>REFERENCE</strong> (90% confidence) — “Unbounded growth, popular products can have thousands of reviews”</li>
  <li><code class="language-plaintext highlighter-rouge">users → addresses</code>: <strong>EMBED</strong> (85% confidence) — “High co-access, addresses typically accessed with user profile”</li>
</ul>

<p>That 90% confidence REFERENCE decision for reviews? That’s exactly the kind of nuanced judgment that makes NoSQL design hard. The algorithm detected high co-access, but the AI correctly identified the unbounded growth risk that would blow past MongoDB’s 16MB document limit.</p>

<p>More than 40 tables migrated, zero DBAs traumatized. Though if your schema has more foreign keys than a hotel concierge, maybe grab coffee first.</p>

<hr />

<h2 id="the-bigger-picture-what-this-actually-solves">The Bigger Picture: What This Actually Solves</h2>

<p><strong>The Old Way:</strong> A senior engineer spends 2-3 weeks analyzing query logs, drawing ER diagrams on whiteboards, debating embed-vs-reference decisions in meetings, and manually translating that into DynamoDB access patterns. Then they write Terraform by hand and pray they didn’t miss a GSI.</p>

<p><strong>The New Way:</strong> Feed the tool your PostgreSQL logs and schema. Get a reviewed, confidence-scored design with deployable Terraform in 22 seconds. Spend those 2-3 weeks on the parts that actually require human judgment—data migration strategy, application refactoring, rollback planning.</p>

<p>This isn’t about replacing engineers. It’s about replacing the <em>tedious parts</em> of engineering so we can focus on the <em>interesting parts</em>.</p>

<h3 id="the-core-innovation-bounded-ai">The Core Innovation: Bounded AI</h3>

<p>The fundamental insight behind this architecture is that <strong>LLMs should critique, not create</strong>.</p>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>What Can Go Wrong</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>“LLM, design me a schema”</td>
      <td>Hallucinated tables, inconsistent naming, unbounded creativity</td>
    </tr>
    <tr>
      <td>“LLM, review this algorithmic draft”</td>
      <td>Bounded feedback, structured output, deterministic baseline</td>
    </tr>
  </tbody>
</table>

<p>By forcing the AI into a reviewer role with Pydantic-enforced boundaries, we get the benefits of LLM intelligence (business context, heuristic reasoning, natural language explanations) without the chaos of unbounded generation.</p>

<p>This pattern—<strong>algorithmic draft + bounded LLM review</strong>—is applicable far beyond schema migration. It’s how I’d approach any problem where you need AI assistance but can’t afford hallucinations.</p>

<hr />

<h2 id="whats-next-graphrag-for-enterprise-scale">What’s Next: GraphRAG for Enterprise Scale</h2>

<p>The current implementation handles schemas with 10-50 tables comfortably. But what happens when you’re migrating a legacy enterprise system with 500 tables? 1,000?</p>

<p>At that scale, the co-access matrix becomes unwieldy and the visualization becomes spaghetti. Union-Find itself stays near-linear, but the number of candidate table pairs grows quadratically, and so does the human review burden.</p>

<p><strong>The roadmap:</strong> GraphRAG-powered schema analysis.</p>

<p>The idea is straightforward:</p>
<ol>
  <li><strong>Store the schema graph</strong> in a proper graph database (NetworkX + SQLite for now, Neo4j for enterprise)</li>
  <li><strong>Embed table semantics</strong> using sentence transformers</li>
  <li><strong>Query with natural language</strong>: “Show me all tables related to order fulfillment” or “Which clusters would be affected if we split the users table?”</li>
</ol>

<p>This transforms schema-travels from a migration tool into a <strong>schema intelligence platform</strong>. Instead of processing everything at once, you can explore the graph, ask questions, and migrate incrementally—cluster by cluster, bounded by what your team can absorb.</p>
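<p>Under the hood, step 3’s “show me related tables” query is graph traversal. A hedged BFS sketch over a toy adjacency map (the real roadmap layers semantic embeddings on top of this):</p>

```python
from collections import deque

def related_tables(adjacency, start, max_hops=2):
    """Tables reachable from `start` within max_hops co-access edges (BFS)."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # don't expand past the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return seen - {start}

graph = {
    "users": ["orders"],
    "orders": ["users", "order_items"],
    "order_items": ["orders"],
    "products": ["reviews"],
    "reviews": ["products"],
}
related_tables(graph, "users")
```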

<hr />

<h2 id="coming-up-part-2">Coming Up: Part 2</h2>

<p>In Part 2, we’ll dive deeper into the E2E test results:</p>

<ul>
  <li><strong>Latency breakdown</strong>: 22 seconds for first run → 3 seconds cached</li>
  <li><strong>Provider comparison</strong>: Why Claude, OpenAI, and Gemini agreed on single-table while Ollama chose multi-table</li>
  <li><strong>The cache drift bug</strong>: How the test matrix caught an inconsistency in DynamoDB Terraform output</li>
  <li><strong>Cost analysis</strong>: Running the full matrix cost less than $0.50 in API calls</li>
</ul>

<hr />

<h2 id="final-thoughts">Final Thoughts</h2>

<p>I’ve been building distributed systems for over a decade. In that time, I’ve seen plenty of “AI-powered” tools that are really just prompt wrappers—impressive demos that fall apart the moment you need reproducibility.</p>

<p><code class="language-plaintext highlighter-rouge">schema-travels</code> is my answer to the question: <em>How do you build AI-assisted tooling that a principal engineer would actually sign off on?</em></p>

<p>The answer, it turns out, is the same principle that makes distributed systems reliable: <strong>don’t trust any single component</strong>. Algorithms provide the deterministic foundation. LLMs provide the intelligence. Pydantic provides the guardrails. Caching provides the economics. And an 8-test E2E matrix provides the confidence that it all actually works.</p>

<p>The schema migration problem was just the vehicle. The real artifact is an architecture pattern for building AI tools that are auditable, reproducible, and won’t make your on-call engineer cry at 3am.</p>

<hr />

<p><strong>GitHub:</strong> <a href="https://github.com/kraghavan/schema-travels">github.com/kraghavan/schema-travels</a></p>

<p><em>Questions, feedback, or war stories from your own migrations? Connect with me on <a href="https://linkedin.com/in/karthikaraghavan">LinkedIn</a>.</em></p>]]></content><author><name>Karthika Raghavan</name></author><summary type="html"><![CDATA[Translating SQL to NoSQL: Architecture Deep-Dive]]></summary></entry><entry><title type="html">Building a Privacy-Aware LLM Gateway: Benchmarking Results</title><link href="https://kraghavan.github.io/llm/infrastructure/benchmarks/2026/03/21/inference-sentinel-benchmarks2.html" rel="alternate" type="text/html" title="Building a Privacy-Aware LLM Gateway: Benchmarking Results" /><published>2026-03-21T00:00:00+00:00</published><updated>2026-03-21T00:00:00+00:00</updated><id>https://kraghavan.github.io/llm/infrastructure/benchmarks/2026/03/21/inference-sentinel-benchmarks2</id><content type="html" xml:base="https://kraghavan.github.io/llm/infrastructure/benchmarks/2026/03/21/inference-sentinel-benchmarks2.html"><![CDATA[<h1 id="building-a-privacy-aware-llm-gateway-benchmarking-results">Building a Privacy-Aware LLM Gateway: Benchmarking Results</h1>

<p><em>Part 2 of 2: Empirical evaluation of classification accuracy, routing performance, and cost attribution</em></p>

<hr />

<h2 id="abstract">Abstract</h2>

<p>In <a href="/llm/infrastructure/smart%20gateway/python/2026/03/20/inference-sentinel-architecture.html">Part 1</a>, I described the architecture of inference-sentinel, a privacy-aware LLM routing gateway. This post presents empirical results from five experiments evaluating classification accuracy, routing latency, cost efficiency, controller effectiveness, and session stickiness.</p>

<p><strong>Key findings:</strong></p>
<ul>
  <li>The hybrid classifier achieves 97.5% accuracy with 0.16ms mean latency, though a systematic failure mode in Tier 3 detection shows that lightweight NER alone cannot be trusted to catch healthcare identifiers</li>
  <li>Local inference introduces a 10× latency penalty compared to cloud backends, a fundamental trade-off for privacy preservation</li>
  <li>Routing 47.5% of traffic locally yields 68.6% cost savings, with an unexpected finding that Google’s Gemini is 44× cheaper per request than Anthropic’s Claude</li>
  <li>The closed-loop controller correctly withheld recommendations when local-cloud quality divergence exceeded thresholds</li>
</ul>

<hr />

<h2 id="1-experimental-setup">1. Experimental Setup</h2>

<h3 id="11-hardware-configuration">1.1 Hardware Configuration</h3>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>Specification</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Gateway Host</strong></td>
      <td>Docker container (inference-sentinel)</td>
    </tr>
    <tr>
      <td><strong>Local Inference</strong></td>
      <td>Apple Mac Mini M4, 16GB unified memory</td>
    </tr>
    <tr>
      <td><strong>Local Models</strong></td>
      <td>Ollama serving gemma3:4b and mistral (round-robin)</td>
    </tr>
    <tr>
      <td><strong>Cloud Backends</strong></td>
      <td>Claude Sonnet 4 (Anthropic), Gemini 2.0 Flash (Google)</td>
    </tr>
  </tbody>
</table>

<h3 id="12-dataset">1.2 Dataset</h3>

<p>I constructed a synthetic evaluation dataset of 200 prompts with known ground-truth privacy labels, balanced across four tiers:</p>

<table>
  <thead>
    <tr>
      <th>Tier</th>
      <th>Label</th>
      <th>Count</th>
      <th>Example Patterns</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0</td>
      <td>PUBLIC</td>
      <td>50</td>
      <td>General knowledge questions</td>
    </tr>
    <tr>
      <td>1</td>
      <td>INTERNAL</td>
      <td>50</td>
      <td>Project codes, internal URLs</td>
    </tr>
    <tr>
      <td>2</td>
      <td>CONFIDENTIAL</td>
      <td>50</td>
      <td>Email addresses, phone numbers</td>
    </tr>
    <tr>
      <td>3</td>
      <td>RESTRICTED</td>
      <td>50</td>
      <td>SSNs, credit cards, health records</td>
    </tr>
  </tbody>
</table>

<p>The balanced design enables per-class precision/recall analysis without class imbalance confounds.</p>

<h3 id="13-experimental-protocol">1.3 Experimental Protocol</h3>

<p>Each experiment was run independently with the gateway in a fresh state. Metrics were collected via Prometheus and exported to JSON for analysis. All experiments used the same dataset to enable cross-experiment comparison.</p>

<hr />

<h2 id="2-experiment-1-classification-accuracy">2. Experiment 1: Classification Accuracy</h2>

<p><strong>Research Question:</strong> How accurately does the hybrid classifier assign privacy tiers, and what are the failure modes?</p>

<h3 id="21-results">2.1 Results</h3>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Overall Accuracy</strong></td>
      <td>97.5% (195/200)</td>
    </tr>
    <tr>
      <td><strong>Mean Classification Time</strong></td>
      <td>0.16ms</td>
    </tr>
    <tr>
      <td><strong>Misclassifications</strong></td>
      <td>5</td>
    </tr>
  </tbody>
</table>

<p><img src="/assets/images/inference-sentinel/tier_metrics.png" alt="Classification Metrics by Tier" />
<em>Figure 1: Per-tier precision, recall, and F1 scores. The dashed line indicates the 95% threshold. Tier 3 recall (90%) falls below threshold due to undetected health insurance identifiers.</em></p>

<h3 id="22-per-tier-analysis">2.2 Per-Tier Analysis</h3>

<table>
  <thead>
    <tr>
      <th>Tier</th>
      <th>Precision</th>
      <th>Recall</th>
      <th>F1</th>
      <th>TP</th>
      <th>FP</th>
      <th>FN</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0 (PUBLIC)</td>
      <td>90.9%</td>
      <td>100%</td>
      <td>95.2%</td>
      <td>50</td>
      <td>5</td>
      <td>0</td>
    </tr>
    <tr>
      <td>1 (INTERNAL)</td>
      <td>100%</td>
      <td>100%</td>
      <td>100%</td>
      <td>50</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <td>2 (CONFIDENTIAL)</td>
      <td>100%</td>
      <td>100%</td>
      <td>100%</td>
      <td>50</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <td>3 (RESTRICTED)</td>
      <td>100%</td>
      <td>90%</td>
      <td>94.7%</td>
      <td>45</td>
      <td>0</td>
      <td>5</td>
    </tr>
  </tbody>
</table>

<p>The asymmetry between Tier 0 and Tier 3 is notable: Tier 0 has 5 false positives (items incorrectly classified as PUBLIC), while Tier 3 has 5 false negatives (restricted items missed). These are the same 5 samples — Tier 3 prompts misclassified as Tier 0.</p>

<p><img src="/assets/images/inference-sentinel/confusion_matrix.png" alt="Confusion Matrix" />
<em>Figure 2: Confusion matrix showing classification performance. Off-diagonal entries appear only in the Tier 3→Tier 0 cell, indicating a systematic failure mode rather than random errors.</em></p>

<h3 id="23-error-analysis">2.3 Error Analysis</h3>

<p>All five misclassifications share a common pattern:</p>

<table>
  <thead>
    <tr>
      <th>Prompt ID</th>
      <th>Text Preview</th>
      <th>Expected</th>
      <th>Predicted</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>tier3_277eb0aa</td>
      <td>“Patient Alyssa Nelson, health insurance ID: FRJ508021882, prescribed Lisinopril 10mg daily.”</td>
      <td>Tier 3</td>
      <td>Tier 0</td>
    </tr>
    <tr>
      <td>tier3_67f1698c</td>
      <td>“Patient Joseph Hill, health insurance ID: WRY526164289, prescribed Metformin 25mg twice daily.”</td>
      <td>Tier 3</td>
      <td>Tier 0</td>
    </tr>
    <tr>
      <td>tier3_8989c5ab</td>
      <td>“Patient Michael Weaver, health insurance ID: IIP473415078, prescribed Atorvastatin 25mg twice daily.”</td>
      <td>Tier 3</td>
      <td>Tier 0</td>
    </tr>
    <tr>
      <td>tier3_02ed73ce</td>
      <td>“Patient Levi Fowler, health insurance ID: WRV424211872, prescribed Metformin 500mg with meals.”</td>
      <td>Tier 3</td>
      <td>Tier 0</td>
    </tr>
    <tr>
      <td>tier3_bb134f02</td>
      <td>“Patient Brandon Davis, health insurance ID: SIE176051319, prescribed Metformin 10mg daily.”</td>
      <td>Tier 3</td>
      <td>Tier 0</td>
    </tr>
  </tbody>
</table>

<p><strong>Root Cause Analysis:</strong></p>

<p>NER was enabled during this benchmark, yet the <code class="language-plaintext highlighter-rouge">PERSON_NAME</code> entities were not detected. The failures stem from two factors:</p>

<ol>
  <li>
    <p><strong>Missing regex pattern:</strong> The health insurance ID format (<code class="language-plaintext highlighter-rouge">[A-Z]{3}\d{9}</code>) is not covered by existing Tier 3 patterns, which target SSNs (<code class="language-plaintext highlighter-rouge">\d{3}-\d{2}-\d{4}</code>), credit cards (Luhn-valid sequences), and MRN patterns (<code class="language-plaintext highlighter-rouge">MRN:\s*\d+</code>).</p>
  </li>
  <li>
    <p><strong>NER model limitation:</strong> The HuggingFace Transformers BERT model (<code class="language-plaintext highlighter-rouge">dslim/bert-base-NER</code>) failed to recognize the person names in medical record context. The phrase structure “Patient [Name], health insurance ID…” appears to confuse the model — likely because “Patient” is parsed as part of the name span, or the surrounding medical terminology disrupts entity boundary detection.</p>
  </li>
</ol>

<p>Examining the detection results:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"expected_entities"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"PERSON_NAME"</span><span class="p">,</span><span class="w"> </span><span class="s2">"PERSON_NAME"</span><span class="p">],</span><span class="w">
  </span><span class="nl">"detected_entities"</span><span class="p">:</span><span class="w"> </span><span class="p">[]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>The NER model returned zero entities despite clear person names being present. This is a known limitation of lightweight NER models on domain-specific text — medical, legal, and financial documents often require fine-tuned models.</p>

<p><strong>Implication:</strong> The 10% false negative rate on Tier 3 represents exactly the failure mode that matters most in a privacy system — restricted data being classified as public. This is not acceptable for production deployment without remediation.</p>

<h3 id="24-remediation">2.4 Remediation</h3>

<p>Three complementary approaches:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Fix 1: Add regex pattern for health insurance IDs
</span><span class="n">PATTERNS</span><span class="p">[</span><span class="s">"health_insurance"</span><span class="p">]</span> <span class="o">=</span> <span class="sa">r</span><span class="s">'\b(?:health\s*insurance\s*id|member\s*id)[:\s]*[A-Z]{2,4}\d{8,12}\b'</span>

<span class="c1"># Fix 2: Add "Patient [Name]" pattern for medical contexts
</span><span class="n">PATTERNS</span><span class="p">[</span><span class="s">"patient_name"</span><span class="p">]</span> <span class="o">=</span> <span class="sa">r</span><span class="s">'\bPatient\s+[A-Z][a-z]+\s+[A-Z][a-z]+\b'</span>

<span class="c1"># Fix 3: Consider larger NER model for production
# "accurate" mode uses Jean-Baptiste/roberta-large-ner-english
</span></code></pre></div></div>

<p>The regex-first approach is particularly important here: rather than relying solely on NER for entity detection, adding domain-specific patterns provides a deterministic safety net for known sensitive formats.</p>
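<p>A quick sanity check (a sketch, not the project’s test suite) confirms the two new patterns catch the prompts that slipped through. Compiling with <code>re.IGNORECASE</code> is my assumption; the pattern as written matches a lowercase “id” while the failing prompts use “ID”:</p>

```python
import re

# Hypothetical recreation of Fix 1 and Fix 2, compiled case-insensitively
# (an assumption -- the raw pattern would otherwise miss the uppercase "ID").
PATTERNS = {
    "health_insurance": r'\b(?:health\s*insurance\s*id|member\s*id)[:\s]*[A-Z]{2,4}\d{8,12}\b',
    "patient_name": r'\bPatient\s+[A-Z][a-z]+\s+[A-Z][a-z]+\b',
}

# One of the five misclassified prompts from Section 2.3
prompt = ("Patient Alyssa Nelson, health insurance ID: FRJ508021882, "
          "prescribed Lisinopril 10mg daily.")

# Both deterministic patterns now fire, independent of the NER model
hits = [name for name, pat in PATTERNS.items()
        if re.search(pat, prompt, re.IGNORECASE)]
```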

<hr />

<h2 id="3-experiment-2-routing-performance">3. Experiment 2: Routing Performance</h2>

<p><strong>Research Question:</strong> What latency overhead does the gateway introduce, and how does local inference compare to cloud?</p>

<h3 id="31-results">3.1 Results</h3>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Total Requests</strong></td>
      <td>200</td>
    </tr>
    <tr>
      <td><strong>Successful</strong></td>
      <td>183 (91.5%)</td>
    </tr>
    <tr>
      <td><strong>Failed</strong></td>
      <td>17 (8.5%)</td>
    </tr>
    <tr>
      <td><strong>Total Duration</strong></td>
      <td>1,421 seconds</td>
    </tr>
    <tr>
      <td><strong>Throughput</strong></td>
      <td>0.13 req/s</td>
    </tr>
  </tbody>
</table>

<p><img src="/assets/images/inference-sentinel/latency_distribution.png" alt="Latency Distribution" />
<em>Figure 3: End-to-end latency distribution showing heavy right tail. The p99 latency (62.9s) is 26× higher than p50 (2.4s), indicating high variance primarily from local inference.</em></p>

<h3 id="32-latency-decomposition">3.2 Latency Decomposition</h3>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>Mean Latency</th>
      <th>% of Total</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Classification</td>
      <td>1.47ms</td>
      <td>0.02%</td>
    </tr>
    <tr>
      <td>Routing Decision</td>
      <td>0.22ms</td>
      <td>0.003%</td>
    </tr>
    <tr>
      <td>Inference</td>
      <td>7,729ms</td>
      <td>99.98%</td>
    </tr>
  </tbody>
</table>

<p><strong>Key Finding:</strong> The gateway overhead (classification + routing) is <strong>1.69ms</strong> — effectively invisible relative to inference time. The privacy-aware routing layer does not meaningfully impact end-to-end latency.</p>

<h3 id="33-latency-by-route">3.3 Latency by Route</h3>

<table>
  <thead>
    <tr>
      <th>Route</th>
      <th>Tier</th>
      <th>Count</th>
      <th>Mean</th>
      <th>p50</th>
      <th>p95</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Cloud</td>
      <td>0 (PUBLIC)</td>
      <td>50</td>
      <td>1,669ms</td>
      <td>1,755ms</td>
      <td>2,737ms</td>
    </tr>
    <tr>
      <td>Cloud</td>
      <td>1 (INTERNAL)</td>
      <td>50</td>
      <td>1,571ms</td>
      <td>1,182ms</td>
      <td>3,059ms</td>
    </tr>
    <tr>
      <td>Local</td>
      <td>2 (CONFIDENTIAL)</td>
      <td>42</td>
      <td>15,523ms</td>
      <td>9,949ms</td>
      <td>60,722ms</td>
    </tr>
    <tr>
      <td>Local</td>
      <td>3 (RESTRICTED)</td>
      <td>36</td>
      <td>16,643ms</td>
      <td>5,959ms</td>
      <td>48,418ms</td>
    </tr>
  </tbody>
</table>

<p><strong>The latency trade-off is stark:</strong> Local inference is approximately <strong>10× slower</strong> than cloud. This is the fundamental cost of privacy preservation with consumer-grade hardware.</p>

<h3 id="34-error-analysis">3.4 Error Analysis</h3>

<p>All 17 failures returned <code class="language-plaintext highlighter-rouge">HTTP 503: No healthy local backends available</code>. Examining the error distribution:</p>

<table>
  <thead>
    <tr>
      <th>Tier</th>
      <th>Failures</th>
      <th>Total Requests</th>
      <th>Failure Rate</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Tier 2</td>
      <td>8</td>
      <td>50</td>
      <td>16%</td>
    </tr>
    <tr>
      <td>Tier 3</td>
      <td>9</td>
      <td>50</td>
      <td>18%</td>
    </tr>
    <tr>
      <td>Tier 0-1</td>
      <td>0</td>
      <td>100</td>
      <td>0%</td>
    </tr>
  </tbody>
</table>

<p>Failures occurred exclusively on local-routed traffic. The root cause is <strong>memory pressure</strong>: the Mac Mini M4 with 16GB unified memory struggles to serve concurrent requests across two loaded models (gemma3:4b ≈ 3GB, mistral ≈ 4GB).</p>

<p><strong>Mitigation strategies:</strong></p>
<ol>
  <li>Reduce to single local model (eliminates round-robin memory contention)</li>
  <li>Implement request queuing with backpressure</li>
  <li>Upgrade to 32GB+ RAM for concurrent model serving</li>
  <li>Increase health check timeout to tolerate transient memory pressure</li>
</ol>
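<p>Mitigation 2 can be sketched with a bounded semaphore in front of the local backends: queue briefly for a slot, then shed load instead of letting Ollama hit memory pressure. A minimal illustration, where <code>max_inflight</code> and <code>queue_timeout_s</code> are hypothetical knobs rather than inference-sentinel configuration:</p>

```python
import asyncio

class LocalBackendGate:
    """Backpressure sketch: cap concurrent local-inference requests."""

    def __init__(self, max_inflight: int = 2, queue_timeout_s: float = 30.0):
        self._sem = asyncio.Semaphore(max_inflight)
        self._timeout = queue_timeout_s

    async def run(self, infer):
        try:
            # Queue for a slot instead of piling requests onto a 16GB host
            await asyncio.wait_for(self._sem.acquire(), timeout=self._timeout)
        except asyncio.TimeoutError:
            # Shed load explicitly; the gateway can map this to HTTP 503
            raise RuntimeError("local backend saturated")
        try:
            return await infer()
        finally:
            self._sem.release()

async def _demo():
    gate = LocalBackendGate(max_inflight=1, queue_timeout_s=0.01)

    async def slow():
        await asyncio.sleep(0.05)
        return "ok"

    first = asyncio.create_task(gate.run(slow))
    await asyncio.sleep(0.005)        # let the first request take the slot
    try:
        await gate.run(slow)          # second request times out in the queue
        shed = False
    except RuntimeError:
        shed = True
    return await first, shed

result, shed = asyncio.run(_demo())
```

<p>The demo reproduces the behavior seen in Section 3.4: one request holds the only slot, and the queued request is shed with the error the gateway would surface as HTTP 503.</p>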

<hr />

<h2 id="4-experiment-3-cost-attribution">4. Experiment 3: Cost Attribution</h2>

<p><strong>Research Question:</strong> What are the realized cost savings from local routing, and how do cloud backends compare?</p>

<h3 id="41-results">4.1 Results</h3>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Actual Cost</strong></td>
      <td>$0.0844</td>
    </tr>
    <tr>
      <td><strong>Hypothetical All-Cloud</strong></td>
      <td>$0.2687</td>
    </tr>
    <tr>
      <td><strong>Savings</strong></td>
      <td>$0.1843 (68.6%)</td>
    </tr>
  </tbody>
</table>

<p><img src="/assets/images/inference-sentinel/cost_comparison.png" alt="Cost Attribution" />
<em>Figure 4: Cost comparison showing actual spend vs. hypothetical all-cloud routing. Local routing of Tier 2-3 traffic yields 68.6% cost reduction.</em></p>

<h3 id="42-routing-distribution">4.2 Routing Distribution</h3>

<p><img src="/assets/images/inference-sentinel/routing_distribution.png" alt="Routing Distribution" />
<em>Figure 5: Request distribution by route. 42.6% of successfully completed requests were served locally (privacy-sensitive), 57.4% by cloud (non-sensitive); counting the 17 failed requests, 47.5% of all traffic was routed local.</em></p>

<table>
  <thead>
    <tr>
      <th>Route</th>
      <th>Requests</th>
      <th>Percentage</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Local</td>
      <td>95</td>
      <td>47.5%</td>
    </tr>
    <tr>
      <td>Cloud</td>
      <td>105</td>
      <td>52.5%</td>
    </tr>
  </tbody>
</table>

<h3 id="43-backend-cost-analysis">4.3 Backend Cost Analysis</h3>

<table>
  <thead>
    <tr>
      <th>Backend</th>
      <th>Requests</th>
      <th>Total Cost</th>
      <th>Cost/Request</th>
      <th>Cost/1K Tokens</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Anthropic (Claude)</td>
      <td>53</td>
      <td>$0.0826</td>
      <td>$0.00156</td>
      <td>$0.0130</td>
    </tr>
    <tr>
      <td>Google (Gemini)</td>
      <td>52</td>
      <td>$0.00185</td>
      <td>$0.0000355</td>
      <td>$0.00036</td>
    </tr>
    <tr>
      <td>Local (Ollama)</td>
      <td>95</td>
      <td>$0.00</td>
      <td>$0.00</td>
      <td>$0.00</td>
    </tr>
  </tbody>
</table>

<p><img src="/assets/images/inference-sentinel/cost_by_backend.png" alt="Cost by Backend" />
<em>Figure 6: Cost distribution by backend. Anthropic accounts for 97.8% of cloud spend despite handling only 50.5% of cloud requests.</em></p>

<p><strong>Unexpected Finding:</strong> Gemini is <strong>44× cheaper per request</strong> than Claude ($0.0000355 vs $0.00156). This suggests a potential optimization: use Gemini as the primary cloud backend for cost-sensitive workloads, reserving Claude for quality-critical requests.</p>
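<p>The suggested optimization amounts to a one-line routing policy. A sketch using the per-1K-token costs from the table above (the backend names are labels of my choosing, not inference-sentinel configuration keys):</p>

```python
# Observed cost per 1K tokens (Section 4.3)
COST_PER_1K = {"gemini": 0.00036, "claude": 0.0130}

def pick_cloud_backend(quality_critical: bool) -> str:
    """Cheapest backend by default; the premium model only when flagged."""
    return "claude" if quality_critical else "gemini"

def blended_cost_per_1k(premium_share: float) -> float:
    """Expected cloud cost per 1K tokens if `premium_share` of cloud
    traffic is flagged quality-critical."""
    return (premium_share * COST_PER_1K["claude"]
            + (1 - premium_share) * COST_PER_1K["gemini"])
```

<p>Even reserving Claude for 20% of cloud traffic would cut the blended cloud rate from $0.0130 to roughly $0.0029 per 1K tokens.</p>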

<h3 id="44-cost-by-privacy-tier">4.4 Cost by Privacy Tier</h3>

<table>
  <thead>
    <tr>
      <th>Tier</th>
      <th>Requests</th>
      <th>Routed Local</th>
      <th>Cost</th>
      <th>Savings</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0 (PUBLIC)</td>
      <td>55</td>
      <td>0</td>
      <td>$0.0400</td>
      <td>$0.0420</td>
    </tr>
    <tr>
      <td>1 (INTERNAL)</td>
      <td>50</td>
      <td>0</td>
      <td>$0.0444</td>
      <td>$0.0248</td>
    </tr>
    <tr>
      <td>2 (CONFIDENTIAL)</td>
      <td>50</td>
      <td>50</td>
      <td>$0.00</td>
      <td>$0.0575</td>
    </tr>
    <tr>
      <td>3 (RESTRICTED)</td>
      <td>45</td>
      <td>45</td>
      <td>$0.00</td>
      <td>$0.0600</td>
    </tr>
  </tbody>
</table>

<p><strong>Tier 2 and Tier 3 traffic incurs zero marginal cost</strong> after hardware investment. For organizations with significant sensitive data volumes, the ROI calculation favors local inference.</p>

<h3 id="45-projected-annual-savings">4.5 Projected Annual Savings</h3>

<p>Extrapolating from observed cost ratios:</p>

<table>
  <thead>
    <tr>
      <th>Daily Volume</th>
      <th>Annual Cloud-Only</th>
      <th>Annual with Sentinel</th>
      <th>Savings</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1,000 req</td>
      <td>$490</td>
      <td>$154</td>
      <td>$336</td>
    </tr>
    <tr>
      <td>10,000 req</td>
      <td>$4,900</td>
      <td>$1,540</td>
      <td>$3,360</td>
    </tr>
    <tr>
      <td>100,000 req</td>
      <td>$49,000</td>
      <td>$15,400</td>
      <td>$33,600</td>
    </tr>
  </tbody>
</table>

<p><strong>Caveat:</strong> These projections assume similar traffic distribution (47.5% local-eligible) and do not account for hardware depreciation, electricity, or operational overhead.</p>
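<p>The projection is a straight per-request scaling of the observed totals, which the following sketch makes explicit (the constants are the Section 4.1 measurements):</p>

```python
# Observed totals for the 200-request benchmark (Section 4.1)
ACTUAL_COST = 0.0844        # with privacy-aware routing
ALL_CLOUD_COST = 0.2687     # hypothetical all-cloud routing

def annual_costs(daily_requests: int):
    """Scale per-request cost to a year, assuming the same traffic mix
    (~47.5% local-eligible) and ignoring hardware/energy/ops overhead."""
    yearly = daily_requests * 365
    cloud_only = yearly * ALL_CLOUD_COST / 200
    with_sentinel = yearly * ACTUAL_COST / 200
    return cloud_only, with_sentinel, cloud_only - with_sentinel
```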

<hr />

<h2 id="5-experiment-4-controller-effectiveness">5. Experiment 4: Controller Effectiveness</h2>

<p><strong>Research Question:</strong> Does the closed-loop controller generate actionable routing recommendations?</p>

<h3 id="51-results">5.1 Results</h3>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Controller Evaluations</strong></td>
      <td>117</td>
    </tr>
    <tr>
      <td><strong>Recommendations Generated</strong></td>
      <td>0</td>
    </tr>
    <tr>
      <td><strong>Drift Detected</strong></td>
      <td>No</td>
    </tr>
  </tbody>
</table>

<h3 id="52-shadow-mode-metrics">5.2 Shadow Mode Metrics</h3>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Shadow Runs</strong></td>
      <td>340</td>
    </tr>
    <tr>
      <td><strong>Successful Comparisons</strong></td>
      <td>275</td>
    </tr>
    <tr>
      <td><strong>Quality Match Rate</strong></td>
      <td>0%</td>
    </tr>
    <tr>
      <td><strong>Cost Savings Tracked</strong></td>
      <td>$0.15</td>
    </tr>
  </tbody>
</table>

<h3 id="53-analysis">5.3 Analysis</h3>

<p>The controller generated zero recommendations because the <strong>quality match rate was 0%</strong>. This means local model responses (gemma3:4b, mistral) were semantically dissimilar enough from cloud responses (Claude, Gemini) that they never crossed the similarity threshold (default: 85%).</p>

<p><strong>This is informative, not a failure.</strong> The controller correctly identified that:</p>
<ol>
  <li>Local models produce qualitatively different outputs than cloud models</li>
  <li>Promoting Tier 0-1 traffic from cloud to local would degrade response quality</li>
  <li>The conservative default (keep on cloud) is appropriate</li>
</ol>

<h3 id="54-interpretation">5.4 Interpretation</h3>

<p>The 0% quality match rate reflects the capability gap between 4B-parameter local models and frontier cloud models. For tasks where approximate answers suffice, lowering the similarity threshold would generate recommendations:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">controller</span><span class="pi">:</span>
  <span class="na">quality_threshold</span><span class="pi">:</span> <span class="m">0.70</span>  <span class="c1"># Down from 0.85</span>
</code></pre></div></div>

<p>However, this requires explicit acceptance of quality trade-offs — a decision the controller correctly defers to human operators.</p>
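<p>For reference, the quality-match check is a thresholded cosine similarity over response embeddings. A minimal sketch; the function names are mine, not the shadow-mode API:</p>

```python
import math

QUALITY_THRESHOLD = 0.85  # default; 0.70 is the relaxed setting above

def cosine_similarity(a, b):
    """Cosine similarity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def quality_match(local_emb, cloud_emb, threshold=QUALITY_THRESHOLD):
    """Shadow mode counts a match when the local response embedding is
    close enough to the cloud response embedding."""
    return cosine_similarity(local_emb, cloud_emb) >= threshold
```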

<hr />

<h2 id="6-experiment-5-session-stickiness">6. Experiment 5: Session Stickiness</h2>

<p><strong>Research Question:</strong> Does the one-way trapdoor correctly lock sessions after PII detection?</p>

<h3 id="61-results">6.1 Results</h3>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Sessions Tested</strong></td>
      <td>20</td>
    </tr>
    <tr>
      <td><strong>Requests per Session</strong></td>
      <td>10</td>
    </tr>
    <tr>
      <td><strong>PII Probability</strong></td>
      <td>30%</td>
    </tr>
    <tr>
      <td><strong>Sessions Locked</strong></td>
      <td>0</td>
    </tr>
    <tr>
      <td><strong>Trapdoor Violations</strong></td>
      <td>0</td>
    </tr>
  </tbody>
</table>

<h3 id="62-analysis">6.2 Analysis</h3>

<p><strong>Zero sessions were locked</strong> despite 30% of requests containing PII and session tracking being enabled. Examining the test methodology reveals the issue:</p>

<p>The benchmark harness sent requests with simulated IPs (<code class="language-plaintext highlighter-rouge">10.0.0.0</code> through <code class="language-plaintext highlighter-rouge">10.0.0.19</code>), but the session ID computation may not have properly differentiated these synthetic sources. Additionally, the PII detection failures identified in Experiment 1 (the 5 health insurance records) would have prevented those sessions from locking — if PII isn’t detected, the trapdoor isn’t triggered.</p>

<p><strong>Contributing factors:</strong></p>

<ol>
  <li>
    <p><strong>Classification dependency:</strong> Session locking requires Tier 2+ classification. The 10% Tier 3 false negative rate means some PII-containing requests were classified as Tier 0, preventing session locks.</p>
  </li>
  <li>
    <p><strong>Test methodology:</strong> Requests originated from the same physical host with simulated client IPs. The session ID hashing (<code class="language-plaintext highlighter-rouge">SHA-256(client_ip + daily_salt)</code>) should differentiate these, but the harness may need validation.</p>
  </li>
  <li>
    <p><strong>PII probability vs. detection:</strong> The 30% PII probability applies to dataset generation, but if those PII patterns aren’t detected by the classifier, sessions won’t lock.</p>
  </li>
</ol>
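<p>Validating point 2 is cheap enough to fold into the harness. A sketch of the session ID derivation as described, where the exact string encoding and salt construction are assumptions about the implementation:</p>

```python
import hashlib
from datetime import date

def session_id(client_ip: str, daily_salt: str) -> str:
    """SHA-256(client_ip + daily_salt), hex-encoded."""
    return hashlib.sha256((client_ip + daily_salt).encode()).hexdigest()

# The check the harness should run: 20 synthetic IPs -> 20 distinct sessions
salt = date(2026, 3, 21).isoformat()   # illustrative daily salt
ids = {session_id(f"10.0.0.{i}", salt) for i in range(20)}
```

<p>If the set has fewer than 20 entries, the harness, not the gateway, is collapsing sessions.</p>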

<h3 id="63-what-we-can-validate">6.3 What We Can Validate</h3>

<p>Despite no sessions locking, the core privacy invariant held:</p>

<ul>
  <li><strong>Trapdoor violations: 0</strong> — No request with <em>detected</em> PII ever routed to cloud</li>
  <li><strong>Per-request classification: Functional</strong> — Requests that were classified as sensitive routed locally</li>
</ul>

<p>The gap is between “contains PII” (ground truth) and “detected as PII” (classifier output).</p>

<h3 id="64-required-follow-up">6.4 Required Follow-Up</h3>

<p>A proper session stickiness evaluation requires:</p>

<ol>
  <li><strong>Fix classification first:</strong> Address the Tier 3 detection gaps so PII is actually detected</li>
  <li><strong>Validate session ID generation:</strong> Ensure synthetic client IPs produce distinct session IDs</li>
  <li><strong>Use diverse source IPs:</strong> Run from multiple actual hosts or containers</li>
  <li><strong>Add session state logging:</strong> Instrument the session manager to log state transitions</li>
</ol>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Re-run with verified distinct sessions</span>
python <span class="nt">-m</span> benchmarks.harness <span class="nt">--experiment</span> session <span class="nt">--sessions</span> 20 <span class="nt">--verify-session-ids</span>
</code></pre></div></div>

<p>Expected behavior after fixes:</p>
<ul>
  <li>Sessions should lock when Tier 2+ PII is detected</li>
  <li>Subsequent requests in that session should route to local regardless of content</li>
  <li>Locked session count should approximate <code class="language-plaintext highlighter-rouge">sessions × pii_probability × detection_rate</code></li>
</ul>
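<p>That approximation treats PII probability as a per-session quantity. Since the harness injects PII per request (30% across 10 requests per session), a per-request model (my refinement, not the project’s) predicts that nearly every session should lock once detection works:</p>

```python
def expected_locked_simple(sessions, pii_probability, detection_rate):
    # Coarse estimate from the text: sessions x pii_probability x detection_rate
    return sessions * pii_probability * detection_rate

def expected_locked_per_request(sessions, requests_per_session,
                                pii_probability, detection_rate):
    # Refinement (assumes independent requests): a session locks on the
    # first request that both contains PII and is detected.
    p_hit = pii_probability * detection_rate
    return sessions * (1 - (1 - p_hit) ** requests_per_session)
```

<p>With 20 sessions of 10 requests at 30% PII and, say, 90% detection, the per-request model predicts roughly 19 locked sessions versus about 5 from the coarse formula, a useful discriminator when re-running the experiment.</p>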

<hr />

<h2 id="7-discussion">7. Discussion</h2>

<h3 id="71-principal-findings">7.1 Principal Findings</h3>

<table>
  <thead>
    <tr>
      <th>Hypothesis</th>
      <th>Result</th>
      <th>Verdict</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Classification adds minimal latency</td>
      <td>1.69ms overhead</td>
      <td>✅ Confirmed</td>
    </tr>
    <tr>
      <td>Accuracy exceeds 95%</td>
      <td>97.5% overall</td>
      <td>✅ Confirmed</td>
    </tr>
    <tr>
      <td>Local inference is slower</td>
      <td>10× latency penalty</td>
      <td>⚠️ Confirmed (expected)</td>
    </tr>
    <tr>
      <td>Cost savings are significant</td>
      <td>68.6% reduction</td>
      <td>✅ Confirmed</td>
    </tr>
    <tr>
      <td>Controller generates recommendations</td>
      <td>0 recommendations</td>
      <td>⚠️ By design (quality gap)</td>
    </tr>
    <tr>
      <td>Sessions lock on PII detection</td>
      <td>0 sessions locked</td>
      <td>⚠️ Requires improved test methodology</td>
    </tr>
  </tbody>
</table>

<h3 id="72-limitations">7.2 Limitations</h3>

<p><strong>Dataset Size:</strong> 200 prompts is sufficient for detecting large effect sizes but underpowered for rare failure modes. A production evaluation should use 10,000+ samples.</p>

<p><strong>Synthetic Data:</strong> The evaluation dataset was synthetically generated with known patterns. Real-world PII distributions may differ, particularly for domain-specific identifiers.</p>

<p><strong>Single Hardware Configuration:</strong> Results reflect a specific hardware setup (M4 Mac Mini, 16GB). Performance characteristics will vary with different local inference hardware.</p>

<p><strong>NER Model Limitations:</strong> The lightweight BERT NER model (<code class="language-plaintext highlighter-rouge">dslim/bert-base-NER</code>, “fast” mode) failed to detect person names in medical record contexts, revealing domain-specific NER gaps that require either fine-tuning or larger models like RoBERTa.</p>

<p><strong>Session Test Methodology:</strong> The session stickiness experiment used simulated IPs from a single host, and classification failures prevented some PII-containing requests from triggering session locks.</p>

<h3 id="73-threats-to-validity">7.3 Threats to Validity</h3>

<p><strong>Internal Validity:</strong> The 17 routing failures (8.5%) due to memory pressure may have biased latency statistics toward successful (potentially faster) requests.</p>

<p><strong>External Validity:</strong> The balanced tier distribution (25% per tier) does not reflect production traffic, which is typically skewed toward Tier 0-1.</p>

<p><strong>Construct Validity:</strong> Semantic similarity (cosine distance on embeddings) may not capture task-specific quality dimensions relevant to specific use cases.</p>

<hr />

<h2 id="8-conclusion">8. Conclusion</h2>

<h3 id="81-summary-of-contributions">8.1 Summary of Contributions</h3>

<p>This work presents <strong>inference-sentinel</strong>, a privacy-aware LLM routing gateway that addresses a gap in the current MLOps landscape: the ability to enforce data residency policies at inference time without sacrificing developer experience.</p>

<p><strong>What we built:</strong></p>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>Contribution</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Hybrid Classifier</strong></td>
      <td>Two-stage pipeline (regex + NER) achieving 97.5% accuracy at 0.16ms latency — fast enough for real-time routing decisions</td>
    </tr>
    <tr>
      <td><strong>Session Manager</strong></td>
      <td>One-way trapdoor state machine ensuring PII-containing sessions are permanently locked to local inference, with cryptographic session ID hashing</td>
    </tr>
    <tr>
      <td><strong>Context Handoff</strong></td>
      <td>Rolling buffer mechanism preserving conversation continuity during mid-session cloud→local transitions, with optional PII scrubbing</td>
    </tr>
    <tr>
      <td><strong>Backend Manager</strong></td>
      <td>Pluggable selection strategies (priority, round-robin, latency-aware) with automatic health checking and failover</td>
    </tr>
    <tr>
      <td><strong>Shadow Mode</strong></td>
      <td>Non-blocking A/B comparison framework collecting quality metrics without impacting user-facing latency</td>
    </tr>
    <tr>
      <td><strong>Closed-Loop Controller</strong></td>
      <td>Rule-based recommendation engine that observes traffic patterns and suggests routing policy adjustments</td>
    </tr>
    <tr>
      <td><strong>Observability Stack</strong></td>
      <td>Full OpenTelemetry integration with Prometheus metrics, Grafana dashboards, and structured logging</td>
    </tr>
  </tbody>
</table>

<p><strong>What the benchmarks revealed:</strong></p>

<p>The evaluation across 200 synthetic prompts demonstrated that privacy-aware routing is feasible with sub-2ms overhead. The 68.6% cost savings from local routing validates the economic case, while the 10× latency penalty quantifies the privacy-performance trade-off. The systematic Tier 3 failures (health insurance IDs) highlight the importance of domain-specific pattern engineering and robust NER model selection — even with NER enabled, lightweight models may miss entities in specialized contexts.</p>

<p>This is not a production-ready system — it is a <strong>proof of architecture</strong> demonstrating that the building blocks exist and compose correctly.</p>

<hr />

<h3 id="82-future-work">8.2 Future Work</h3>

<h4 id="821-scaling-the-evaluation">8.2.1 Scaling the Evaluation</h4>

<p>The current benchmark uses 200 prompts — sufficient for detecting large effects but underpowered for tail behavior analysis. Future work should include:</p>

<table>
  <thead>
    <tr>
      <th>Benchmark</th>
      <th>Purpose</th>
      <th>Target</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Extended classification</strong></td>
      <td>Rare pattern detection, edge cases</td>
      <td>1,000+ prompts</td>
    </tr>
    <tr>
      <td><strong>Load testing</strong></td>
      <td>Concurrent request handling, memory pressure</td>
      <td>100 req/s sustained</td>
    </tr>
    <tr>
      <td><strong>Adversarial evaluation</strong></td>
      <td>Evasion attempts, prompt injection</td>
      <td>Red team dataset</td>
    </tr>
    <tr>
      <td><strong>Longitudinal study</strong></td>
      <td>Drift detection over weeks of traffic</td>
      <td>Production deployment</td>
    </tr>
  </tbody>
</table>

<p>Statistical power analysis suggests n=1,000+ is required to detect failure modes occurring at &lt;1% frequency with 95% confidence.</p>
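<p>The sample-size intuition can be sketched with a small calculation. Under the simplifying assumption of independent trials, the probability of observing a failure mode of frequency <em>p</em> at least once in <em>n</em> prompts is 1 − (1 − <em>p</em>)<sup><em>n</em></sup>; solving for the smallest <em>n</em> that reaches a given confidence gives:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

def min_samples(p, confidence=0.95):
    """Smallest n such that P(at least one occurrence) reaches `confidence`,
    assuming independent trials with per-trial failure probability p."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))

# 0.01 -&gt; 299, 0.005 -&gt; 598, 0.001 -&gt; 2995
for p in (0.01, 0.005, 0.001):
    print(p, min_samples(p))
</code></pre></div></div>

<p>A 1% failure mode needs roughly 300 prompts to surface at 95% confidence; rarer modes push the requirement well past 1,000 — consistent with the target above.</p>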

<h4 id="822-improving-ner-accuracy">8.2.2 Improving NER Accuracy</h4>

<p>The current implementation uses HuggingFace Transformers with <code class="language-plaintext highlighter-rouge">dslim/bert-base-NER</code> (“fast” mode, ~400MB) for named entity recognition. Several directions merit exploration:</p>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>Size</th>
      <th>Tradeoff</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dslim/bert-base-NER</code> (fast)</td>
      <td>~400MB</td>
      <td>Current default, good speed, limited domain coverage</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Jean-Baptiste/roberta-large-ner-english</code> (accurate)</td>
      <td>~1.3GB</td>
      <td>Higher accuracy, 3-5× latency</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Davlan/bert-base-multilingual-cased-ner-hrl</code> (multilingual)</td>
      <td>~700MB</td>
      <td>Multi-language support</td>
    </tr>
    <tr>
      <td><strong>Fine-tuned NER</strong></td>
      <td>Variable</td>
      <td>Domain-specific entities (PHI, financial identifiers)</td>
    </tr>
    <tr>
      <td><strong>Presidio</strong></td>
      <td>N/A</td>
      <td>Microsoft’s PII detection library, rule + ML hybrid</td>
    </tr>
    <tr>
      <td><strong>GLiNER</strong></td>
      <td>~200MB</td>
      <td>Zero-shot NER, no fine-tuning required</td>
    </tr>
  </tbody>
</table>

<p>The optimal choice depends on latency budget. For sub-50ms classification, the “fast” BERT model with expanded regex patterns may outperform larger models. For offline batch classification, RoBERTa-based NER (“accurate” mode) is viable.</p>
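<p>One way to operationalize this is a small model-selection helper. The sketch below uses the model names from the table; the base latency and the relative-latency multipliers are illustrative assumptions, not measurements:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Candidate NER models (names from the table above); the latency
# multipliers are rough assumptions for illustration only.
NER_MODELS = {
    "fast": ("dslim/bert-base-NER", 1.0),
    "accurate": ("Jean-Baptiste/roberta-large-ner-english", 4.0),
    "multilingual": ("Davlan/bert-base-multilingual-cased-ner-hrl", 2.0),
}

def choose_ner_model(latency_budget_ms, base_latency_ms=30.0, multilingual=False):
    """Pick the most accurate model whose estimated latency fits the budget."""
    if multilingual:
        return NER_MODELS["multilingual"][0]
    for mode in ("accurate", "fast"):
        name, multiplier = NER_MODELS[mode]
        if base_latency_ms * multiplier &lt;= latency_budget_ms:
            return name
    # Nothing fits the budget: fall back to regex-only classification
    return None
</code></pre></div></div>

<p>With a 50ms budget this selects the fast BERT model; only offline batch workloads with generous budgets get the RoBERTa variant.</p>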

<h4 id="823-gpu-accelerated-local-inference">8.2.3 GPU-Accelerated Local Inference</h4>

<p>The current setup runs local models on Apple Silicon (M4 Mac Mini) using Metal acceleration via Ollama. While sufficient for development and low-throughput production, this architecture has limitations:</p>

<table>
  <thead>
    <tr>
      <th>Constraint</th>
      <th>Current</th>
      <th>With GPU</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>VRAM</td>
      <td>16GB unified</td>
      <td>24-80GB dedicated</td>
    </tr>
    <tr>
      <td>Concurrent models</td>
      <td>1-2 (memory pressure)</td>
      <td>3-4+</td>
    </tr>
    <tr>
      <td>Throughput</td>
      <td>~0.1 req/s</td>
      <td>1-10 req/s</td>
    </tr>
    <tr>
      <td>Model size</td>
      <td>4-8B parameters</td>
      <td>13-70B parameters</td>
    </tr>
  </tbody>
</table>

<p><strong>GPU deployment options:</strong></p>

<ol>
  <li><strong>NVIDIA GPU server</strong> (RTX 4090, A100): Run vLLM or TGI for high-throughput local inference</li>
  <li><strong>Cloud GPU instances on a private network</strong>: AWS/GCP instances in a private VPC, so data never leaves controlled infrastructure</li>
  <li><strong>Apple M4 Max/Ultra</strong>: 128GB unified memory enables 70B models with acceptable latency</li>
</ol>

<p>The shadow mode quality metrics would likely improve significantly with larger local models, potentially enabling automatic traffic promotion.</p>

<h4 id="824-kubernetes-deployment">8.2.4 Kubernetes Deployment</h4>

<p>The current Docker Compose stack is suitable for single-node deployment. Production deployment requires:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────────────────────────────────────────────────────────┐
│                    Kubernetes Cluster                       │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │  Sentinel   │  │  Sentinel   │  │  Sentinel   │          │
│  │  Pod (HPA)  │  │  Pod (HPA)  │  │  Pod (HPA)  │          │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘          │
│         └────────────────┼────────────────┘                 │
│                          ▼                                  │
│                 ┌─────────────────┐                         │
│                 │ Redis (Session  │                         │
│                 │    State)       │                         │
│                 └─────────────────┘                         │
│                          │                                  │
│         ┌────────────────┼────────────────┐                 │
│         ▼                ▼                ▼                 │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │   Ollama    │  │   Ollama    │  │    vLLM     │          │
│  │  (gemma3)   │  │  (mistral)  │  │  (llama3)   │          │
│  │   Node 1    │  │   Node 2    │  │  GPU Node   │          │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
└─────────────────────────────────────────────────────────────┘
</code></pre></div></div>

<p><strong>K8s-specific requirements:</strong></p>
<ul>
  <li><strong>Horizontal Pod Autoscaler (HPA)</strong> for gateway pods based on request rate</li>
  <li><strong>Node affinity</strong> for GPU-accelerated inference pods</li>
  <li><strong>PodDisruptionBudget</strong> ensuring availability during rollouts</li>
  <li><strong>NetworkPolicy</strong> restricting egress to approved cloud endpoints</li>
  <li><strong>ServiceMesh</strong> (Istio/Linkerd) for mTLS between components</li>
</ul>

<h4 id="825-session-state-in-memory-vs-persistent-storage">8.2.5 Session State: In-Memory vs. Persistent Storage</h4>

<p>The current architecture stores session state in an in-memory data structure with TTL-based eviction. This is a deliberate design choice:</p>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>Pros</th>
      <th>Cons</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>In-memory (current)</strong></td>
      <td>Zero latency, no external dependencies, automatic cleanup</td>
      <td>Lost on restart, single-node only</td>
    </tr>
    <tr>
      <td><strong>Redis</strong></td>
      <td>Distributed, persistent, TTL support native</td>
      <td>Additional infra, latency (+1-2ms), security surface</td>
    </tr>
    <tr>
      <td><strong>PostgreSQL</strong></td>
      <td>ACID, queryable, audit trail</td>
      <td>Highest latency, schema management, backup complexity</td>
    </tr>
  </tbody>
</table>

<p><strong>Why in-memory is defensible:</strong></p>

<ol>
  <li><strong>Session data is ephemeral by design</strong> — 15-minute TTL means losing state on restart is acceptable for most use cases</li>
  <li><strong>Security through ephemerality</strong> — no persistent store means no data to exfiltrate, no backups to secure, no encryption-at-rest requirements</li>
  <li><strong>Operational simplicity</strong> — no Redis cluster to manage, no connection pooling, no failover logic</li>
  <li><strong>Latency</strong> — a hash-table lookup is O(1) at roughly 0.001ms; Redis adds a network round-trip per operation</li>
</ol>

<p><strong>When to move to Redis:</strong></p>

<ul>
  <li>Multi-pod deployment requiring shared session state</li>
  <li>Session TTL &gt;1 hour (memory pressure)</li>
  <li>Audit requirements mandating session history</li>
  <li>Graceful restart without session loss</li>
</ul>

<p>For the Redis migration path, the interface is already abstracted:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">SessionStore</span><span class="p">(</span><span class="n">Protocol</span><span class="p">):</span>
    <span class="k">async</span> <span class="k">def</span> <span class="nf">get</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">session_id</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Optional</span><span class="p">[</span><span class="n">Session</span><span class="p">]:</span> <span class="p">...</span>
    <span class="k">async</span> <span class="k">def</span> <span class="nf">set</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">session_id</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">session</span><span class="p">:</span> <span class="n">Session</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span> <span class="p">...</span>
    <span class="k">async</span> <span class="k">def</span> <span class="nf">delete</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">session_id</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span> <span class="p">...</span>

<span class="c1"># Swap implementations without changing business logic
</span><span class="n">store</span> <span class="o">=</span> <span class="n">RedisSessionStore</span><span class="p">(</span><span class="n">redis_url</span><span class="p">)</span> <span class="k">if</span> <span class="n">USE_REDIS</span> <span class="k">else</span> <span class="n">InMemorySessionStore</span><span class="p">()</span>
</code></pre></div></div>
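<p>A minimal in-memory implementation of that protocol might look like the sketch below, with lazy TTL eviction on read. The class and parameter names are assumptions; the real gateway's <code class="language-plaintext highlighter-rouge">Session</code> type carries more fields:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time

class InMemorySessionStore:
    """Dict-backed SessionStore with lazy TTL eviction (illustrative sketch)."""

    def __init__(self, ttl_seconds=900):  # 15-minute TTL, as described above
        self._ttl = ttl_seconds
        self._data = {}  # maps session_id to (expires_at, session)

    async def get(self, session_id):
        entry = self._data.get(session_id)
        if entry is None:
            return None
        expires_at, session = entry
        if time.monotonic() &gt;= expires_at:  # expired: evict lazily
            del self._data[session_id]
            return None
        return session

    async def set(self, session_id, session):
        self._data[session_id] = (time.monotonic() + self._ttl, session)

    async def delete(self, session_id):
        self._data.pop(session_id, None)
</code></pre></div></div>

<p>Because eviction happens on read, a long-idle store can hold expired entries; a production version would add a periodic sweep, which is exactly the kind of bookkeeping Redis's native TTL support eliminates.</p>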

<hr />

<h3 id="83-target-applications">8.3 Target Applications</h3>

<p>inference-sentinel addresses use cases across multiple organizational functions:</p>

<h4 id="healthcare--life-sciences">Healthcare &amp; Life Sciences</h4>

<table>
  <thead>
    <tr>
      <th>Use Case</th>
      <th>Privacy Concern</th>
      <th>Sentinel Solution</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Clinical decision support</td>
      <td>PHI in patient queries</td>
      <td>Tier 3 classification for health records</td>
    </tr>
    <tr>
      <td>Drug interaction lookup</td>
      <td>Patient medication history</td>
      <td>Session locking after first PHI exposure</td>
    </tr>
    <tr>
      <td>Medical transcription</td>
      <td>Dictated patient notes</td>
      <td>Local inference for all transcription</td>
    </tr>
  </tbody>
</table>

<p><strong>Regulatory context:</strong> HIPAA requires technical safeguards for PHI. A gateway that provably routes PHI to local-only inference provides auditable compliance.</p>

<h4 id="financial-services">Financial Services</h4>

<table>
  <thead>
    <tr>
      <th>Use Case</th>
      <th>Privacy Concern</th>
      <th>Sentinel Solution</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Customer service chatbots</td>
      <td>Account numbers, SSNs</td>
      <td>Tier 3 detection with immediate local routing</td>
    </tr>
    <tr>
      <td>Fraud analysis</td>
      <td>Transaction patterns</td>
      <td>Shadow mode for quality validation before local promotion</td>
    </tr>
    <tr>
      <td>Document summarization</td>
      <td>Contracts with PII</td>
      <td>Context handoff preserving conversation flow</td>
    </tr>
  </tbody>
</table>

<p><strong>Regulatory context:</strong> PCI-DSS, SOX, and GLBA impose data handling requirements that a privacy-aware gateway can enforce at the infrastructure layer.</p>

<h4 id="legal--professional-services">Legal &amp; Professional Services</h4>

<table>
  <thead>
    <tr>
      <th>Use Case</th>
      <th>Privacy Concern</th>
      <th>Sentinel Solution</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Contract review</td>
      <td>Client confidential information</td>
      <td>Tier 2+ routing for all legal documents</td>
    </tr>
    <tr>
      <td>Legal research</td>
      <td>Case details, party names</td>
      <td>Session stickiness ensuring full conversation stays local</td>
    </tr>
    <tr>
      <td>E-discovery</td>
      <td>Privileged communications</td>
      <td>Mandatory local inference for attorney-client content</td>
    </tr>
  </tbody>
</table>

<h4 id="enterprise-it--security">Enterprise IT &amp; Security</h4>

<table>
  <thead>
    <tr>
      <th>Use Case</th>
      <th>Privacy Concern</th>
      <th>Sentinel Solution</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Code review assistants</td>
      <td>Proprietary source code</td>
      <td>Internal URL and project code detection (Tier 1)</td>
    </tr>
    <tr>
      <td>Security log analysis</td>
      <td>Infrastructure details, credentials</td>
      <td>API key and credential pattern detection (Tier 3)</td>
    </tr>
    <tr>
      <td>Internal knowledge base Q&amp;A</td>
      <td>Employee PII, org structure</td>
      <td>Configurable routing based on data classification</td>
    </tr>
  </tbody>
</table>

<h4 id="human-resources">Human Resources</h4>

<table>
  <thead>
    <tr>
      <th>Use Case</th>
      <th>Privacy Concern</th>
      <th>Sentinel Solution</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Resume screening</td>
      <td>Candidate PII</td>
      <td>Local inference for all recruitment workflows</td>
    </tr>
    <tr>
      <td>Employee feedback analysis</td>
      <td>Performance data</td>
      <td>Session locking after employee identifier detection</td>
    </tr>
    <tr>
      <td>Compensation benchmarking</td>
      <td>Salary information</td>
      <td>Tier 3 classification for financial PII</td>
    </tr>
  </tbody>
</table>

<hr />

<h3 id="84-broader-impact">8.4 Broader Impact</h3>

<p>The proliferation of LLM-powered applications creates a tension between capability and privacy. Cloud-hosted models offer superior performance but require transmitting potentially sensitive data to third parties. Local models preserve privacy but sacrifice quality and increase operational burden.</p>

<p>inference-sentinel demonstrates that <strong>this is not a binary choice</strong>. By making routing decisions at inference time based on content classification, organizations can:</p>

<ol>
  <li><strong>Use cloud models for the 50%+ of traffic that contains no sensitive data</strong> — capturing the quality and cost benefits</li>
  <li><strong>Enforce local-only inference for genuinely sensitive content</strong> — preserving privacy guarantees</li>
  <li><strong>Measure the trade-off empirically</strong> — shadow mode quantifies exactly what quality you sacrifice for privacy</li>
</ol>

<p>This is a fundamentally different approach from “all cloud” or “all local” architectures. It treats privacy as a first-class routing dimension, alongside latency and cost.</p>

<hr />

<h3 id="85-closing-thoughts">8.5 Closing Thoughts</h3>

<p>Building inference-sentinel during my job search taught me more about LLM infrastructure than any tutorial could. Debugging why Ollama returns 404 (model name mismatch), why Grafana dashboards show “Value” instead of model names (missing metric labels), why round-robin wasn’t working (YAML override precedence) — these are the unglamorous details that separate working systems from prototypes.</p>

<p>The code is open source. The architecture is documented. The benchmarks are reproducible.</p>

<p>If you’re building LLM applications that handle sensitive data, I hope this work provides a useful reference — or at least saves you from repeating my mistakes.</p>

<hr />

<h2 id="appendix-a-raw-data">Appendix A: Raw Data</h2>

<h3 id="a1-classification-misclassifications">A.1 Classification Misclassifications</h3>

<p>All 5 errors follow the pattern:</p>
<ul>
  <li><strong>Input:</strong> Health record with insurance ID format <code class="language-plaintext highlighter-rouge">[A-Z]{3}\d{9}</code></li>
  <li><strong>Expected entities:</strong> <code class="language-plaintext highlighter-rouge">PERSON_NAME</code> (not detected by BERT NER in medical context)</li>
  <li><strong>Detected entities:</strong> <code class="language-plaintext highlighter-rouge">[]</code></li>
</ul>
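<p>Since the misses are systematic, the cheapest fix is a domain-specific regex fast-path rather than a heavier NER model. A hypothetical Tier 3 pattern for this format (the pattern name is mine, not from the codebase):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re

# Hypothetical Tier 3 addition: health insurance member IDs in the
# three-letters-plus-nine-digits format the NER model missed.
INSURANCE_ID = re.compile(r'\b[A-Z]{3}\d{9}\b')

def has_insurance_id(text):
    return INSURANCE_ID.search(text) is not None
</code></pre></div></div>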

<h3 id="a2-routing-errors">A.2 Routing Errors</h3>

<p>All 17 errors: <code class="language-plaintext highlighter-rouge">HTTP 503: No healthy local backends available</code></p>

<p>Distribution: 8 Tier 2, 9 Tier 3 (local-routed traffic only)</p>

<h3 id="a3-entity-detection-summary">A.3 Entity Detection Summary</h3>

<table>
  <thead>
    <tr>
      <th>Entity Type</th>
      <th>Count</th>
      <th>Tier Assignment</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>SSN</td>
      <td>24</td>
      <td>3</td>
    </tr>
    <tr>
      <td>Credit Card</td>
      <td>12</td>
      <td>3</td>
    </tr>
    <tr>
      <td>Health Record (MRN)</td>
      <td>6</td>
      <td>3</td>
    </tr>
    <tr>
      <td>Bank Account</td>
      <td>6</td>
      <td>3</td>
    </tr>
    <tr>
      <td>Email</td>
      <td>18</td>
      <td>2</td>
    </tr>
    <tr>
      <td>Phone</td>
      <td>14</td>
      <td>2</td>
    </tr>
    <tr>
      <td>Address</td>
      <td>16</td>
      <td>2</td>
    </tr>
    <tr>
      <td>Internal URL</td>
      <td>28</td>
      <td>1</td>
    </tr>
    <tr>
      <td>Project Code</td>
      <td>12</td>
      <td>1</td>
    </tr>
    <tr>
      <td>Employee ID</td>
      <td>10</td>
      <td>1</td>
    </tr>
  </tbody>
</table>

<hr />

<p><strong>GitHub:</strong> <a href="https://github.com/kraghavan/inference-sentinel">github.com/kraghavan/inference-sentinel</a></p>

<p><em>Questions, feedback, or collaboration ideas? Connect on <a href="https://linkedin.com/in/karthikaraghavan">LinkedIn</a>.</em></p>

<p><strong>Last updated:</strong> March 24, 2026</p>]]></content><author><name>Karthika Raghavan</name></author><category term="llm" /><category term="infrastructure" /><category term="benchmarks" /><category term="python" /><category term="privacy" /><category term="observability" /><category term="distributed-systems" /><category term="evaluation" /><summary type="html"><![CDATA[Part 2 of 2: Empirical evaluation of classification accuracy, routing performance, and cost attribution — with honest analysis of failure modes]]></summary></entry><entry><title type="html">Building a Privacy-Aware LLM Gateway: Architecture Deep-Dive</title><link href="https://kraghavan.github.io/llm/infrastructure/smart%20gateway/python/2026/03/20/inference-sentinel-architecture2.html" rel="alternate" type="text/html" title="Building a Privacy-Aware LLM Gateway: Architecture Deep-Dive" /><published>2026-03-20T00:00:00+00:00</published><updated>2026-03-20T00:00:00+00:00</updated><id>https://kraghavan.github.io/llm/infrastructure/smart%20gateway/python/2026/03/20/inference-sentinel-architecture2</id><content type="html" xml:base="https://kraghavan.github.io/llm/infrastructure/smart%20gateway/python/2026/03/20/inference-sentinel-architecture2.html"><![CDATA[<h1 id="building-a-privacy-aware-llm-gateway-architecture-deep-dive">Building a Privacy-Aware LLM Gateway: Architecture Deep-Dive</h1>

<p><em>Part 1 of 2: Design decisions, trade-offs, and lessons from building inference-sentinel</em></p>

<hr />

<h2 id="why-i-built-this">Why I Built This</h2>

<p>During my job search, I wanted a project that would demonstrate distributed systems thinking — not just “I can call an API,” but “I can design a system that handles real production concerns.”</p>

<p>The problem I chose: <strong>How do you route LLM prompts intelligently based on data sensitivity, while maintaining session continuity and observability?</strong></p>

<p>This post walks through every architectural decision I made building <a href="https://github.com/kraghavan/inference-sentinel">inference-sentinel</a>, including the trade-offs I considered and the mistakes I made along the way.</p>

<hr />

<h2 id="system-overview">System Overview</h2>

<h3 id="design-diagram-generated-by-notebookllm">Design Diagram Generated by NotebookLM</h3>
<p><img src="/assets/images/inference-sentinel/Inference-Sentinel.png" alt="Inference Sentinel Architecture" /> <em>The Inference Sentinel architecture: privacy-aware routing with session stickiness and closed-loop control</em></p>

<h3 id="components">Components</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────────────────────────────────────────────────────────────────────────┐
│                              Application Layer                              │
│                         (Any OpenAI-compatible client)                      │
└─────────────────────────────────────┬───────────────────────────────────────┘
                                      │ POST /v1/inference
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                            inference-sentinel                               │
│                                                                             │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                         Request Pipeline                             │   │
│  │                                                                      │   │
│  │  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐               │   │
│  │  │   Hybrid    │    │   Session   │    │   Backend   │               │   │
│  │  │ Classifier  │───▶│   Manager   │───▶│   Manager   │               │   │
│  │  │             │    │             │    │             │               │   │
│  │  │ • Regex     │    │ • Trapdoor  │    │ • Selection │               │   │
│  │  │ • NER       │    │ • Buffer    │    │ • Failover  │               │   │
│  │  │ • Tier 0-3  │    │ • Handoff   │    │ • Health    │               │   │
│  │  └─────────────┘    └─────────────┘    └─────────────┘               │   │
│  │         │                  │                  │                      │   │
│  │         ▼                  ▼                  ▼                      │   │
│  │  ┌─────────────────────────────────────────────────────────────────┐ │   │
│  │  │                    Telemetry Collector                          │ │   │
│  │  │         Metrics │ Traces │ Logs (OpenTelemetry-native)          │ │   │
│  │  └─────────────────────────────────────────────────────────────────┘ │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                      Background Services                             │   │
│  │                                                                      │   │
│  │  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐               │   │
│  │  │   Shadow    │    │   Closed    │    │   Health    │               │   │
│  │  │   Mode      │    │   Loop      │    │   Check     │               │   │
│  │  │             │    │ Controller  │    │   Loop      │               │   │
│  │  │ • A/B test  │    │             │    │             │               │   │
│  │  │ • Compare   │    │ • Evaluate  │    │ • Poll      │               │   │
│  │  │ • Metrics   │    │ • Recommend │    │ • Failover  │               │   │
│  │  └─────────────┘    └─────────────┘    └─────────────┘               │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                    ┌─────────────────┴─────────────────┐
                    ▼                                   ▼
        ┌─────────────────────┐             ┌─────────────────────┐
        │    Local Backends   │             │   Cloud Backends    │
        │                     │             │                     │
        │  ┌───────────────┐  │             │  ┌───────────────┐  │
        │  │ Ollama        │  │             │  │ Anthropic     │  │
        │  │ • gemma3:4b   │  │             │  │ • Claude      │  │
        │  │ • mistral     │  │             │  │   Sonnet 4    │  │
        │  └───────────────┘  │             │  └───────────────┘  │
        │                     │             │  ┌───────────────┐  │
        │  (Round-robin at    │             │  │ Google        │  │
        │   equal priority)   │             │  │ • Gemini      │  │
        │                     │             │  │   2.0 Flash   │  │
        └─────────────────────┘             │  └───────────────┘  │
                                            │                     │
                                            │  (Round-robin or    │
                                            │   primary/fallback) │
                                            └─────────────────────┘
</code></pre></div></div>

<hr />

<h2 id="component-1-the-hybrid-classifier">Component 1: The Hybrid Classifier</h2>

<h3 id="the-problem">The Problem</h3>

<p>I needed to classify prompts into privacy tiers quickly and accurately. The obvious approaches each have limitations:</p>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>Latency</th>
      <th>Accuracy</th>
      <th>Limitations</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Regex only</td>
      <td>&lt;1ms</td>
      <td>~85%</td>
      <td>Misses context, false negatives</td>
    </tr>
    <tr>
      <td>LLM-based</td>
      <td>200-500ms</td>
      <td>~95%</td>
      <td>Too slow for a gateway</td>
    </tr>
    <tr>
      <td>NER only</td>
      <td>15-50ms</td>
      <td>~90%</td>
      <td>Heavy for simple patterns</td>
    </tr>
  </tbody>
</table>

<h3 id="my-solution-two-stage-pipeline">My Solution: Two-Stage Pipeline</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">HybridClassifier</span><span class="p">:</span>
    <span class="s">"""
    Stage 1: Fast-path regex for obvious patterns
    Stage 2: NER for context-dependent detection (optional)
    """</span>
    
    <span class="k">async</span> <span class="k">def</span> <span class="nf">classify</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">HybridResult</span><span class="p">:</span>
        <span class="c1"># Stage 1: Regex (always runs, ~0.2ms)
</span>        <span class="n">regex_result</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_regex</span><span class="p">.</span><span class="n">classify</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
        
        <span class="c1"># Tier 3 detected by regex — skip NER, route local immediately
</span>        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">skip_ner_on_tier3</span> <span class="ow">and</span> <span class="n">regex_result</span><span class="p">.</span><span class="n">tier</span> <span class="o">&gt;=</span> <span class="mi">3</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">HybridResult</span><span class="p">(</span>
                <span class="n">tier</span><span class="o">=</span><span class="n">regex_result</span><span class="p">.</span><span class="n">tier</span><span class="p">,</span>
                <span class="n">detection_method</span><span class="o">=</span><span class="s">"regex"</span><span class="p">,</span>
            <span class="p">)</span>
        
        <span class="c1"># Check if NER should run
</span>        <span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="p">.</span><span class="n">_should_run_ner</span><span class="p">(</span><span class="n">regex_result</span><span class="p">.</span><span class="n">tier</span><span class="p">):</span>
            <span class="k">return</span> <span class="n">HybridResult</span><span class="p">(</span>
                <span class="n">tier</span><span class="o">=</span><span class="n">regex_result</span><span class="p">.</span><span class="n">tier</span><span class="p">,</span>
                <span class="n">detection_method</span><span class="o">=</span><span class="s">"regex"</span><span class="p">,</span>
            <span class="p">)</span>
        
        <span class="c1"># Stage 2: NER for additional context (~15-30ms on CPU)
</span>        <span class="n">ner_result</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="p">.</span><span class="n">_ner</span><span class="p">.</span><span class="n">classify</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
        
        <span class="c1"># Merge entities from both, take highest tier
</span>        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">_merge_results</span><span class="p">(</span><span class="n">regex_result</span><span class="p">,</span> <span class="n">ner_result</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="design-decision-why-regex-first">Design Decision: Why Regex First?</h3>

<p><strong>Trade-off considered:</strong> I could run NER on everything for maximum accuracy, but:</p>

<ol>
  <li><strong>Latency budget:</strong> A gateway adds latency to every request. I targeted &lt;5ms for classification on the fast path (regex-only).</li>
  <li><strong>Resource cost:</strong> The NER classifier uses HuggingFace Transformers with BERT-based models (~400MB for the “fast” model, ~1.3GB for “accurate”). For high-throughput scenarios, this matters.</li>
  <li><strong>Diminishing returns:</strong> Regex catches 70%+ of Tier 3 patterns (SSN, credit cards, API keys) with near-zero cost.</li>
</ol>

<p><strong>The insight:</strong> Regex handles the “obvious” cases; NER handles the “subtle” cases (person names, organizations). Running both in parallel would be faster but would waste resources when regex already found Tier 3 data.</p>
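<p>The <code class="language-plaintext highlighter-rouge">_merge_results</code> step referenced above reduces to a most-restrictive-wins rule: take the highest tier and the union of detected entities. A sketch, using assumed lightweight result types in place of the real classes:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataclasses import dataclass, field

@dataclass
class StageResult:  # assumed shape; the real result classes carry more fields
    tier: int
    entities: frozenset = field(default_factory=frozenset)

def merge_results(regex_result, ner_result):
    """Most-restrictive-wins: highest tier, union of detected entities."""
    return StageResult(
        tier=max(regex_result.tier, ner_result.tier),
        entities=regex_result.entities | ner_result.entities,
    )
</code></pre></div></div>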

<h3 id="the-4-tier-taxonomy">The 4-Tier Taxonomy</h3>

<table>
  <thead>
    <tr>
      <th>Tier</th>
      <th>Name</th>
      <th>Detection Method</th>
      <th>Examples</th>
      <th>Routing</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0</td>
      <td>PUBLIC</td>
      <td>Default (no matches)</td>
      <td>“Explain quantum computing”</td>
      <td>Cloud</td>
    </tr>
    <tr>
      <td>1</td>
      <td>INTERNAL</td>
      <td>Regex patterns</td>
      <td>Internal URLs, project codes</td>
      <td>Cloud</td>
    </tr>
    <tr>
      <td>2</td>
      <td>CONFIDENTIAL</td>
      <td>NER + Regex</td>
      <td>Person names (NER), emails, phones (regex)</td>
      <td>Local (configurable)</td>
    </tr>
    <tr>
      <td>3</td>
      <td>RESTRICTED</td>
      <td>Regex patterns</td>
      <td>SSN, credit cards, health data</td>
      <td>Local (enforced)</td>
    </tr>
  </tbody>
</table>
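<p>The taxonomy maps naturally onto an <code class="language-plaintext highlighter-rouge">IntEnum</code> plus a routing rule. A sketch of how the table's routing column might be encoded (the flag name and Tier 2 default are assumptions):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from enum import IntEnum

class PrivacyTier(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3

def route(tier, tier2_local=True):
    """Return "local" or "cloud" per the table: Tier 2 is configurable,
    Tier 3 is always enforced local."""
    if tier &gt;= PrivacyTier.RESTRICTED:
        return "local"
    if tier == PrivacyTier.CONFIDENTIAL:
        return "local" if tier2_local else "cloud"
    return "cloud"
</code></pre></div></div>

<p>Using an <code class="language-plaintext highlighter-rouge">IntEnum</code> keeps tiers ordered, so "most restrictive wins" comparisons like <code class="language-plaintext highlighter-rouge">max(a, b)</code> work directly.</p>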

<h3 id="design-decision-why-4-tiers">Design Decision: Why 4 Tiers?</h3>

<p><strong>Trade-off considered:</strong> A simpler scheme would be binary (sensitive/not-sensitive); a more granular one might use 10+ categories.</p>

<p><strong>Why 4:</strong></p>
<ol>
  <li>Maps to common enterprise data classification schemes (Public/Internal/Confidential/Restricted)</li>
  <li>Allows nuanced routing rules (Tier 2 is configurable, Tier 3 is enforced)</li>
  <li>Shadow mode can target specific tiers for A/B testing</li>
  <li>Simple enough to reason about, granular enough to be useful</li>
</ol>

<h3 id="regex-pattern-design">Regex Pattern Design</h3>

<p>Patterns are loaded from <code class="language-plaintext highlighter-rouge">config/privacy_taxonomy.yaml</code>, allowing customization without code changes:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Example Tier 3 patterns (loaded from YAML at startup)
</span><span class="n">TIER_3_PATTERNS</span> <span class="o">=</span> <span class="p">{</span>
    <span class="c1"># Social Security Numbers (various formats)
</span>    <span class="s">"ssn"</span><span class="p">:</span> <span class="sa">r</span><span class="s">'\b\d{3}-\d{2}-\d{4}\b'</span><span class="p">,</span>
    <span class="s">"ssn_spoken"</span><span class="p">:</span> <span class="sa">r</span><span class="s">'\b(?:social|ssn)[\s:]*\d{3}[\s-]?\d{2}[\s-]?\d{4}\b'</span><span class="p">,</span>
    
    <span class="c1"># Credit Cards (major issuer prefixes — regex alone can't verify the Luhn checksum)
</span>    <span class="s">"credit_card"</span><span class="p">:</span> <span class="sa">r</span><span class="s">'\b(?:4\d{3}|5[1-5]\d{2}|3[47]\d{2}|6(?:011|5\d{2}))'</span>
                   <span class="sa">r</span><span class="s">'[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b'</span><span class="p">,</span>
    
    <span class="c1"># API Keys (common prefixes)
</span>    <span class="s">"api_key"</span><span class="p">:</span> <span class="sa">r</span><span class="s">'\b(?:sk-|pk_|api[_-]?key[_-]?)[a-zA-Z0-9]{20,}\b'</span><span class="p">,</span>
    
    <span class="c1"># Health identifiers
</span>    <span class="s">"medical_record"</span><span class="p">:</span> <span class="sa">r</span><span class="s">'\b(?:MRN|patient[\s-]?id)[\s:]*[A-Z0-9]{6,}\b'</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>

<p><strong>Lesson learned:</strong> The regex <code class="language-plaintext highlighter-rouge">\b</code> word boundary is essential. Without it, <code class="language-plaintext highlighter-rouge">123-45-6789</code> matches inside <code class="language-plaintext highlighter-rouge">abc123-45-6789xyz</code>, causing false positives.</p>
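The boundary behavior is easy to verify in isolation (the "ticket ref" string is a made-up example):

```python
import re

# Word-boundary check with the SSN pattern from the taxonomy above
ssn_bounded = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
ssn_unbounded = re.compile(r"\d{3}-\d{2}-\d{4}")

text = "ticket ref abc123-45-6789xyz"   # an internal ID, not an SSN

assert ssn_unbounded.search(text) is not None   # false positive without \b
assert ssn_bounded.search(text) is None         # \b suppresses it
assert ssn_bounded.search("SSN: 123-45-6789") is not None  # real SSNs still match
```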

<hr />

<h2 id="component-2-session-manager-the-one-way-trapdoor">Component 2: Session Manager (The One-Way Trapdoor)</h2>

<h3 id="the-problem-1">The Problem</h3>

<p>Per-request classification isn’t enough. Consider this conversation:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Turn 1: "Help me draft a cover letter"        → Tier 0 → Cloud ✓
Turn 2: "Here's my resume"                     → Tier 1 → Cloud ✓
Turn 3: "My SSN is 123-45-6789"               → Tier 3 → Local ✓
Turn 4: "Actually, format that differently"   → Tier 0 → Cloud? ❌
</code></pre></div></div>

<p>Turn 4 references Turn 3’s SSN implicitly. If we route it to cloud, we’ve leaked context.</p>

<h3 id="my-solution-one-way-state-machine">My Solution: One-Way State Machine</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                    ┌─────────────────────┐
                    │                     │
        ┌──────────▶│   CLOUD_ELIGIBLE    │
        │           │                     │
        │           └──────────┬──────────┘
        │                      │
        │         Tier 2/3 detected in any turn
        │                      │
        │                      ▼
        │           ┌─────────────────────┐
   NEVER            │                     │
        └───────────┤    LOCAL_LOCKED     │
                    │                     │
                    └─────────────────────┘
</code></pre></div></div>

<p><strong>Key property:</strong> The transition is irreversible. Once a session is <code class="language-plaintext highlighter-rouge">LOCAL_LOCKED</code>, no subsequent classification — even Tier 0 — can unlock it.</p>
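Condensed to its essence, the trapdoor is a few lines of state logic. This is a sketch with metadata fields dropped, not the full implementation:

```python
from dataclasses import dataclass
from enum import Enum

class SessionState(str, Enum):
    CLOUD_ELIGIBLE = "cloud_eligible"
    LOCAL_LOCKED = "local_locked"

@dataclass
class Session:
    state: SessionState = SessionState.CLOUD_ELIGIBLE

    def observe(self, tier: int, threshold: int = 2) -> None:
        if tier >= threshold:               # lock on Tier 2+
            self.state = SessionState.LOCAL_LOCKED
        # deliberately no branch that sets state back to CLOUD_ELIGIBLE

s = Session()
s.observe(0)
assert s.state is SessionState.CLOUD_ELIGIBLE   # Tier 0: still eligible
s.observe(3)
assert s.state is SessionState.LOCAL_LOCKED     # Tier 3: trapdoor fires
s.observe(0)
assert s.state is SessionState.LOCAL_LOCKED     # later Tier 0: stays locked
```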

<h3 id="implementation">Implementation</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">SessionState</span><span class="p">(</span><span class="nb">str</span><span class="p">,</span> <span class="n">Enum</span><span class="p">):</span>
    <span class="s">"""Session routing state."""</span>
    <span class="n">CLOUD_ELIGIBLE</span> <span class="o">=</span> <span class="s">"cloud_eligible"</span>  <span class="c1"># Can route to cloud
</span>    <span class="n">LOCAL_LOCKED</span> <span class="o">=</span> <span class="s">"local_locked"</span>      <span class="c1"># Must route to local (PII detected)
</span>

<span class="o">@</span><span class="n">dataclass</span>
<span class="k">class</span> <span class="nc">SessionInfo</span><span class="p">:</span>
    <span class="s">"""Session metadata and state."""</span>
    
    <span class="n">session_id</span><span class="p">:</span> <span class="nb">str</span>  <span class="c1"># SHA-256 hash of client IP + daily salt
</span>    <span class="n">state</span><span class="p">:</span> <span class="n">SessionState</span> <span class="o">=</span> <span class="n">SessionState</span><span class="p">.</span><span class="n">CLOUD_ELIGIBLE</span>
    <span class="n">created_at</span><span class="p">:</span> <span class="n">datetime</span> <span class="o">=</span> <span class="n">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="k">lambda</span><span class="p">:</span> <span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">(</span><span class="n">timezone</span><span class="p">.</span><span class="n">utc</span><span class="p">))</span>
    <span class="n">last_activity</span><span class="p">:</span> <span class="n">datetime</span> <span class="o">=</span> <span class="n">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="k">lambda</span><span class="p">:</span> <span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">(</span><span class="n">timezone</span><span class="p">.</span><span class="n">utc</span><span class="p">))</span>
    
    <span class="c1"># Lock metadata
</span>    <span class="n">local_locked_at</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">datetime</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span>
    <span class="n">lock_trigger_tier</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span>
    <span class="n">lock_trigger_entities</span><span class="p">:</span> <span class="nb">list</span> <span class="o">=</span> <span class="n">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="nb">list</span><span class="p">)</span>
    
    <span class="c1"># Backend stickiness (maintains round-robin consistency within session)
</span>    <span class="n">cloud_backend</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span>   <span class="c1"># "anthropic" or "google"
</span>    <span class="n">local_backend</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span>   <span class="c1"># "gemma" or "mistral"
</span>    
    <span class="k">def</span> <span class="nf">lock_to_local</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">tier</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">entities</span><span class="p">:</span> <span class="nb">list</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="s">"""Lock session to local routing (one-way trapdoor)."""</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">state</span> <span class="o">==</span> <span class="n">SessionState</span><span class="p">.</span><span class="n">LOCAL_LOCKED</span><span class="p">:</span>
            <span class="k">return</span>  <span class="c1"># Already locked — no-op
</span>        
        <span class="bp">self</span><span class="p">.</span><span class="n">state</span> <span class="o">=</span> <span class="n">SessionState</span><span class="p">.</span><span class="n">LOCAL_LOCKED</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">local_locked_at</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">(</span><span class="n">timezone</span><span class="p">.</span><span class="n">utc</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">lock_trigger_tier</span> <span class="o">=</span> <span class="n">tier</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">lock_trigger_entities</span> <span class="o">=</span> <span class="n">entities</span><span class="p">.</span><span class="n">copy</span><span class="p">()</span>
    
    <span class="o">@</span><span class="n">property</span>
    <span class="k">def</span> <span class="nf">is_local_locked</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
        <span class="s">"""Whether the one-way trapdoor has already fired."""</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">state</span> <span class="o">==</span> <span class="n">SessionState</span><span class="p">.</span><span class="n">LOCAL_LOCKED</span>


<span class="k">class</span> <span class="nc">SessionManager</span><span class="p">:</span>
    <span class="s">"""Manages session state with TTL expiration."""</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">ttl_seconds</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">900</span><span class="p">,</span>           <span class="c1"># 15 minutes default
</span>        <span class="n">lock_threshold_tier</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">2</span><span class="p">,</span>      <span class="c1"># Tier 2+ triggers lock
</span>        <span class="n">buffer_max_turns</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">5</span><span class="p">,</span>
        <span class="n">buffer_max_chars</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">4000</span><span class="p">,</span>
    <span class="p">):</span>
        <span class="c1"># Sessions and buffers stored in separate TTL caches
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">_sessions</span><span class="p">:</span> <span class="n">TTLCache</span> <span class="o">=</span> <span class="n">TTLCache</span><span class="p">(</span><span class="n">maxsize</span><span class="o">=</span><span class="mi">10000</span><span class="p">,</span> <span class="n">ttl</span><span class="o">=</span><span class="n">ttl_seconds</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_buffers</span><span class="p">:</span> <span class="n">TTLCache</span> <span class="o">=</span> <span class="n">TTLCache</span><span class="p">(</span><span class="n">maxsize</span><span class="o">=</span><span class="mi">10000</span><span class="p">,</span> <span class="n">ttl</span><span class="o">=</span><span class="n">ttl_seconds</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_lock_threshold_tier</span> <span class="o">=</span> <span class="n">lock_threshold_tier</span>
    
    <span class="k">async</span> <span class="k">def</span> <span class="nf">update_session_state_async</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span> 
        <span class="n">client_ip</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> 
        <span class="n">tier</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
        <span class="n">entities</span><span class="p">:</span> <span class="nb">list</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">SessionInfo</span><span class="p">:</span>
        <span class="s">"""Update session state based on classification result.
        
        Applies the one-way trapdoor: if tier &gt;= threshold, locks to local.
        """</span>
        <span class="n">session</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="p">.</span><span class="n">get_or_create_session_async</span><span class="p">(</span><span class="n">client_ip</span><span class="p">)</span>
        
        <span class="c1"># Check if this classification triggers a lock
</span>        <span class="k">if</span> <span class="n">tier</span> <span class="o">&gt;=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_lock_threshold_tier</span> <span class="ow">and</span> <span class="ow">not</span> <span class="n">session</span><span class="p">.</span><span class="n">is_local_locked</span><span class="p">:</span>
            <span class="n">session</span><span class="p">.</span><span class="n">lock_to_local</span><span class="p">(</span><span class="n">tier</span><span class="p">,</span> <span class="n">entities</span><span class="p">)</span>
        
        <span class="k">return</span> <span class="n">session</span>
</code></pre></div></div>

<h3 id="design-decision-why-hash-session-ids-with-daily-salt">Design Decision: Why Hash Session IDs with Daily Salt?</h3>

<p><strong>Trade-off considered:</strong> Store client IPs as-is for debuggability vs. hash them for privacy.</p>

<p><strong>Why hash with rotating salt:</strong></p>
<ol>
  <li>Client IPs are PII — storing them raw creates compliance risk</li>
  <li>Daily salt rotation prevents cross-day user tracking</li>
  <li>If the session store is compromised, hashed IDs reveal nothing</li>
  <li>Lookup is O(1) either way with a good hash</li>
</ol>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">DailySalt</span><span class="p">:</span>
    <span class="s">"""Manages daily-rotating salt for session ID generation."""</span>
    
    <span class="k">def</span> <span class="nf">hash_with_salt</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">value</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="s">"""Hash a value with current salt."""</span>
        <span class="n">combined</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">value</span><span class="si">}</span><span class="s">:</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">_current_salt</span><span class="si">}</span><span class="s">"</span>
        <span class="k">return</span> <span class="n">hashlib</span><span class="p">.</span><span class="n">sha256</span><span class="p">(</span><span class="n">combined</span><span class="p">.</span><span class="n">encode</span><span class="p">()).</span><span class="n">hexdigest</span><span class="p">()</span>


<span class="k">def</span> <span class="nf">generate_session_id</span><span class="p">(</span><span class="n">client_ip</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="s">"""Generate session ID from client IP with daily-rotating salt."""</span>
    <span class="n">salt</span> <span class="o">=</span> <span class="n">get_daily_salt</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">salt</span><span class="p">.</span><span class="n">hash_with_salt</span><span class="p">(</span><span class="n">client_ip</span><span class="p">)</span>
</code></pre></div></div>
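The three properties claimed above can be checked directly. The IP and salt values here are illustrative, and the salt is passed explicitly where the real code calls `get_daily_salt()`:

```python
import hashlib

def session_id(client_ip: str, salt: str) -> str:
    # Same scheme as DailySalt.hash_with_salt: SHA-256 over "ip:salt"
    return hashlib.sha256(f"{client_ip}:{salt}".encode()).hexdigest()

today = session_id("203.0.113.7", "salt-2026-04-19")
same_day = session_id("203.0.113.7", "salt-2026-04-19")
next_day = session_id("203.0.113.7", "salt-2026-04-20")

assert today == same_day           # stable within a day → O(1) lookups work
assert today != next_day           # rotation breaks cross-day tracking
assert "203.0.113.7" not in today  # the raw IP never appears in storage
```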

<h3 id="design-decision-ttl-and-eviction">Design Decision: TTL and Eviction</h3>

<p><strong>Trade-off considered:</strong> Long TTL preserves sessions across browser refreshes. Short TTL reduces memory footprint.</p>

<p><strong>My choice:</strong> 15 minutes (900 seconds), configurable.</p>

<p><strong>Reasoning:</strong></p>
<ul>
  <li>Most conversations complete within 15 minutes</li>
  <li>Matches typical “idle timeout” for sensitive applications</li>
  <li>Memory overhead: ~2KB per session × 10K sessions = 20MB (acceptable)</li>
  <li>Uses <code class="language-plaintext highlighter-rouge">TTLCache</code> from <code class="language-plaintext highlighter-rouge">cachetools</code> for automatic expiration</li>
</ul>
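The expiry semantics `TTLCache` provides can be sketched in a few lines of stdlib Python. This is a toy stand-in, not how `cachetools` is implemented; a fake clock replaces `time.monotonic()` so the example runs instantly:

```python
class TinyTTLCache:
    """Toy TTL cache: entries vanish once their age exceeds the TTL."""

    def __init__(self, ttl, clock):
        self._ttl, self._clock, self._store = ttl, clock, {}

    def __setitem__(self, key, value):
        self._store[key] = (value, self._clock())

    def get(self, key):
        item = self._store.get(key)
        if item is None:
            return None
        value, stored_at = item
        if self._clock() - stored_at > self._ttl:
            del self._store[key]   # lazy eviction on read
            return None
        return value


class FakeClock:
    def __init__(self):
        self.t = 0.0

    def __call__(self):
        return self.t


clock = FakeClock()
cache = TinyTTLCache(ttl=900, clock=clock)   # same 15-minute TTL as above
cache["session-abc"] = "cloud_eligible"

clock.t = 899
assert cache.get("session-abc") == "cloud_eligible"   # still within TTL

clock.t = 901
assert cache.get("session-abc") is None               # expired and evicted
```

Note the security angle: expiry also means a `LOCAL_LOCKED` session eventually resets, which is fine because a fresh session starts with no sensitive context.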

<hr />

<h2 id="component-3-context-handoff">Component 3: Context Handoff</h2>

<h3 id="the-problem-2">The Problem</h3>

<p>When a session locks mid-conversation, the local model has no context. It doesn’t know what the user was asking about.</p>

<p>But we can’t just forward the full conversation history — it contains the very PII we’re trying to protect.</p>

<h3 id="my-solution-rolling-buffer-with-dual-bounding">My Solution: Rolling Buffer with Dual Bounding</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">dataclass</span>
<span class="k">class</span> <span class="nc">BufferEntry</span><span class="p">:</span>
    <span class="s">"""Single interaction in the buffer."""</span>
    
    <span class="n">role</span><span class="p">:</span> <span class="nb">str</span>  <span class="c1"># "user" or "assistant"
</span>    <span class="n">content</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">timestamp</span><span class="p">:</span> <span class="n">datetime</span> <span class="o">=</span> <span class="n">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="k">lambda</span><span class="p">:</span> <span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">(</span><span class="n">timezone</span><span class="p">.</span><span class="n">utc</span><span class="p">))</span>
    <span class="n">tier</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">0</span>  <span class="c1"># Classification tier when added
</span>    <span class="n">scrubbed</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">False</span>  <span class="c1"># Whether content was scrubbed before storage
</span>    <span class="n">char_count</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">field</span><span class="p">(</span><span class="n">init</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">__post_init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">char_count</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">content</span><span class="p">)</span>


<span class="k">class</span> <span class="nc">RollingBuffer</span><span class="p">:</span>
    <span class="s">"""Rolling buffer with dual bounding.
    
    Bounded by:
    1. Max turns (number of user+assistant pairs)
    2. Max total characters (prevents massive payloads)
    
    When either limit is exceeded, oldest entries are evicted.
    """</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">max_turns</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">5</span><span class="p">,</span>
        <span class="n">max_chars</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">4000</span><span class="p">,</span>  <span class="c1"># ~1000 tokens
</span>        <span class="n">scrub_before_store</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span><span class="p">,</span>
    <span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_max_turns</span> <span class="o">=</span> <span class="n">max_turns</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_max_chars</span> <span class="o">=</span> <span class="n">max_chars</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_entries</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">BufferEntry</span><span class="p">]</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_total_chars</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_lock</span> <span class="o">=</span> <span class="n">threading</span><span class="p">.</span><span class="n">Lock</span><span class="p">()</span>
    
    <span class="o">@</span><span class="n">property</span>
    <span class="k">def</span> <span class="nf">turn_count</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
        <span class="s">"""Number of user turns currently buffered."""</span>
        <span class="k">return</span> <span class="nb">sum</span><span class="p">(</span><span class="mi">1</span> <span class="k">for</span> <span class="n">e</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">_entries</span> <span class="k">if</span> <span class="n">e</span><span class="p">.</span><span class="n">role</span> <span class="o">==</span> <span class="s">"user"</span><span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">add</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">role</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">content</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">tier</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span>
        <span class="n">scrubbed_content</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="s">"""Add an entry, evicting oldest if limits exceeded."""</span>
        <span class="n">final_content</span> <span class="o">=</span> <span class="n">scrubbed_content</span> <span class="k">if</span> <span class="n">scrubbed_content</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span> <span class="k">else</span> <span class="n">content</span>
        <span class="n">entry</span> <span class="o">=</span> <span class="n">BufferEntry</span><span class="p">(</span><span class="n">role</span><span class="o">=</span><span class="n">role</span><span class="p">,</span> <span class="n">content</span><span class="o">=</span><span class="n">final_content</span><span class="p">,</span> <span class="n">tier</span><span class="o">=</span><span class="n">tier</span><span class="p">)</span>
        
        <span class="k">with</span> <span class="bp">self</span><span class="p">.</span><span class="n">_lock</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_entries</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">entry</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_total_chars</span> <span class="o">+=</span> <span class="n">entry</span><span class="p">.</span><span class="n">char_count</span>
            
            <span class="c1"># Enforce character limit (evict oldest until under limit)
</span>            <span class="k">while</span> <span class="bp">self</span><span class="p">.</span><span class="n">_total_chars</span> <span class="o">&gt;</span> <span class="bp">self</span><span class="p">.</span><span class="n">_max_chars</span> <span class="ow">and</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_entries</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">1</span><span class="p">:</span>
                <span class="n">evicted</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_entries</span><span class="p">.</span><span class="n">pop</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
                <span class="bp">self</span><span class="p">.</span><span class="n">_total_chars</span> <span class="o">-=</span> <span class="n">evicted</span><span class="p">.</span><span class="n">char_count</span>
            
            <span class="c1"># Enforce turn limit (evict oldest until under limit)
</span>            <span class="k">while</span> <span class="bp">self</span><span class="p">.</span><span class="n">turn_count</span> <span class="o">&gt;</span> <span class="bp">self</span><span class="p">.</span><span class="n">_max_turns</span> <span class="ow">and</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_entries</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">1</span><span class="p">:</span>
                <span class="n">evicted</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_entries</span><span class="p">.</span><span class="n">pop</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
                <span class="bp">self</span><span class="p">.</span><span class="n">_total_chars</span> <span class="o">-=</span> <span class="n">evicted</span><span class="p">.</span><span class="n">char_count</span>
    
    <span class="k">def</span> <span class="nf">format_for_handoff</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="s">"""Format buffer with XML tags for injection into local model."""</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="p">.</span><span class="n">_entries</span><span class="p">:</span>
            <span class="k">return</span> <span class="s">""</span>
        
        <span class="k">with</span> <span class="bp">self</span><span class="p">.</span><span class="n">_lock</span><span class="p">:</span>
            <span class="n">lines</span> <span class="o">=</span> <span class="p">[]</span>
            <span class="k">for</span> <span class="n">entry</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">_entries</span><span class="p">:</span>
                <span class="n">role_tag</span> <span class="o">=</span> <span class="s">"user_message"</span> <span class="k">if</span> <span class="n">entry</span><span class="p">.</span><span class="n">role</span> <span class="o">==</span> <span class="s">"user"</span> <span class="k">else</span> <span class="s">"assistant_response"</span>
                <span class="n">lines</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"&lt;</span><span class="si">{</span><span class="n">role_tag</span><span class="si">}</span><span class="s">&gt;</span><span class="si">{</span><span class="n">entry</span><span class="p">.</span><span class="n">content</span><span class="si">}</span><span class="s">&lt;/</span><span class="si">{</span><span class="n">role_tag</span><span class="si">}</span><span class="s">&gt;"</span><span class="p">)</span>
            
            <span class="k">return</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">lines</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="design-decision-dual-bounding-vs-simple-truncation">Design Decision: Dual Bounding vs. Simple Truncation</h3>

<p><strong>Trade-off considered:</strong></p>

<table>
  <thead>
    <tr>
      <th>Option</th>
      <th>Pros</th>
      <th>Cons</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Simple truncation (<code class="language-plaintext highlighter-rouge">[:max_chars]</code>)</td>
      <td>Easy to implement</td>
      <td>Cuts mid-sentence, loses coherence</td>
    </tr>
    <tr>
      <td>Turn-based only</td>
      <td>Clean turn boundaries</td>
      <td>Unbounded if turns are long</td>
    </tr>
    <tr>
      <td>Dual bounding (turns + chars)</td>
      <td>Predictable memory, clean boundaries</td>
      <td>More complex eviction logic</td>
    </tr>
  </tbody>
</table>

<p><strong>My choice:</strong> Dual bounding — limit both turns (5) and total characters (4000).</p>

<p><strong>Reasoning:</strong></p>
<ol>
  <li>Evicting oldest complete entries preserves coherence (no mid-sentence cuts)</li>
  <li>Character limit prevents a single massive turn from blowing up context</li>
  <li>Both limits are configurable per deployment</li>
  <li>~4000 chars ≈ 1000 tokens, safe for even small local models</li>
</ol>
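The eviction rule reduces to a small loop. A standalone sketch (the `evict` helper is illustrative, condensed from the `add` method's two while-loops):

```python
from collections import deque

def evict(entries, max_turns=5, max_chars=4000):
    # Evict whole oldest entries until both the character budget
    # and the user-turn budget are satisfied; never drop the newest.
    entries = deque(entries)
    while len(entries) > 1 and (
        sum(len(content) for _, content in entries) > max_chars
        or sum(1 for role, _ in entries if role == "user") > max_turns
    ):
        entries.popleft()
    return list(entries)

history = [
    ("user", "a" * 3000),       # one long early turn...
    ("assistant", "b" * 900),
    ("user", "c" * 500),        # ...pushes the total to 4400 chars
]
kept = evict(history)
assert kept == [("assistant", "b" * 900), ("user", "c" * 500)]
```

Because whole entries are dropped, the survivors are always complete turns, which is exactly the coherence property point 1 claims.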

<h3 id="design-decision-xml-tags-for-context-injection">Design Decision: XML Tags for Context Injection</h3>

<p><strong>Why XML tags instead of plain text?</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Plain text (ambiguous):
User: Help me with X
Assistant: Sure, here's how...
User: Now do Y

XML-tagged (unambiguous):
&lt;user_message&gt;Help me with X&lt;/user_message&gt;
&lt;assistant_response&gt;Sure, here's how...&lt;/assistant_response&gt;
&lt;user_message&gt;Now do Y&lt;/user_message&gt;
</code></pre></div></div>

<p><strong>Reasoning:</strong> Local models (especially smaller ones) can confuse injected history with their own output. XML tags create clear structural boundaries that even 4B parameter models respect.</p>
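A minimal sketch of the tagging step (mirroring `format_for_handoff` above, minus the locking):

```python
def format_for_handoff(entries):
    # Wrap each turn in an unambiguous XML tag so the local model
    # can't confuse injected history with its own output.
    tag = {"user": "user_message", "assistant": "assistant_response"}
    return "\n".join(
        f"<{tag[role]}>{content}</{tag[role]}>" for role, content in entries
    )

handoff = format_for_handoff([
    ("user", "Help me with X"),
    ("assistant", "Sure, here's how..."),
])
assert handoff == (
    "<user_message>Help me with X</user_message>\n"
    "<assistant_response>Sure, here's how...</assistant_response>"
)
```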

<h3 id="pii-scrubbing-with-deterministic-placeholders">PII Scrubbing with Deterministic Placeholders</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">scrub_content_for_buffer</span><span class="p">(</span>
    <span class="n">content</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="n">detected_entities</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">dict</span><span class="p">],</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="s">"""Scrub sensitive entities with hash-based placeholders.
    
    Creates deterministic placeholders so the same entity gets
    the same placeholder across turns (maintains referential coherence).
    """</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">detected_entities</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">content</span>
    
    <span class="n">scrubbed</span> <span class="o">=</span> <span class="n">content</span>
    
    <span class="c1"># Sort by length descending to handle overlapping matches
</span>    <span class="n">sorted_entities</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span>
        <span class="n">detected_entities</span><span class="p">,</span>
        <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">e</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">e</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"value"</span><span class="p">,</span> <span class="s">""</span><span class="p">)),</span>
        <span class="n">reverse</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="p">)</span>
    
    <span class="k">for</span> <span class="n">entity</span> <span class="ow">in</span> <span class="n">sorted_entities</span><span class="p">:</span>
        <span class="n">value</span> <span class="o">=</span> <span class="n">entity</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"value"</span><span class="p">,</span> <span class="s">""</span><span class="p">)</span>
        <span class="n">entity_type</span> <span class="o">=</span> <span class="n">entity</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"type"</span><span class="p">,</span> <span class="s">"REDACTED"</span><span class="p">)</span>
        
        <span class="k">if</span> <span class="n">value</span> <span class="ow">and</span> <span class="n">value</span> <span class="ow">in</span> <span class="n">scrubbed</span><span class="p">:</span>
            <span class="c1"># Deterministic placeholder from hash
</span>            <span class="n">hash_suffix</span> <span class="o">=</span> <span class="n">hashlib</span><span class="p">.</span><span class="n">sha256</span><span class="p">(</span><span class="n">value</span><span class="p">.</span><span class="n">encode</span><span class="p">()).</span><span class="n">hexdigest</span><span class="p">()[:</span><span class="mi">6</span><span class="p">]</span>
            <span class="n">placeholder</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">entity_type</span><span class="p">.</span><span class="n">upper</span><span class="p">()</span><span class="si">}</span><span class="s">_</span><span class="si">{</span><span class="n">hash_suffix</span><span class="si">}</span><span class="s">]"</span>
            <span class="n">scrubbed</span> <span class="o">=</span> <span class="n">scrubbed</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="n">value</span><span class="p">,</span> <span class="n">placeholder</span><span class="p">)</span>
    
    <span class="k">return</span> <span class="n">scrubbed</span>
</code></pre></div></div>

<p><strong>Why hash-based placeholders?</strong></p>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>Example</th>
      <th>Problem</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Simple redaction</td>
      <td><code class="language-plaintext highlighter-rouge">[REDACTED]</code></td>
      <td>“Send to [REDACTED] and CC [REDACTED]” — which is which?</td>
    </tr>
    <tr>
      <td>Numbered</td>
      <td><code class="language-plaintext highlighter-rouge">[PERSON_1]</code>, <code class="language-plaintext highlighter-rouge">[PERSON_2]</code></td>
      <td>Requires state tracking across turns</td>
    </tr>
    <tr>
      <td>Hash-based</td>
      <td><code class="language-plaintext highlighter-rouge">[PERSON_a3f2c1]</code></td>
      <td>None: same entity → same placeholder, with no state to track</td>
    </tr>
  </tbody>
</table>

<p>The hash suffix preserves referential identity: if “John Smith” appears in turns 1 and 3, both become <code class="language-plaintext highlighter-rouge">[PERSON_a3f2c1]</code>, so the local model understands they refer to the same entity.</p>
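<p>The determinism is easy to check in isolation. A minimal sketch of the scheme (the function name is illustrative, not the actual module's):</p>

```python
import hashlib

def placeholder_for(value: str, entity_type: str = "PERSON") -> str:
    """Deterministic placeholder: the same value always maps to the same tag."""
    suffix = hashlib.sha256(value.encode()).hexdigest()[:6]
    return f"[{entity_type.upper()}_{suffix}]"

# The same entity in turn 1 and turn 3 yields the same placeholder, statelessly:
assert placeholder_for("John Smith") == placeholder_for("John Smith")
assert placeholder_for("John Smith") != placeholder_for("Jane Doe")
```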

<h3 id="full-handoff-system-prompt">Full Handoff System Prompt</h3>

<p>The actual handoff injects a complete system prompt with capability guardrails:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">create_handoff_system_prompt</span><span class="p">(</span>
    <span class="nb">buffer</span><span class="p">:</span> <span class="n">RollingBuffer</span><span class="p">,</span>
    <span class="n">capability_guardrail</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="s">"""Create system prompt for local model handoff."""</span>
    <span class="n">parts</span> <span class="o">=</span> <span class="p">[]</span>
    
    <span class="c1"># Capability guardrail (what the local model cannot do)
</span>    <span class="k">if</span> <span class="n">capability_guardrail</span><span class="p">:</span>
        <span class="n">parts</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="s">"""&lt;capability_restrictions&gt;
You are operating in a SECURE LOCAL environment with the following restrictions:
- You have NO access to the internet or web browsing
- You have NO access to external APIs or services
- You have NO access to databases or file systems
- You CANNOT make HTTP requests or fetch external data
- You MUST answer based solely on your training knowledge and the conversation context

If the user asks for anything requiring external access, politely explain 
that you cannot perform that action in this secure environment.
&lt;/capability_restrictions&gt;
"""</span><span class="p">)</span>
    
    <span class="c1"># Historical context from buffer
</span>    <span class="n">history</span> <span class="o">=</span> <span class="nb">buffer</span><span class="p">.</span><span class="n">format_for_handoff</span><span class="p">()</span>
    <span class="k">if</span> <span class="n">history</span><span class="p">:</span>
        <span class="n">parts</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"""&lt;historical_context&gt;
The following is the conversation history from this session.

</span><span class="si">{</span><span class="n">history</span><span class="si">}</span><span class="s">
&lt;/historical_context&gt;
"""</span><span class="p">)</span>
    
    <span class="c1"># Current request instructions
</span>    <span class="n">parts</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="s">"""&lt;instructions&gt;
Respond to the user's current message below. Maintain conversational 
context from the history if provided. Be helpful, accurate, and concise.
&lt;/instructions&gt;
"""</span><span class="p">)</span>
    
    <span class="k">return</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">parts</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="design-decision-capability-guardrails">Design Decision: Capability Guardrails</h3>

<p><strong>Why inject capability restrictions?</strong></p>

<p>When a session locks to local, the user might still ask “search the web for X” or “check my calendar.” Without guardrails, the local model might:</p>
<ol>
  <li>Hallucinate search results</li>
  <li>Pretend it has capabilities it doesn’t</li>
  <li>Confuse the user about what happened</li>
</ol>

<p>The capability guardrail ensures the local model responds honestly: “I can’t access the web in this secure environment.”</p>

<hr />

<h2 id="component-4-backend-manager">Component 4: Backend Manager</h2>

<h3 id="the-problem-3">The Problem</h3>

<p>I needed to support multiple backends with different selection strategies:</p>

<ul>
  <li><strong>Local:</strong> Multiple Ollama models (gemma3:4b, mistral) on the same machine</li>
  <li><strong>Cloud:</strong> Anthropic (claude-sonnet-4) and Google (gemini-2.0-flash) with configurable selection</li>
</ul>

<h3 id="architecture-separate-local-and-cloud-management">Architecture: Separate Local and Cloud Management</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">BackendManager</span><span class="p">:</span>
    <span class="s">"""Manages multiple inference backends with health checking and selection."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">config</span><span class="p">:</span> <span class="n">LocalBackendsConfig</span><span class="p">,</span>
        <span class="n">cloud_selection_strategy</span><span class="p">:</span> <span class="n">Literal</span><span class="p">[</span><span class="s">"primary_fallback"</span><span class="p">,</span> <span class="s">"round_robin"</span><span class="p">]</span> <span class="o">=</span> <span class="s">"primary_fallback"</span><span class="p">,</span>
        <span class="n">cloud_primary</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"anthropic"</span><span class="p">,</span>
        <span class="n">cloud_fallback</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"google"</span><span class="p">,</span>
    <span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_config</span> <span class="o">=</span> <span class="n">config</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_local_backends</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">OllamaBackend</span><span class="p">]</span> <span class="o">=</span> <span class="p">{}</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_cloud_backends</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">BaseBackend</span><span class="p">]</span> <span class="o">=</span> <span class="p">{}</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_health_status</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">bool</span><span class="p">]</span> <span class="o">=</span> <span class="p">{}</span>
        
        <span class="c1"># Cloud selection config
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">_cloud_selection_strategy</span> <span class="o">=</span> <span class="n">cloud_selection_strategy</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_cloud_primary</span> <span class="o">=</span> <span class="n">cloud_primary</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_cloud_fallback</span> <span class="o">=</span> <span class="n">cloud_fallback</span>
        
        <span class="c1"># Round-robin state (separate counters for local and cloud)
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">_cloud_rr_index</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_local_rr_index</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_lock</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">Lock</span><span class="p">()</span>
</code></pre></div></div>

<h3 id="selection-strategies">Selection Strategies</h3>

<p><strong>Local backends</strong> support three strategies (from config):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">def</span> <span class="nf">select_local_backend</span><span class="p">(</span>
    <span class="bp">self</span><span class="p">,</span>
    <span class="n">strategy</span><span class="p">:</span> <span class="n">Literal</span><span class="p">[</span><span class="s">"priority"</span><span class="p">,</span> <span class="s">"round_robin"</span><span class="p">,</span> <span class="s">"latency_best"</span><span class="p">]</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">OllamaBackend</span> <span class="o">|</span> <span class="bp">None</span><span class="p">:</span>
    <span class="s">"""Select a local backend based on the configured strategy."""</span>
    <span class="n">strategy</span> <span class="o">=</span> <span class="n">strategy</span> <span class="ow">or</span> <span class="bp">self</span><span class="p">.</span><span class="n">_config</span><span class="p">.</span><span class="n">selection_strategy</span>
    <span class="n">healthy</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">get_healthy_local_backends</span><span class="p">()</span>

    <span class="k">if</span> <span class="ow">not</span> <span class="n">healthy</span><span class="p">:</span>
        <span class="k">return</span> <span class="bp">None</span>

    <span class="k">if</span> <span class="n">strategy</span> <span class="o">==</span> <span class="s">"priority"</span><span class="p">:</span>
        <span class="c1"># Sort by priority value from config, lowest wins
</span>        <span class="n">sorted_backends</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span>
            <span class="n">healthy</span><span class="p">,</span>
            <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">b</span><span class="p">:</span> <span class="nb">next</span><span class="p">(</span>
                <span class="n">e</span><span class="p">.</span><span class="n">priority</span> <span class="k">for</span> <span class="n">e</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">_config</span><span class="p">.</span><span class="n">endpoints</span> <span class="k">if</span> <span class="n">e</span><span class="p">.</span><span class="n">name</span> <span class="o">==</span> <span class="n">b</span><span class="p">.</span><span class="n">endpoint_name</span>
            <span class="p">),</span>
        <span class="p">)</span>
        <span class="k">return</span> <span class="n">sorted_backends</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>

    <span class="k">elif</span> <span class="n">strategy</span> <span class="o">==</span> <span class="s">"round_robin"</span><span class="p">:</span>
        <span class="k">async</span> <span class="k">with</span> <span class="bp">self</span><span class="p">.</span><span class="n">_lock</span><span class="p">:</span>
            <span class="n">backend</span> <span class="o">=</span> <span class="n">healthy</span><span class="p">[</span><span class="bp">self</span><span class="p">.</span><span class="n">_local_rr_index</span> <span class="o">%</span> <span class="nb">len</span><span class="p">(</span><span class="n">healthy</span><span class="p">)]</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_local_rr_index</span> <span class="o">+=</span> <span class="mi">1</span>
            <span class="k">return</span> <span class="n">backend</span>

    <span class="k">elif</span> <span class="n">strategy</span> <span class="o">==</span> <span class="s">"latency_best"</span><span class="p">:</span>
        <span class="c1"># TODO: Implement actual latency tracking
</span>        <span class="k">return</span> <span class="n">healthy</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>

    <span class="k">return</span> <span class="n">healthy</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</code></pre></div></div>

<p><strong>Cloud backends</strong> support two strategies:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">select_cloud_backend</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">preferred</span><span class="p">:</span> <span class="nb">str</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">BaseBackend</span> <span class="o">|</span> <span class="bp">None</span><span class="p">:</span>
    <span class="s">"""Select a cloud backend based on configured strategy.

    Strategies:
    - primary_fallback: Try primary (anthropic), then fallback (google)
    - round_robin: Alternate between healthy backends
    """</span>
    <span class="n">healthy</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">get_healthy_cloud_backends</span><span class="p">()</span>

    <span class="k">if</span> <span class="ow">not</span> <span class="n">healthy</span><span class="p">:</span>
        <span class="k">return</span> <span class="bp">None</span>

    <span class="c1"># If preferred backend specified and healthy, use it
</span>    <span class="k">if</span> <span class="n">preferred</span> <span class="ow">and</span> <span class="n">preferred</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">_cloud_backends</span><span class="p">:</span>
        <span class="n">backend</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_cloud_backends</span><span class="p">[</span><span class="n">preferred</span><span class="p">]</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">_health_status</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">preferred</span><span class="p">,</span> <span class="bp">False</span><span class="p">):</span>
            <span class="k">return</span> <span class="n">backend</span>

    <span class="c1"># Apply selection strategy
</span>    <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">_cloud_selection_strategy</span> <span class="o">==</span> <span class="s">"round_robin"</span><span class="p">:</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">_select_cloud_round_robin</span><span class="p">(</span><span class="n">healthy</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="c1"># primary_fallback (default)
</span>        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">_select_cloud_primary_fallback</span><span class="p">(</span><span class="n">healthy</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="design-decision-why-separate-local-and-cloud-selection">Design Decision: Why Separate Local and Cloud Selection?</h3>

<p><strong>Trade-off considered:</strong></p>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>Pros</th>
      <th>Cons</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Unified strategy</td>
      <td>Simpler code, one enum</td>
      <td>Cloud and local have different needs</td>
    </tr>
    <tr>
      <td>Separate strategies</td>
      <td>Tailored to each tier</td>
      <td>More configuration surface</td>
    </tr>
  </tbody>
</table>

<p><strong>My choice:</strong> Separate strategies.</p>

<p><strong>Reasoning:</strong></p>
<ol>
  <li><strong>Local models are interchangeable:</strong> gemma3:4b and mistral are both “good enough” — round-robin makes sense</li>
  <li><strong>Cloud models differ significantly:</strong> Anthropic Claude excels at reasoning; Gemini excels at speed and multimodal — primary/fallback lets me choose</li>
  <li><strong>Cost considerations:</strong> Cloud round-robin might accidentally route expensive requests to the pricier provider</li>
  <li><strong>Session stickiness:</strong> Cloud backends benefit from sticky routing (consistent model personality within a session)</li>
</ol>

<h3 id="session-sticky-backend-selection">Session-Sticky Backend Selection</h3>

<p>The backend manager supports session stickiness — once a session uses a specific backend, subsequent requests prefer that backend:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">def</span> <span class="nf">generate_cloud</span><span class="p">(</span>
    <span class="bp">self</span><span class="p">,</span>
    <span class="n">messages</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">],</span>
    <span class="n">preferred_backend</span><span class="p">:</span> <span class="nb">str</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
    <span class="n">sticky_backend</span><span class="p">:</span> <span class="nb">str</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>  <span class="c1"># From session state
</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">InferenceResult</span><span class="p">,</span> <span class="n">BaseBackend</span> <span class="o">|</span> <span class="bp">None</span><span class="p">]:</span>
    <span class="s">"""Generate using cloud with optional stickiness."""</span>
    
    <span class="c1"># Sticky backend takes precedence over preferred
</span>    <span class="n">effective_preferred</span> <span class="o">=</span> <span class="n">sticky_backend</span> <span class="ow">or</span> <span class="n">preferred_backend</span>
    
    <span class="n">backend</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">select_cloud_backend</span><span class="p">(</span><span class="n">effective_preferred</span><span class="p">)</span>
    <span class="c1"># ... rest of generation
</span></code></pre></div></div>

<p><strong>Why stickiness matters:</strong></p>
<ul>
  <li>Model personality consistency within a conversation</li>
  <li>Avoids jarring style shifts mid-conversation</li>
  <li>Round-robin still applies to <em>new</em> sessions</li>
</ul>

<h3 id="health-checking">Health Checking</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">def</span> <span class="nf">refresh_health</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">bool</span><span class="p">]:</span>
    <span class="s">"""Refresh health status for all backends concurrently."""</span>
    
    <span class="c1"># Build tasks for all backends
</span>    <span class="n">local_tasks</span> <span class="o">=</span> <span class="p">{</span>
        <span class="n">name</span><span class="p">:</span> <span class="n">backend</span><span class="p">.</span><span class="n">health_check</span><span class="p">()</span>
        <span class="k">for</span> <span class="n">name</span><span class="p">,</span> <span class="n">backend</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">_local_backends</span><span class="p">.</span><span class="n">items</span><span class="p">()</span>
    <span class="p">}</span>
    <span class="n">cloud_tasks</span> <span class="o">=</span> <span class="p">{</span>
        <span class="n">name</span><span class="p">:</span> <span class="n">backend</span><span class="p">.</span><span class="n">health_check</span><span class="p">()</span>
        <span class="k">for</span> <span class="n">name</span><span class="p">,</span> <span class="n">backend</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">_cloud_backends</span><span class="p">.</span><span class="n">items</span><span class="p">()</span>
    <span class="p">}</span>
    
    <span class="n">all_tasks</span> <span class="o">=</span> <span class="p">{</span><span class="o">**</span><span class="n">local_tasks</span><span class="p">,</span> <span class="o">**</span><span class="n">cloud_tasks</span><span class="p">}</span>
    
    <span class="c1"># Run all health checks concurrently
</span>    <span class="n">results</span> <span class="o">=</span> <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">gather</span><span class="p">(</span><span class="o">*</span><span class="n">all_tasks</span><span class="p">.</span><span class="n">values</span><span class="p">(),</span> <span class="n">return_exceptions</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    
    <span class="k">for</span> <span class="n">name</span><span class="p">,</span> <span class="n">result</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">all_tasks</span><span class="p">.</span><span class="n">keys</span><span class="p">(),</span> <span class="n">results</span><span class="p">):</span>
        <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">result</span><span class="p">,</span> <span class="nb">Exception</span><span class="p">):</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_health_status</span><span class="p">[</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="bp">False</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_health_status</span><span class="p">[</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">result</span>
    
    <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">_health_status</span><span class="p">.</span><span class="n">copy</span><span class="p">()</span>
</code></pre></div></div>

<p>For Ollama, the health check hits <code class="language-plaintext highlighter-rouge">/api/tags</code> and verifies the configured model is available:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">def</span> <span class="nf">health_check</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
    <span class="s">"""Check if Ollama is healthy and the model is available."""</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">client</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="p">.</span><span class="n">_get_client</span><span class="p">()</span>
        <span class="n">response</span> <span class="o">=</span> <span class="k">await</span> <span class="n">client</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"/api/tags"</span><span class="p">)</span>
        <span class="n">response</span><span class="p">.</span><span class="n">raise_for_status</span><span class="p">()</span>

        <span class="n">data</span> <span class="o">=</span> <span class="n">response</span><span class="p">.</span><span class="n">json</span><span class="p">()</span>
        <span class="n">models</span> <span class="o">=</span> <span class="p">[</span><span class="n">m</span><span class="p">[</span><span class="s">"name"</span><span class="p">]</span> <span class="k">for</span> <span class="n">m</span> <span class="ow">in</span> <span class="n">data</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"models"</span><span class="p">,</span> <span class="p">[])]</span>

        <span class="c1"># Check if our configured model is available
</span>        <span class="n">model_available</span> <span class="o">=</span> <span class="nb">any</span><span class="p">(</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_endpoint</span><span class="p">.</span><span class="n">model</span> <span class="ow">in</span> <span class="n">m</span> <span class="ow">or</span> <span class="n">m</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">_endpoint</span><span class="p">.</span><span class="n">model</span>
            <span class="k">for</span> <span class="n">m</span> <span class="ow">in</span> <span class="n">models</span>
        <span class="p">)</span>
        <span class="k">return</span> <span class="n">model_available</span>
    <span class="k">except</span> <span class="nb">Exception</span><span class="p">:</span>
        <span class="k">return</span> <span class="bp">False</span>
</code></pre></div></div>
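<p>Note that the containment test above is deliberately bidirectional: a configured short name like <code class="language-plaintext highlighter-rouge">mistral</code> should match Ollama's fully tagged <code class="language-plaintext highlighter-rouge">mistral:latest</code>, and vice versa. Extracted as a standalone check (illustrative helper, not the actual class method):</p>

```python
def model_matches(configured: str, available: list[str]) -> bool:
    # Either side may contain the other, e.g. "mistral" vs "mistral:latest"
    return any(configured in m or m in configured for m in available)
```

The looseness is a trade-off: it tolerates tag-suffix differences but would also accept an unrelated model whose name happens to contain the configured string.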

<h3 id="design-decision-ping-vs-inference-health-check">Design Decision: Ping vs. Inference Health Check</h3>

<p><strong>Trade-off considered:</strong></p>

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th>Pros</th>
      <th>Cons</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>HTTP ping (<code class="language-plaintext highlighter-rouge">/api/tags</code>)</td>
      <td>Fast (~10ms), low overhead</td>
      <td>Doesn’t verify model is loaded in memory</td>
    </tr>
    <tr>
      <td>Minimal inference</td>
      <td>Verifies end-to-end</td>
      <td>Slow (~2s), wastes compute</td>
    </tr>
  </tbody>
</table>

<p><strong>My choice:</strong> HTTP ping to Ollama’s <code class="language-plaintext highlighter-rouge">/api/tags</code> endpoint, then verify model is in the list.</p>

<p><strong>Reasoning:</strong> Ollama keeps models resident in memory after first load, so if the HTTP endpoint responds and lists our model, it’s ready to serve. A full inference check would add ~2s per model per check cycle.</p>

<h3 id="cross-tier-failover">Cross-Tier Failover</h3>

<p>The backend manager can fail over from cloud to local when the cloud generation path returns an error:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">def</span> <span class="nf">generate_routed</span><span class="p">(</span>
    <span class="bp">self</span><span class="p">,</span>
    <span class="n">messages</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">],</span>
    <span class="n">route</span><span class="p">:</span> <span class="n">Literal</span><span class="p">[</span><span class="s">"local"</span><span class="p">,</span> <span class="s">"cloud"</span><span class="p">],</span>
    <span class="n">preferred_cloud_backend</span><span class="p">:</span> <span class="nb">str</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
    <span class="n">sticky_backend</span><span class="p">:</span> <span class="nb">str</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">InferenceResult</span><span class="p">,</span> <span class="n">BaseBackend</span> <span class="o">|</span> <span class="bp">None</span><span class="p">,</span> <span class="n">Literal</span><span class="p">[</span><span class="s">"local"</span><span class="p">,</span> <span class="s">"cloud"</span><span class="p">]]:</span>
    <span class="s">"""Generate with automatic failover."""</span>
    
    <span class="k">if</span> <span class="n">route</span> <span class="o">==</span> <span class="s">"cloud"</span><span class="p">:</span>
        <span class="n">result</span><span class="p">,</span> <span class="n">backend</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="p">.</span><span class="n">generate_cloud</span><span class="p">(</span>
            <span class="n">messages</span><span class="p">,</span> <span class="n">preferred_cloud_backend</span><span class="p">,</span> <span class="n">sticky_backend</span>
        <span class="p">)</span>
        
        <span class="k">if</span> <span class="n">result</span><span class="p">.</span><span class="n">error</span> <span class="ow">and</span> <span class="bp">self</span><span class="p">.</span><span class="n">_config</span><span class="p">.</span><span class="n">failover_enabled</span><span class="p">:</span>
            <span class="c1"># Fallback to local if cloud fails
</span>            <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Cloud failed, falling back to local"</span><span class="p">)</span>
            <span class="n">result</span><span class="p">,</span> <span class="n">backend</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="p">.</span><span class="n">generate</span><span class="p">(</span><span class="n">messages</span><span class="p">,</span> <span class="n">sticky_backend</span><span class="o">=</span><span class="n">sticky_backend</span><span class="p">)</span>
            <span class="k">return</span> <span class="n">result</span><span class="p">,</span> <span class="n">backend</span><span class="p">,</span> <span class="s">"local"</span>  <span class="c1"># Note: actual route changed
</span>        
        <span class="k">return</span> <span class="n">result</span><span class="p">,</span> <span class="n">backend</span><span class="p">,</span> <span class="s">"cloud"</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">result</span><span class="p">,</span> <span class="n">backend</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="p">.</span><span class="n">generate</span><span class="p">(</span><span class="n">messages</span><span class="p">,</span> <span class="n">sticky_backend</span><span class="o">=</span><span class="n">sticky_backend</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">result</span><span class="p">,</span> <span class="n">backend</span><span class="p">,</span> <span class="s">"local"</span>
</code></pre></div></div>

<p><strong>Why cloud → local failover?</strong></p>

<ol>
  <li><strong>Graceful degradation:</strong> If Anthropic and Google both have outages, users still get responses</li>
  <li><strong>Tier 0-1 traffic is safe for local:</strong> No PII, so local fallback doesn’t violate privacy</li>
  <li><strong>Return actual route:</strong> Caller knows the response came from local, not cloud (important for billing, metrics)</li>
</ol>
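<p>Point 3 is easy to get wrong: if metrics are keyed on the <em>requested</em> route, every failover gets billed as cloud. A hedged sketch of caller-side accounting on the actual route (the counter scheme is illustrative, not the project’s code):</p>

```python
from collections import Counter

# Counters keyed on the ACTUAL route returned by generate_routed,
# not the route the caller asked for -- otherwise failovers are misattributed.
route_counts: Counter[str] = Counter()


def record_route(requested: str, actual: str) -> None:
    """Attribute the request to the route that actually served it."""
    route_counts[actual] += 1
    if actual != requested:
        # Track failovers separately so they show up in dashboards
        route_counts[f"failover:{requested}->{actual}"] += 1


# One cloud request failed over to local; one stayed on cloud
record_route("cloud", "local")
record_route("cloud", "cloud")
```
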

<hr />

<h2 id="component-5-shadow-mode">Component 5: Shadow Mode</h2>

<h3 id="the-problem-4">The Problem</h3>

<p>How do I know if local inference is “good enough” to replace cloud for certain traffic?</p>

<h3 id="my-solution-sequential-execution-with-background-comparison">My Solution: Sequential Execution with Background Comparison</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>User Request (Tier 0 or 1)
         │
         ▼
    Cloud Backend ────────────▶ Response to User (immediate)
         │
         │ (cloud result captured)
         │
         ▼
    ┌──────────────────┐
    │ Fire-and-Forget  │
    │ Background Task  │
    └────────┬─────────┘
             │
             ▼
    Local Backend ──────────▶ Compare ──────────▶ Log Metrics
    (async, non-blocking)
</code></pre></div></div>

<p><strong>Key insight:</strong> Shadow mode is NOT truly parallel. Cloud executes first, returns to user immediately, THEN the shadow task is triggered with the cloud result passed in. This ensures zero latency impact on the user.</p>

<h3 id="implementation-1">Implementation</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">dataclass</span>
<span class="k">class</span> <span class="nc">ShadowConfig</span><span class="p">:</span>
    <span class="s">"""Configuration for shadow mode."""</span>
    
    <span class="n">enabled</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">False</span>
    
    <span class="c1"># Which tiers to shadow (only safe tiers)
</span>    <span class="n">shadow_tiers</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="n">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="k">lambda</span><span class="p">:</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span>
    
    <span class="c1"># Sampling rate (0.0 to 1.0)
</span>    <span class="n">sample_rate</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">1.0</span>  <span class="c1"># 100% of eligible requests
</span>    
    <span class="c1"># Similarity scoring
</span>    <span class="n">similarity_enabled</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span>
    <span class="n">similarity_model</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"fast"</span>  <span class="c1"># "fast", "balanced", "accurate"
</span>    
    <span class="c1"># Timeouts
</span>    <span class="n">local_timeout_seconds</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">60.0</span>
    
    <span class="c1"># Storage
</span>    <span class="n">store_responses</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">False</span>  <span class="c1"># Memory heavy
</span>    <span class="n">max_stored_results</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">1000</span>


<span class="k">class</span> <span class="nc">ShadowRunner</span><span class="p">:</span>
    <span class="s">"""Runs shadow mode comparisons between local and cloud models.
    
    Shadow mode is non-blocking - cloud response is returned immediately
    while local inference runs in the background.
    """</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">config</span><span class="p">:</span> <span class="n">ShadowConfig</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">config</span> <span class="o">=</span> <span class="n">config</span> <span class="ow">or</span> <span class="n">ShadowConfig</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_similarity</span> <span class="o">=</span> <span class="n">get_similarity_scorer</span><span class="p">()</span>
        
        <span class="c1"># Results storage (circular buffer)
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">_results</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">ShadowResult</span><span class="p">]</span> <span class="o">=</span> <span class="p">[]</span>
        
        <span class="c1"># Background tasks tracking
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">_pending_tasks</span><span class="p">:</span> <span class="nb">set</span><span class="p">[</span><span class="n">asyncio</span><span class="p">.</span><span class="n">Task</span><span class="p">]</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span>
        
        <span class="c1"># Internal metrics
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">_total_shadows</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_successful_shadows</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_quality_matches</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_total_cost_savings</span> <span class="o">=</span> <span class="mf">0.0</span>
    
    <span class="k">def</span> <span class="nf">should_shadow</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">privacy_tier</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
        <span class="s">"""Determine if this request should be shadowed."""</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">enabled</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">False</span>
        
        <span class="c1"># Only shadow safe tiers (0 and 1 by default)
</span>        <span class="k">if</span> <span class="n">privacy_tier</span> <span class="ow">not</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">shadow_tiers</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">False</span>
        
        <span class="c1"># Apply sampling rate
</span>        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">sample_rate</span> <span class="o">&lt;</span> <span class="mf">1.0</span><span class="p">:</span>
            <span class="kn">import</span> <span class="nn">random</span>
            <span class="k">if</span> <span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">()</span> <span class="o">&gt;</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">sample_rate</span><span class="p">:</span>
                <span class="k">return</span> <span class="bp">False</span>
        
        <span class="k">return</span> <span class="bp">True</span>
    
    <span class="k">async</span> <span class="k">def</span> <span class="nf">run_shadow</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">request_id</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">messages</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">],</span>
        <span class="n">cloud_result</span><span class="p">:</span> <span class="n">InferenceResult</span><span class="p">,</span>    <span class="c1"># Cloud result already available
</span>        <span class="n">cloud_backend_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">cloud_latency_ms</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span>
        <span class="n">privacy_tier</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
        <span class="n">backend_manager</span><span class="p">:</span> <span class="n">BackendManager</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="s">"""Run shadow inference on local backend.
        
        This method is fire-and-forget — it schedules the shadow
        task and returns immediately.
        """</span>
        <span class="n">task</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">create_task</span><span class="p">(</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_run_shadow_async</span><span class="p">(</span>
                <span class="n">request_id</span><span class="o">=</span><span class="n">request_id</span><span class="p">,</span>
                <span class="n">messages</span><span class="o">=</span><span class="n">messages</span><span class="p">,</span>
                <span class="n">cloud_result</span><span class="o">=</span><span class="n">cloud_result</span><span class="p">,</span>
                <span class="n">cloud_backend_name</span><span class="o">=</span><span class="n">cloud_backend_name</span><span class="p">,</span>
                <span class="n">cloud_latency_ms</span><span class="o">=</span><span class="n">cloud_latency_ms</span><span class="p">,</span>
                <span class="n">privacy_tier</span><span class="o">=</span><span class="n">privacy_tier</span><span class="p">,</span>
                <span class="n">backend_manager</span><span class="o">=</span><span class="n">backend_manager</span><span class="p">,</span>
            <span class="p">)</span>
        <span class="p">)</span>
        
        <span class="c1"># Track task for cleanup
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">_pending_tasks</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">task</span><span class="p">)</span>
        <span class="n">task</span><span class="p">.</span><span class="n">add_done_callback</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_pending_tasks</span><span class="p">.</span><span class="n">discard</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="design-decision-why-sequential-not-parallel">Design Decision: Why Sequential, Not Parallel?</h3>

<p><strong>Trade-off considered:</strong></p>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>User latency impact</th>
      <th>Resource usage</th>
      <th>Implementation</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>True parallel (both start together)</td>
      <td>+0ms (but ties up local)</td>
      <td>2x concurrent</td>
      <td>Complex coordination</td>
    </tr>
    <tr>
      <td>Sequential (cloud first, then shadow)</td>
      <td>+0ms</td>
      <td>1x then 1x</td>
      <td>Simple, clean</td>
    </tr>
  </tbody>
</table>

<p><strong>My choice:</strong> Sequential with fire-and-forget.</p>

<p><strong>Reasoning:</strong></p>
<ol>
  <li><strong>User doesn’t wait:</strong> Cloud returns immediately; shadow runs after</li>
  <li><strong>Simpler state:</strong> the shadow task receives the cloud result directly; no cross-task coordination is needed</li>
  <li><strong>Resource efficient:</strong> Local inference only starts after cloud completes</li>
  <li><strong>Timeout is per-shadow:</strong> If local times out, we just lose that data point</li>
</ol>
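<p>Points 3 and 4 can be sketched in plain asyncio. The wrapper below is illustrative, not the project’s code; it shows the two details that matter — a per-shadow timeout that silently drops the data point, and keeping a strong reference to the task so the garbage collector can’t cancel it mid-flight:</p>

```python
import asyncio

pending: set[asyncio.Task] = set()


def fire_and_forget(coro, timeout: float = 60.0) -> asyncio.Task:
    """Schedule a shadow coroutine; a timeout loses one data point, nothing more."""

    async def _guarded():
        try:
            return await asyncio.wait_for(coro, timeout=timeout)
        except asyncio.TimeoutError:
            return None  # drop this comparison silently

    task = asyncio.create_task(_guarded())
    # Hold a strong reference until done, or the event loop may drop the task
    pending.add(task)
    task.add_done_callback(pending.discard)
    return task


async def main():
    async def slow_shadow():
        await asyncio.sleep(0.05)  # stand-in for local inference
        return "compared"

    t = fire_and_forget(slow_shadow(), timeout=1.0)
    # In real code the caller returns to the user here; we await only to demo
    assert await t == "compared"


asyncio.run(main())
```
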

<h3 id="shadow-result-structure">Shadow Result Structure</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">dataclass</span>
<span class="k">class</span> <span class="nc">ShadowResult</span><span class="p">:</span>
    <span class="s">"""Result of a shadow mode comparison."""</span>
    
    <span class="c1"># Identifiers
</span>    <span class="n">shadow_id</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">request_id</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">timestamp</span><span class="p">:</span> <span class="nb">str</span>
    
    <span class="c1"># Models used
</span>    <span class="n">cloud_model</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">local_model</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">cloud_backend</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">local_backend</span><span class="p">:</span> <span class="nb">str</span>
    
    <span class="c1"># Latency comparison
</span>    <span class="n">cloud_latency_ms</span><span class="p">:</span> <span class="nb">float</span>
    <span class="n">local_latency_ms</span><span class="p">:</span> <span class="nb">float</span>
    <span class="n">latency_diff_ms</span><span class="p">:</span> <span class="nb">float</span>  <span class="c1"># local - cloud (negative = local faster)
</span>    
    <span class="c1"># Cost comparison
</span>    <span class="n">cloud_cost_usd</span><span class="p">:</span> <span class="nb">float</span>
    <span class="n">local_cost_usd</span><span class="p">:</span> <span class="nb">float</span>  <span class="c1"># Usually 0 for local
</span>    <span class="n">cost_savings_usd</span><span class="p">:</span> <span class="nb">float</span>
    
    <span class="c1"># Quality comparison
</span>    <span class="n">similarity</span><span class="p">:</span> <span class="n">SimilarityResult</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span>
    
    <span class="o">@</span><span class="nb">property</span>
    <span class="k">def</span> <span class="nf">is_quality_match</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
        <span class="s">"""Check if local quality matches cloud (threshold: 0.75)."""</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">similarity</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">False</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">similarity</span><span class="p">.</span><span class="n">is_quality_match</span>
    
    <span class="o">@</span><span class="nb">property</span>
    <span class="k">def</span> <span class="nf">local_is_faster</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
        <span class="s">"""Check if local was faster than cloud."""</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">latency_diff_ms</span> <span class="o">&lt;</span> <span class="mi">0</span>
</code></pre></div></div>

<h3 id="similarity-computation-with-sentence-transformers">Similarity Computation with Sentence Transformers</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">SimilarityScorer</span><span class="p">:</span>
    <span class="s">"""Computes semantic similarity using sentence-transformers."""</span>
    
    <span class="c1"># Available models (speed vs accuracy tradeoff)
</span>    <span class="n">MODELS</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s">"fast"</span><span class="p">:</span> <span class="s">"all-MiniLM-L6-v2"</span><span class="p">,</span>        <span class="c1"># 80MB, 384 dims
</span>        <span class="s">"balanced"</span><span class="p">:</span> <span class="s">"all-mpnet-base-v2"</span><span class="p">,</span>    <span class="c1"># 420MB, 768 dims
</span>        <span class="s">"accurate"</span><span class="p">:</span> <span class="s">"all-roberta-large-v1"</span><span class="p">,</span> <span class="c1"># 1.3GB, 1024 dims
</span>    <span class="p">}</span>
    
    <span class="c1"># Similarity interpretation thresholds
</span>    <span class="n">THRESHOLDS</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s">"high"</span><span class="p">:</span> <span class="mf">0.85</span><span class="p">,</span>    <span class="c1"># Responses are very similar
</span>        <span class="s">"medium"</span><span class="p">:</span> <span class="mf">0.70</span><span class="p">,</span>  <span class="c1"># Responses convey similar meaning
</span>        <span class="s">"low"</span><span class="p">:</span> <span class="mf">0.0</span><span class="p">,</span>      <span class="c1"># Responses differ significantly
</span>    <span class="p">}</span>
    
    <span class="k">async</span> <span class="k">def</span> <span class="nf">compute_similarity</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">cloud_response</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">local_response</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">SimilarityResult</span><span class="p">:</span>
        <span class="s">"""Compute semantic similarity between two responses."""</span>
        
        <span class="c1"># Load model lazily
</span>        <span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="p">.</span><span class="n">_initialized</span><span class="p">:</span>
            <span class="k">await</span> <span class="bp">self</span><span class="p">.</span><span class="n">initialize</span><span class="p">()</span>
        
        <span class="c1"># Compute embeddings in thread pool (CPU-bound)
</span>        <span class="n">loop</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">get_event_loop</span><span class="p">()</span>
        <span class="n">embeddings</span> <span class="o">=</span> <span class="k">await</span> <span class="n">loop</span><span class="p">.</span><span class="n">run_in_executor</span><span class="p">(</span>
            <span class="bp">None</span><span class="p">,</span>
            <span class="k">lambda</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">_model</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span>
                <span class="p">[</span><span class="n">cloud_response</span><span class="p">,</span> <span class="n">local_response</span><span class="p">],</span>
                <span class="n">convert_to_numpy</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
                <span class="n">normalize_embeddings</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>  <span class="c1"># Pre-normalize for dot product
</span>            <span class="p">)</span>
        <span class="p">)</span>
        
        <span class="c1"># Cosine similarity (embeddings are normalized, so dot product suffices)
</span>        <span class="n">similarity</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">embeddings</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">embeddings</span><span class="p">[</span><span class="mi">1</span><span class="p">]))</span>
        
        <span class="c1"># Clamp to [0, 1] and interpret
</span>        <span class="n">similarity</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="mf">0.0</span><span class="p">,</span> <span class="nb">min</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">similarity</span><span class="p">))</span>
        
        <span class="k">if</span> <span class="n">similarity</span> <span class="o">&gt;=</span> <span class="bp">self</span><span class="p">.</span><span class="n">THRESHOLDS</span><span class="p">[</span><span class="s">"high"</span><span class="p">]:</span>
            <span class="n">interpretation</span> <span class="o">=</span> <span class="s">"high"</span>
        <span class="k">elif</span> <span class="n">similarity</span> <span class="o">&gt;=</span> <span class="bp">self</span><span class="p">.</span><span class="n">THRESHOLDS</span><span class="p">[</span><span class="s">"medium"</span><span class="p">]:</span>
            <span class="n">interpretation</span> <span class="o">=</span> <span class="s">"medium"</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">interpretation</span> <span class="o">=</span> <span class="s">"low"</span>
        
        <span class="k">return</span> <span class="n">SimilarityResult</span><span class="p">(</span>
            <span class="n">similarity_score</span><span class="o">=</span><span class="n">similarity</span><span class="p">,</span>
            <span class="n">interpretation</span><span class="o">=</span><span class="n">interpretation</span><span class="p">,</span>
            <span class="c1"># ... other fields
</span>        <span class="p">)</span>
</code></pre></div></div>

<h3 id="design-decision-why-sentence-transformers">Design Decision: Why Sentence Transformers?</h3>

<p><strong>Trade-off considered:</strong></p>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>Speed</th>
      <th>Quality</th>
      <th>Dependencies</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Jaccard similarity</td>
      <td>&lt;1ms</td>
      <td>Poor (surface only)</td>
      <td>None</td>
    </tr>
    <tr>
      <td>TF-IDF + cosine</td>
      <td>~5ms</td>
      <td>Moderate</td>
      <td>sklearn</td>
    </tr>
    <tr>
      <td>Sentence transformers</td>
      <td>~50ms</td>
      <td>Excellent (semantic)</td>
      <td>torch, transformers</td>
    </tr>
  </tbody>
</table>

<p><strong>My choice:</strong> Sentence transformers with model tiers.</p>

<p><strong>Reasoning:</strong></p>
<ol>
  <li><strong>Semantic similarity matters:</strong> “The cat sat on the mat” and “A feline rested on the rug” should score high</li>
  <li><strong>Model flexibility:</strong> “fast” (80MB) for development, “accurate” (1.3GB) for production validation</li>
  <li><strong>Lazy loading:</strong> Model only loads when first shadow comparison runs</li>
  <li><strong>Thread pool execution:</strong> Embeddings computed off the event loop to avoid blocking</li>
</ol>
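<p>A detail worth calling out from the scorer above: <code class="language-plaintext highlighter-rouge">normalize_embeddings=True</code> is what makes the plain dot product valid — for unit vectors, cosine similarity and dot product coincide. A self-contained check with toy vectors (pure Python, no model download):</p>

```python
import math


def cosine(a, b):
    """Full cosine similarity: dot product over the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


a, b = [3.0, 4.0], [4.0, 3.0]
full = cosine(a, b)                        # cosine on raw vectors
an, bn = normalize(a), normalize(b)
fast = sum(x * y for x, y in zip(an, bn))  # dot product on unit vectors
assert abs(full - fast) < 1e-12            # the two agree exactly
```

<p>Since sentence-transformers can normalize at encode time, the per-comparison cost reduces to one dot product.</p>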

<h3 id="quality-match-threshold">Quality Match Threshold</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="nb">property</span>
<span class="k">def</span> <span class="nf">is_quality_match</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
    <span class="s">"""Check if responses are semantically similar enough."""</span>
    <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">similarity_score</span> <span class="o">&gt;=</span> <span class="mf">0.75</span>
</code></pre></div></div>

<p><strong>Why 0.75?</strong></p>

<table>
  <thead>
    <tr>
      <th>Threshold</th>
      <th>Meaning</th>
      <th>Risk</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0.90+</td>
      <td>Near-identical</td>
      <td>Too strict, almost never matches</td>
    </tr>
    <tr>
      <td>0.80-0.90</td>
      <td>Very similar</td>
      <td>Good for production validation</td>
    </tr>
    <tr>
      <td>0.70-0.80</td>
      <td>Similar meaning</td>
      <td>Acceptable for most use cases</td>
    </tr>
    <tr>
      <td>&lt;0.70</td>
      <td>Different responses</td>
      <td>Quality concern</td>
    </tr>
  </tbody>
</table>

<p>I chose 0.75 as a balance — strict enough to catch quality regressions, loose enough to tolerate stylistic differences between models.</p>
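<p>Putting the table and the 0.75 cut-off together, the mapping is a few lines. The thresholds mirror the <code class="language-plaintext highlighter-rouge">SimilarityScorer.THRESHOLDS</code> shown earlier; the function name is mine:</p>

```python
QUALITY_MATCH_THRESHOLD = 0.75  # sits between "medium" (0.70) and "high" (0.85)


def interpret(score: float) -> tuple[str, bool]:
    """Map a similarity score to (interpretation, is_quality_match)."""
    if score >= 0.85:
        label = "high"
    elif score >= 0.70:
        label = "medium"
    else:
        label = "low"
    return label, score >= QUALITY_MATCH_THRESHOLD


# A 0.72 reads as "medium" similarity but still fails the quality gate
assert interpret(0.72) == ("medium", False)
assert interpret(0.78) == ("medium", True)
assert interpret(0.90) == ("high", True)
```
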

<h3 id="internal-metrics-for-controller">Internal Metrics for Controller</h3>

<p>The ShadowRunner maintains internal metrics that the controller reads directly (no Prometheus dependency):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_metrics</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="s">"""Get shadow mode metrics for controller."""</span>
    <span class="n">quality_rate</span> <span class="o">=</span> <span class="p">(</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_quality_matches</span> <span class="o">/</span> <span class="bp">self</span><span class="p">.</span><span class="n">_successful_shadows</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">_successful_shadows</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="k">else</span> <span class="mf">0.0</span>
    <span class="p">)</span>
    
    <span class="k">return</span> <span class="p">{</span>
        <span class="s">"total_shadows"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">_total_shadows</span><span class="p">,</span>
        <span class="s">"successful_shadows"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">_successful_shadows</span><span class="p">,</span>
        <span class="s">"quality_matches"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">_quality_matches</span><span class="p">,</span>
        <span class="s">"quality_match_rate"</span><span class="p">:</span> <span class="nb">round</span><span class="p">(</span><span class="n">quality_rate</span><span class="p">,</span> <span class="mi">4</span><span class="p">),</span>
        <span class="s">"total_cost_savings_usd"</span><span class="p">:</span> <span class="nb">round</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_total_cost_savings</span><span class="p">,</span> <span class="mi">4</span><span class="p">),</span>
        <span class="s">"pending_tasks"</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_pending_tasks</span><span class="p">),</span>
        <span class="s">"stored_results"</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_results</span><span class="p">),</span>
    <span class="p">}</span>
</code></pre></div></div>

<hr />

<h2 id="component-6-closed-loop-controller">Component 6: Closed-Loop Controller</h2>

<h3 id="the-problem-5">The Problem</h3>

<p>Shadow mode generates data. But who acts on it?</p>

<p>Manual review doesn’t scale. I wanted a system that could observe patterns and generate routing recommendations automatically.</p>

<h3 id="architecture">Architecture</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────────────────────────────────────────────────────────────┐
│                    Closed-Loop Controller                        │
│                                                                  │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐          │
│  │   Metrics   │    │    Rule     │    │ Recommend-  │          │
│  │   Reader    │───▶│   Engine    │───▶│   ations    │          │
│  │             │    │             │    │             │          │
│  │ • Read from │    │ • Evaluate  │    │ • Generate  │          │
│  │   Shadow    │    │   per-tier  │    │ • Log       │          │
│  │   Runner    │    │ • Threshold │    │ • (Act)*    │          │
│  │   directly  │    │ • Drift     │    │             │          │
│  └─────────────┘    └─────────────┘    └─────────────┘          │
│                                                                  │
│  *Act is disabled in "observe" mode (v1)                         │
│  No Prometheus dependency — pure Python state access             │
└──────────────────────────────────────────────────────────────────┘
</code></pre></div></div>

<p><strong>Key design point:</strong> The controller reads metrics directly from <code class="language-plaintext highlighter-rouge">ShadowRunner._results</code>, NOT from Prometheus. This eliminates an external dependency and simplifies deployment.</p>

<h3 id="metrics-reader-direct-state-access">Metrics Reader: Direct State Access</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">MetricsReader</span><span class="p">:</span>
    <span class="s">"""Reads and aggregates metrics from ShadowRunner internal state.
    
    No Prometheus dependency — pure Python state access.
    Uses timestamped filtering for rolling window calculations.
    """</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">window_seconds</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">300</span><span class="p">):</span>
        <span class="s">"""Initialize metrics reader.
        
        Args:
            window_seconds: Rolling window size in seconds (default: 5 min)
        """</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_window_seconds</span> <span class="o">=</span> <span class="n">window_seconds</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_shadow_runner</span><span class="p">:</span> <span class="n">ShadowRunner</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_previous_tier_metrics</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">int</span><span class="p">,</span> <span class="n">TierMetrics</span><span class="p">]</span> <span class="o">=</span> <span class="p">{}</span>
    
    <span class="k">def</span> <span class="nf">set_shadow_runner</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">runner</span><span class="p">:</span> <span class="n">ShadowRunner</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="s">"""Set the shadow runner to read metrics from."""</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_shadow_runner</span> <span class="o">=</span> <span class="n">runner</span>
    
    <span class="k">def</span> <span class="nf">get_tier_metrics</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">tier</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
        <span class="n">window_seconds</span><span class="p">:</span> <span class="nb">int</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
        <span class="n">quality_threshold</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.85</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">TierMetrics</span><span class="p">:</span>
        <span class="s">"""Get aggregated metrics for a specific tier."""</span>
        <span class="n">window</span> <span class="o">=</span> <span class="n">window_seconds</span> <span class="ow">or</span> <span class="bp">self</span><span class="p">.</span><span class="n">_window_seconds</span>
        <span class="n">cutoff</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">(</span><span class="n">timezone</span><span class="p">.</span><span class="n">utc</span><span class="p">)</span> <span class="o">-</span> <span class="n">timedelta</span><span class="p">(</span><span class="n">seconds</span><span class="o">=</span><span class="n">window</span><span class="p">)</span>
        
        <span class="c1"># Read directly from shadow runner's internal results
</span>        <span class="n">samples</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_collect_samples</span><span class="p">(</span><span class="n">tier</span><span class="p">,</span> <span class="n">cutoff</span><span class="p">)</span>
        
        <span class="k">if</span> <span class="ow">not</span> <span class="n">samples</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">TierMetrics</span><span class="p">(</span><span class="n">tier</span><span class="o">=</span><span class="n">tier</span><span class="p">)</span>
        
        <span class="c1"># Calculate statistics
</span>        <span class="n">similarities</span> <span class="o">=</span> <span class="p">[</span><span class="n">s</span><span class="p">.</span><span class="n">similarity_score</span> <span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="n">samples</span><span class="p">]</span>
        
        <span class="k">return</span> <span class="n">TierMetrics</span><span class="p">(</span>
            <span class="n">tier</span><span class="o">=</span><span class="n">tier</span><span class="p">,</span>
            <span class="n">sample_count</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">samples</span><span class="p">),</span>
            <span class="n">avg_similarity</span><span class="o">=</span><span class="n">mean</span><span class="p">(</span><span class="n">similarities</span><span class="p">),</span>
            <span class="n">min_similarity</span><span class="o">=</span><span class="nb">min</span><span class="p">(</span><span class="n">similarities</span><span class="p">),</span>
            <span class="n">max_similarity</span><span class="o">=</span><span class="nb">max</span><span class="p">(</span><span class="n">similarities</span><span class="p">),</span>
            <span class="n">p50_similarity</span><span class="o">=</span><span class="n">median</span><span class="p">(</span><span class="n">similarities</span><span class="p">),</span>
            <span class="n">quality_match_rate</span><span class="o">=</span><span class="nb">len</span><span class="p">([</span><span class="n">s</span> <span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="n">samples</span> <span class="k">if</span> <span class="n">s</span><span class="p">.</span><span class="n">similarity_score</span> <span class="o">&gt;=</span> <span class="n">quality_threshold</span><span class="p">])</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">samples</span><span class="p">),</span>
            <span class="c1"># ... other fields
</span>        <span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">_collect_samples</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">tier</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">cutoff</span><span class="p">:</span> <span class="n">datetime</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="n">MetricsSample</span><span class="p">]:</span>
        <span class="s">"""Collect samples by reading ShadowRunner._results directly."""</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="p">.</span><span class="n">_shadow_runner</span><span class="p">:</span>
            <span class="k">return</span> <span class="p">[]</span>
        
        <span class="n">samples</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="n">results</span> <span class="o">=</span> <span class="nb">getattr</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_shadow_runner</span><span class="p">,</span> <span class="s">"_results"</span><span class="p">,</span> <span class="p">[])</span>
        
        <span class="k">for</span> <span class="n">result</span> <span class="ow">in</span> <span class="n">results</span><span class="p">:</span>
            <span class="n">result_time</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_parse_timestamp</span><span class="p">(</span><span class="n">result</span><span class="p">.</span><span class="n">timestamp</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">result_time</span> <span class="ow">and</span> <span class="n">result_time</span> <span class="o">&gt;=</span> <span class="n">cutoff</span> <span class="ow">and</span> <span class="n">result</span><span class="p">.</span><span class="n">privacy_tier</span> <span class="o">==</span> <span class="n">tier</span><span class="p">:</span>
                <span class="n">samples</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">MetricsSample</span><span class="p">(</span>
                    <span class="n">tier</span><span class="o">=</span><span class="n">result</span><span class="p">.</span><span class="n">privacy_tier</span><span class="p">,</span>
                    <span class="n">similarity_score</span><span class="o">=</span><span class="n">result</span><span class="p">.</span><span class="n">similarity</span><span class="p">.</span><span class="n">similarity_score</span> <span class="k">if</span> <span class="n">result</span><span class="p">.</span><span class="n">similarity</span> <span class="k">else</span> <span class="mf">0.0</span><span class="p">,</span>
                    <span class="n">latency_diff_ms</span><span class="o">=</span><span class="n">result</span><span class="p">.</span><span class="n">latency_diff_ms</span><span class="p">,</span>
                    <span class="n">cost_savings_usd</span><span class="o">=</span><span class="n">result</span><span class="p">.</span><span class="n">cost_savings_usd</span><span class="p">,</span>
                    <span class="n">is_quality_match</span><span class="o">=</span><span class="n">result</span><span class="p">.</span><span class="n">is_quality_match</span><span class="p">,</span>
                    <span class="n">timestamp</span><span class="o">=</span><span class="n">result_time</span><span class="p">,</span>
                <span class="p">))</span>
        
        <span class="k">return</span> <span class="n">samples</span>
</code></pre></div></div>
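<p>As a sanity check, the rolling-window cutoff at the heart of <code class="language-plaintext highlighter-rouge">get_tier_metrics</code> can be sketched in isolation. The <code class="language-plaintext highlighter-rouge">in_window</code> helper below is hypothetical, not part of the codebase; it only mirrors the <code class="language-plaintext highlighter-rouge">cutoff</code> comparison:</p>

```python
from datetime import datetime, timedelta, timezone

def in_window(timestamps, window_seconds=300, now=None):
    """Keep only timestamps inside the rolling window (sketch of the cutoff logic)."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(seconds=window_seconds)
    return [t for t in timestamps if t >= cutoff]

now = datetime(2026, 4, 19, 12, 0, tzinfo=timezone.utc)
stamps = [now - timedelta(seconds=s) for s in (10, 200, 400)]
print(len(in_window(stamps, 300, now)))  # 2 — the 400-second-old sample falls outside
```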

<h3 id="design-decision-why-not-prometheus">Design Decision: Why Not Prometheus?</h3>

<p><strong>Trade-off considered:</strong></p>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>Pros</th>
      <th>Cons</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Query Prometheus</td>
      <td>Industry standard, dashboards built-in</td>
      <td>External dependency, network latency</td>
    </tr>
    <tr>
      <td>Read ShadowRunner directly</td>
      <td>Zero dependencies, faster, simpler</td>
      <td>Tighter coupling, no persistence</td>
    </tr>
  </tbody>
</table>

<p><strong>My choice:</strong> Direct state access.</p>

<p><strong>Reasoning:</strong></p>
<ol>
  <li><strong>Simplicity:</strong> Controller runs in same process as ShadowRunner — no network hop</li>
  <li><strong>Deployment:</strong> One less service to configure and maintain</li>
  <li><strong>Latency:</strong> Sub-millisecond access vs. Prometheus query latency</li>
  <li><strong>Portfolio scope:</strong> Demonstrating the pattern is sufficient; production might add Prometheus for persistence</li>
</ol>
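<p>The same-process coupling is easy to see in miniature. This sketch uses <code class="language-plaintext highlighter-rouge">SimpleNamespace</code> stand-ins for the runner and its result objects; only the <code class="language-plaintext highlighter-rouge">getattr</code> read mirrors the real code:</p>

```python
from types import SimpleNamespace

# Hypothetical stand-in for ShadowRunner: only the _results attribute matters here.
runner = SimpleNamespace(_results=[
    SimpleNamespace(privacy_tier=0, similarity_score=0.91),
    SimpleNamespace(privacy_tier=1, similarity_score=0.83),
    SimpleNamespace(privacy_tier=0, similarity_score=0.88),
])

# Same-process read: a plain attribute access, no network hop, no query language.
results = getattr(runner, "_results", [])
tier0 = [r.similarity_score for r in results if r.privacy_tier == 0]
print(len(tier0), round(sum(tier0) / len(tier0), 3))  # 2 0.895
```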

<h3 id="rule-engine">Rule Engine</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">RecommendationType</span><span class="p">(</span><span class="n">Enum</span><span class="p">):</span>
    <span class="s">"""Types of routing recommendations."""</span>
    
    <span class="n">ROUTE_TO_LOCAL</span> <span class="o">=</span> <span class="s">"route_to_local"</span>       <span class="c1"># Quality good, save money
</span>    <span class="n">KEEP_ON_CLOUD</span> <span class="o">=</span> <span class="s">"keep_on_cloud"</span>         <span class="c1"># Quality insufficient
</span>    <span class="n">DRIFT_ALERT</span> <span class="o">=</span> <span class="s">"drift_alert"</span>             <span class="c1"># Quality degraded from previous
</span>    <span class="n">INSUFFICIENT_DATA</span> <span class="o">=</span> <span class="s">"insufficient_data"</span>  <span class="c1"># Need more samples
</span>    <span class="n">NO_CHANGE</span> <span class="o">=</span> <span class="s">"no_change"</span>                  <span class="c1"># Current config is optimal
</span>

<span class="k">class</span> <span class="nc">Confidence</span><span class="p">(</span><span class="n">Enum</span><span class="p">):</span>
    <span class="s">"""Confidence level for recommendations."""</span>
    
    <span class="n">HIGH</span> <span class="o">=</span> <span class="s">"high"</span>      <span class="c1"># &gt;500 samples, stable metrics
</span>    <span class="n">MEDIUM</span> <span class="o">=</span> <span class="s">"medium"</span>  <span class="c1"># 100-500 samples
</span>    <span class="n">LOW</span> <span class="o">=</span> <span class="s">"low"</span>        <span class="c1"># &lt;100 samples
</span>

<span class="k">class</span> <span class="nc">RuleEngine</span><span class="p">:</span>
    <span class="s">"""Evaluates metrics against rules to generate recommendations."""</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">config</span><span class="p">:</span> <span class="n">ControllerConfig</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_config</span> <span class="o">=</span> <span class="n">config</span>
    
    <span class="k">def</span> <span class="nf">evaluate</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">current</span><span class="p">:</span> <span class="n">TierMetrics</span><span class="p">,</span>
        <span class="n">previous</span><span class="p">:</span> <span class="n">TierMetrics</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Recommendation</span><span class="p">:</span>
        <span class="s">"""Evaluate metrics for a tier and generate recommendation."""</span>
        
        <span class="n">ctx</span> <span class="o">=</span> <span class="n">RuleContext</span><span class="p">(</span>
            <span class="n">current_metrics</span><span class="o">=</span><span class="n">current</span><span class="p">,</span>
            <span class="n">previous_metrics</span><span class="o">=</span><span class="n">previous</span><span class="p">,</span>
            <span class="n">config</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">_config</span><span class="p">,</span>
            <span class="n">tier</span><span class="o">=</span><span class="n">current</span><span class="p">.</span><span class="n">tier</span><span class="p">,</span>
        <span class="p">)</span>
        
        <span class="c1"># Rule priority: check in order
</span>        
        <span class="c1"># 1. Insufficient data
</span>        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">_check_insufficient_data</span><span class="p">(</span><span class="n">ctx</span><span class="p">):</span>
            <span class="n">min_samples</span> <span class="o">=</span> <span class="n">ctx</span><span class="p">.</span><span class="n">threshold_config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"min_samples"</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span>
            <span class="k">return</span> <span class="n">Recommendation</span><span class="p">(</span>
                <span class="n">tier</span><span class="o">=</span><span class="n">ctx</span><span class="p">.</span><span class="n">tier</span><span class="p">,</span>
                <span class="n">recommendation</span><span class="o">=</span><span class="n">RecommendationType</span><span class="p">.</span><span class="n">INSUFFICIENT_DATA</span><span class="p">,</span>
                <span class="n">reason</span><span class="o">=</span><span class="sa">f</span><span class="s">"Only </span><span class="si">{</span><span class="n">current</span><span class="p">.</span><span class="n">sample_count</span><span class="si">}</span><span class="s"> samples, need </span><span class="si">{</span><span class="n">min_samples</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
                <span class="n">confidence</span><span class="o">=</span><span class="n">Confidence</span><span class="p">.</span><span class="n">LOW</span><span class="p">,</span>
            <span class="p">)</span>
        
        <span class="c1"># 2. Drift alert (quality degradation)
</span>        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">_check_drift</span><span class="p">(</span><span class="n">ctx</span><span class="p">):</span>
            <span class="n">delta</span> <span class="o">=</span> <span class="n">previous</span><span class="p">.</span><span class="n">avg_similarity</span> <span class="o">-</span> <span class="n">current</span><span class="p">.</span><span class="n">avg_similarity</span>
            <span class="k">return</span> <span class="n">Recommendation</span><span class="p">(</span>
                <span class="n">tier</span><span class="o">=</span><span class="n">ctx</span><span class="p">.</span><span class="n">tier</span><span class="p">,</span>
                <span class="n">recommendation</span><span class="o">=</span><span class="n">RecommendationType</span><span class="p">.</span><span class="n">DRIFT_ALERT</span><span class="p">,</span>
                <span class="n">reason</span><span class="o">=</span><span class="sa">f</span><span class="s">"Quality degraded by </span><span class="si">{</span><span class="n">delta</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="o">%</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
                <span class="n">confidence</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">_get_confidence</span><span class="p">(</span><span class="n">current</span><span class="p">.</span><span class="n">sample_count</span><span class="p">),</span>
                <span class="n">previous_similarity</span><span class="o">=</span><span class="n">previous</span><span class="p">.</span><span class="n">avg_similarity</span><span class="p">,</span>
                <span class="n">similarity_delta</span><span class="o">=-</span><span class="n">delta</span><span class="p">,</span>
            <span class="p">)</span>
        
        <span class="c1"># 3. Route to local (quality good + cost savings)
</span>        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">_check_route_to_local</span><span class="p">(</span><span class="n">ctx</span><span class="p">):</span>
            <span class="k">return</span> <span class="n">Recommendation</span><span class="p">(</span>
                <span class="n">tier</span><span class="o">=</span><span class="n">ctx</span><span class="p">.</span><span class="n">tier</span><span class="p">,</span>
                <span class="n">recommendation</span><span class="o">=</span><span class="n">RecommendationType</span><span class="p">.</span><span class="n">ROUTE_TO_LOCAL</span><span class="p">,</span>
                <span class="n">reason</span><span class="o">=</span><span class="sa">f</span><span class="s">"Similarity </span><span class="si">{</span><span class="n">current</span><span class="p">.</span><span class="n">avg_similarity</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="o">%</span><span class="si">}</span><span class="s"> exceeds threshold"</span><span class="p">,</span>
                <span class="n">confidence</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">_get_confidence</span><span class="p">(</span><span class="n">current</span><span class="p">.</span><span class="n">sample_count</span><span class="p">),</span>
                <span class="n">potential_savings_usd</span><span class="o">=</span><span class="n">current</span><span class="p">.</span><span class="n">total_cost_savings_usd</span><span class="p">,</span>
            <span class="p">)</span>
        
        <span class="c1"># 4. Keep on cloud (quality insufficient)
</span>        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">_check_keep_on_cloud</span><span class="p">(</span><span class="n">ctx</span><span class="p">):</span>
            <span class="n">min_similarity</span> <span class="o">=</span> <span class="n">ctx</span><span class="p">.</span><span class="n">threshold_config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"min_similarity"</span><span class="p">,</span> <span class="mf">0.85</span><span class="p">)</span>
            <span class="n">gap</span> <span class="o">=</span> <span class="n">min_similarity</span> <span class="o">-</span> <span class="n">current</span><span class="p">.</span><span class="n">avg_similarity</span>
            <span class="k">return</span> <span class="n">Recommendation</span><span class="p">(</span>
                <span class="n">tier</span><span class="o">=</span><span class="n">ctx</span><span class="p">.</span><span class="n">tier</span><span class="p">,</span>
                <span class="n">recommendation</span><span class="o">=</span><span class="n">RecommendationType</span><span class="p">.</span><span class="n">KEEP_ON_CLOUD</span><span class="p">,</span>
                <span class="n">reason</span><span class="o">=</span><span class="sa">f</span><span class="s">"Similarity </span><span class="si">{</span><span class="n">current</span><span class="p">.</span><span class="n">avg_similarity</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="o">%</span><span class="si">}</span><span class="s"> is </span><span class="si">{</span><span class="n">gap</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="o">%</span><span class="si">}</span><span class="s"> below threshold"</span><span class="p">,</span>
                <span class="n">confidence</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">_get_confidence</span><span class="p">(</span><span class="n">current</span><span class="p">.</span><span class="n">sample_count</span><span class="p">),</span>
            <span class="p">)</span>
        
        <span class="c1"># 5. No change needed
</span>        <span class="k">return</span> <span class="n">Recommendation</span><span class="p">(</span>
            <span class="n">tier</span><span class="o">=</span><span class="n">ctx</span><span class="p">.</span><span class="n">tier</span><span class="p">,</span>
            <span class="n">recommendation</span><span class="o">=</span><span class="n">RecommendationType</span><span class="p">.</span><span class="n">NO_CHANGE</span><span class="p">,</span>
            <span class="n">reason</span><span class="o">=</span><span class="s">"Current configuration is optimal"</span><span class="p">,</span>
            <span class="n">confidence</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">_get_confidence</span><span class="p">(</span><span class="n">current</span><span class="p">.</span><span class="n">sample_count</span><span class="p">),</span>
        <span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">_get_confidence</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">sample_count</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Confidence</span><span class="p">:</span>
        <span class="s">"""Determine confidence level based on sample count."""</span>
        <span class="k">if</span> <span class="n">sample_count</span> <span class="o">&gt;=</span> <span class="mi">500</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">Confidence</span><span class="p">.</span><span class="n">HIGH</span>
        <span class="k">elif</span> <span class="n">sample_count</span> <span class="o">&gt;=</span> <span class="mi">100</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">Confidence</span><span class="p">.</span><span class="n">MEDIUM</span>
        <span class="k">return</span> <span class="n">Confidence</span><span class="p">.</span><span class="n">LOW</span>
</code></pre></div></div>
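<p>The rule priority condenses into a stand-alone sketch. <code class="language-plaintext highlighter-rouge">Metrics</code> and <code class="language-plaintext highlighter-rouge">recommend</code> are trimmed stand-ins, and the cost-savings condition that also gates <code class="language-plaintext highlighter-rouge">ROUTE_TO_LOCAL</code> in the real engine is omitted:</p>

```python
from dataclasses import dataclass

@dataclass
class Metrics:  # trimmed stand-in for TierMetrics
    sample_count: int
    avg_similarity: float

def recommend(current, previous=None, min_samples=100,
              min_similarity=0.85, drift_threshold=0.10):
    """Rule cascade in the same priority order as RuleEngine.evaluate (sketch)."""
    if not (current.sample_count >= min_samples):
        return "insufficient_data"
    if previous and previous.avg_similarity - current.avg_similarity >= drift_threshold:
        return "drift_alert"
    if current.avg_similarity >= min_similarity:
        return "route_to_local"
    return "keep_on_cloud"

print(recommend(Metrics(250, 0.91)))                      # route_to_local
print(recommend(Metrics(250, 0.78), Metrics(250, 0.92)))  # drift_alert
print(recommend(Metrics(40, 0.95)))                       # insufficient_data
```

<p>Note the ordering matters: a drift alert fires even when absolute quality is still above the routing threshold, because a 10-point drop is a leading indicator worth surfacing before it becomes a routing flip.</p>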

<h3 id="controller-configuration">Controller Configuration</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">dataclass</span> 
<span class="k">class</span> <span class="nc">ControllerConfig</span><span class="p">:</span>
    <span class="s">"""Configuration for the closed-loop controller."""</span>
    
    <span class="n">enabled</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">False</span>
    <span class="n">mode</span><span class="p">:</span> <span class="n">Literal</span><span class="p">[</span><span class="s">"observe"</span><span class="p">,</span> <span class="s">"auto"</span><span class="p">]</span> <span class="o">=</span> <span class="s">"observe"</span>
    <span class="n">evaluation_interval_seconds</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">60</span>
    <span class="n">window_seconds</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">300</span>  <span class="c1"># 5 minute rolling window
</span>    
    <span class="c1"># Per-tier thresholds (different tiers can have different quality bars)
</span>    <span class="n">tier_thresholds</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">int</span><span class="p">,</span> <span class="nb">dict</span><span class="p">]</span> <span class="o">=</span> <span class="n">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="k">lambda</span><span class="p">:</span> <span class="p">{</span>
        <span class="mi">0</span><span class="p">:</span> <span class="p">{</span><span class="s">"min_similarity"</span><span class="p">:</span> <span class="mf">0.85</span><span class="p">,</span> <span class="s">"min_samples"</span><span class="p">:</span> <span class="mi">100</span><span class="p">},</span>
        <span class="mi">1</span><span class="p">:</span> <span class="p">{</span><span class="s">"min_similarity"</span><span class="p">:</span> <span class="mf">0.80</span><span class="p">,</span> <span class="s">"min_samples"</span><span class="p">:</span> <span class="mi">100</span><span class="p">},</span>
    <span class="p">})</span>
    
    <span class="c1"># Alert thresholds
</span>    <span class="n">drift_threshold</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.10</span>  <span class="c1"># Alert if similarity drops by 10%
</span>    <span class="n">cost_savings_threshold_usd</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">50.0</span>  <span class="c1"># Min savings to recommend
</span></code></pre></div></div>
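<p>Overriding the per-tier defaults looks like this. <code class="language-plaintext highlighter-rouge">Config</code> is a trimmed stand-in for <code class="language-plaintext highlighter-rouge">ControllerConfig</code>, and the explicit fallback lookup for unconfigured tiers is an illustrative pattern, not behavior shown above:</p>

```python
from dataclasses import dataclass, field

@dataclass
class Config:  # trimmed stand-in for ControllerConfig
    tier_thresholds: dict = field(default_factory=lambda: {
        0: {"min_similarity": 0.85, "min_samples": 100},
        1: {"min_similarity": 0.80, "min_samples": 100},
    })
    drift_threshold: float = 0.10

# Tighten tier 0 only; note the default_factory is replaced wholesale,
# so tier 1 must come from an explicit fallback.
cfg = Config(tier_thresholds={0: {"min_similarity": 0.95, "min_samples": 500}})
defaults = {"min_similarity": 0.85, "min_samples": 100}
tier1 = cfg.tier_thresholds.get(1, defaults)  # hypothetical fallback pattern
print(cfg.tier_thresholds[0]["min_samples"], tier1["min_similarity"])  # 500 0.85
```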

<h3 id="controller-lifecycle">Controller Lifecycle</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ClosedLoopController</span><span class="p">:</span>
    <span class="s">"""Closed-loop controller for routing optimization.
    
    Runs as asyncio background task within FastAPI lifecycle.
    """</span>
    
    <span class="k">async</span> <span class="k">def</span> <span class="nf">start</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="s">"""Start the controller background task."""</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="p">.</span><span class="n">_config</span><span class="p">.</span><span class="n">enabled</span><span class="p">:</span>
            <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Controller disabled, not starting"</span><span class="p">)</span>
            <span class="k">return</span>
        
        <span class="bp">self</span><span class="p">.</span><span class="n">_running</span> <span class="o">=</span> <span class="bp">True</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_task</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">create_task</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_run_loop</span><span class="p">())</span>
    
    <span class="k">async</span> <span class="k">def</span> <span class="nf">_run_loop</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="s">"""Main controller loop — runs evaluation at configured interval."""</span>
        <span class="k">while</span> <span class="bp">self</span><span class="p">.</span><span class="n">_running</span><span class="p">:</span>
            <span class="k">try</span><span class="p">:</span>
                <span class="k">await</span> <span class="bp">self</span><span class="p">.</span><span class="n">_evaluate</span><span class="p">()</span>
            <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
                <span class="n">logger</span><span class="p">.</span><span class="n">error</span><span class="p">(</span><span class="s">"Controller evaluation failed"</span><span class="p">,</span> <span class="n">error</span><span class="o">=</span><span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">))</span>
            
            <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_config</span><span class="p">.</span><span class="n">evaluation_interval_seconds</span><span class="p">)</span>
    
    <span class="k">async</span> <span class="k">def</span> <span class="nf">_evaluate</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="s">"""Run a single evaluation cycle."""</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_total_evaluations</span> <span class="o">+=</span> <span class="mi">1</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_last_evaluation</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">.</span><span class="n">utcnow</span><span class="p">()</span>
        
        <span class="c1"># Collect metrics for all tiers
</span>        <span class="n">tier_metrics</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_metrics_reader</span><span class="p">.</span><span class="n">get_all_tier_metrics</span><span class="p">()</span>
        
        <span class="k">if</span> <span class="ow">not</span> <span class="n">tier_metrics</span><span class="p">:</span>
            <span class="n">logger</span><span class="p">.</span><span class="n">debug</span><span class="p">(</span><span class="s">"No shadow metrics available for evaluation"</span><span class="p">)</span>
            <span class="k">return</span>
        
        <span class="c1"># Evaluate each tier
</span>        <span class="k">for</span> <span class="n">tier</span><span class="p">,</span> <span class="n">metrics</span> <span class="ow">in</span> <span class="n">tier_metrics</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
            <span class="n">previous</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_metrics_reader</span><span class="p">.</span><span class="n">get_previous_metrics</span><span class="p">(</span><span class="n">tier</span><span class="p">)</span>
            <span class="n">recommendation</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_rule_engine</span><span class="p">.</span><span class="n">evaluate</span><span class="p">(</span><span class="n">metrics</span><span class="p">,</span> <span class="n">previous</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_current_recommendations</span><span class="p">[</span><span class="n">tier</span><span class="p">]</span> <span class="o">=</span> <span class="n">recommendation</span>  <span class="c1"># keep latest for force_evaluate()</span>
            
            <span class="c1"># Log recommendation (structured for Loki)
</span>            <span class="bp">self</span><span class="p">.</span><span class="n">_log_recommendation</span><span class="p">(</span><span class="n">recommendation</span><span class="p">,</span> <span class="n">metrics</span><span class="p">)</span>
        
        <span class="c1"># Store current metrics for next evaluation's drift detection
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">_metrics_reader</span><span class="p">.</span><span class="n">store_current_as_previous</span><span class="p">(</span><span class="n">tier_metrics</span><span class="p">)</span>
    
    <span class="k">async</span> <span class="k">def</span> <span class="nf">force_evaluate</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
        <span class="s">"""Force an immediate evaluation (for testing/debugging)."""</span>
        <span class="k">await</span> <span class="bp">self</span><span class="p">.</span><span class="n">_evaluate</span><span class="p">()</span>
        <span class="k">return</span> <span class="p">{</span>
            <span class="s">"evaluation_number"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">_total_evaluations</span><span class="p">,</span>
            <span class="s">"recommendations"</span><span class="p">:</span> <span class="p">{</span>
                <span class="n">tier</span><span class="p">:</span> <span class="n">rec</span><span class="p">.</span><span class="n">to_dict</span><span class="p">()</span> 
                <span class="k">for</span> <span class="n">tier</span><span class="p">,</span> <span class="n">rec</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">_current_recommendations</span><span class="p">.</span><span class="n">items</span><span class="p">()</span>
            <span class="p">},</span>
        <span class="p">}</span>
</code></pre></div></div>
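<p>One thing the excerpt above leaves out is shutdown: a task created with <code class="language-plaintext highlighter-rouge">asyncio.create_task</code> keeps sleeping through FastAPI teardown unless it is cancelled. Here is a minimal, self-contained sketch of the start/stop lifecycle (hypothetical names; the controller's actual <code class="language-plaintext highlighter-rouge">stop()</code> isn't shown in this post):</p>

```python
import asyncio


class LoopTask:
    """Minimal start/stop lifecycle sketch (hypothetical, not the real controller)."""

    def __init__(self, interval: float = 0.01) -> None:
        self._interval = interval
        self._running = False
        self._task: asyncio.Task | None = None
        self.evaluations = 0

    async def start(self) -> None:
        self._running = True
        self._task = asyncio.create_task(self._run_loop())

    async def stop(self) -> None:
        # Flip the flag first, then cancel the pending sleep so shutdown is prompt.
        self._running = False
        if self._task is not None:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass

    async def _run_loop(self) -> None:
        while self._running:
            self.evaluations += 1  # stands in for _evaluate()
            await asyncio.sleep(self._interval)


async def main() -> int:
    controller = LoopTask()
    await controller.start()
    await asyncio.sleep(0.05)  # let a few cycles run
    await controller.stop()
    return controller.evaluations


evaluations = asyncio.run(main())
```

<p>Cancelling the task rather than just clearing the flag matters: with a long evaluation interval, a flag-only stop would block teardown until the next wakeup.</p>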

<h3 id="design-decision-observe-vs-auto-mode">Design Decision: Observe vs Auto Mode</h3>

<p><strong>Trade-off considered:</strong></p>

<table>
  <thead>
    <tr>
      <th>Mode</th>
      <th>Behavior</th>
      <th>Risk</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Observe</td>
      <td>Log recommendations only</td>
      <td>None (human reviews)</td>
    </tr>
    <tr>
      <td>Auto</td>
      <td>Adjust routing weights automatically</td>
      <td>Runaway feedback loops</td>
    </tr>
  </tbody>
</table>

<p><strong>My choice:</strong> Observe mode only for v1.</p>

<p><strong>Reasoning:</strong> Auto-routing is powerful but dangerous. A bug in similarity computation could cause the controller to route all traffic to local, degrading user experience. For a portfolio project, demonstrating the architecture is sufficient. Production deployment would require:</p>
<ol>
  <li>Extensive testing of the feedback loop</li>
  <li>Circuit breakers to prevent runaway changes</li>
  <li>Gradual rollout (route 1% to local, measure, increase)</li>
  <li>Human approval gates for significant changes</li>
</ol>
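<p>Items 2 and 3 are straightforward to make concrete: before any recommendation touched live routing weights, I would clamp it to a small bounded step with a hard ceiling, so even a pathological similarity bug can only nudge traffic. A hypothetical guard (not part of the v1 code, which only logs recommendations):</p>

```python
def propose_weight_change(
    current_local_weight: float,
    recommended_local_weight: float,
    max_step: float = 0.01,  # at most one point of traffic per evaluation cycle
    ceiling: float = 0.25,   # hard cap until quality evidence accumulates
) -> float:
    """Clamp a controller recommendation to a bounded, capped step.

    Hypothetical circuit-breaker sketch: turns an arbitrary recommendation
    into a small move toward it, never past the ceiling.
    """
    delta = recommended_local_weight - current_local_weight
    step = max(-max_step, min(max_step, delta))  # bound the per-cycle change
    proposed = current_local_weight + step
    return max(0.0, min(ceiling, proposed))      # enforce absolute floor and cap


# Even a buggy "route everything locally" recommendation moves one step:
print(round(propose_weight_change(0.05, 1.0), 2))  # 0.06
print(round(propose_weight_change(0.24, 1.0), 2))  # 0.25 (ceiling)
```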

<hr />

<h2 id="component-7-observability-stack">Component 7: Observability Stack</h2>

<h3 id="architecture-three-pillars">Architecture: Three Pillars</h3>

<p>The gateway implements all three observability pillars:</p>

<table>
  <thead>
    <tr>
      <th>Pillar</th>
      <th>Library</th>
      <th>Export Target</th>
      <th>Purpose</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Metrics</strong></td>
      <td><code class="language-plaintext highlighter-rouge">prometheus_client</code></td>
      <td>Prometheus</td>
      <td>Counters, histograms, gauges</td>
    </tr>
    <tr>
      <td><strong>Traces</strong></td>
      <td>OpenTelemetry</td>
      <td>Tempo (OTLP)</td>
      <td>Request flow, latency breakdown</td>
    </tr>
    <tr>
      <td><strong>Logs</strong></td>
      <td><code class="language-plaintext highlighter-rouge">structlog</code></td>
      <td>Loki (JSON)</td>
      <td>Events, debugging</td>
    </tr>
  </tbody>
</table>

<p><strong>Note:</strong> Metrics use <code class="language-plaintext highlighter-rouge">prometheus_client</code> (Prometheus-native), NOT OpenTelemetry. This is intentional — Prometheus client is more mature and Grafana dashboards work seamlessly with it.</p>

<h3 id="metrics-prometheus-native">Metrics: Prometheus-Native</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">prometheus_client</span> <span class="kn">import</span> <span class="n">Counter</span><span class="p">,</span> <span class="n">Histogram</span><span class="p">,</span> <span class="n">Gauge</span><span class="p">,</span> <span class="n">Info</span>

<span class="c1"># =============================================================================
# Request Metrics
# =============================================================================
</span><span class="n">REQUESTS_TOTAL</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">(</span>
    <span class="s">"sentinel_requests_total"</span><span class="p">,</span>
    <span class="s">"Total number of inference requests"</span><span class="p">,</span>
    <span class="n">labelnames</span><span class="o">=</span><span class="p">[</span><span class="s">"route"</span><span class="p">,</span> <span class="s">"backend"</span><span class="p">,</span> <span class="s">"endpoint"</span><span class="p">,</span> <span class="s">"model"</span><span class="p">,</span> <span class="s">"tier"</span><span class="p">,</span> <span class="s">"status"</span><span class="p">]</span>
<span class="p">)</span>

<span class="n">REQUESTS_IN_PROGRESS</span> <span class="o">=</span> <span class="n">Gauge</span><span class="p">(</span>
    <span class="s">"sentinel_requests_in_progress"</span><span class="p">,</span>
    <span class="s">"Number of requests currently being processed"</span><span class="p">,</span>
    <span class="n">labelnames</span><span class="o">=</span><span class="p">[</span><span class="s">"backend"</span><span class="p">]</span>
<span class="p">)</span>

<span class="c1"># =============================================================================
# Latency Metrics (LLM-specific histograms)
# =============================================================================
</span><span class="n">LATENCY_BUCKETS_SEC</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0.01</span><span class="p">,</span> <span class="mf">0.025</span><span class="p">,</span> <span class="mf">0.05</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.5</span><span class="p">,</span> <span class="mf">5.0</span><span class="p">,</span> <span class="mf">10.0</span><span class="p">,</span> <span class="mf">30.0</span><span class="p">,</span> <span class="mf">60.0</span><span class="p">]</span>

<span class="n">TTFT_SECONDS</span> <span class="o">=</span> <span class="n">Histogram</span><span class="p">(</span>
    <span class="s">"sentinel_ttft_seconds"</span><span class="p">,</span>
    <span class="s">"Time to first token in seconds"</span><span class="p">,</span>
    <span class="n">labelnames</span><span class="o">=</span><span class="p">[</span><span class="s">"backend"</span><span class="p">,</span> <span class="s">"endpoint"</span><span class="p">,</span> <span class="s">"model"</span><span class="p">],</span>
    <span class="n">buckets</span><span class="o">=</span><span class="n">LATENCY_BUCKETS_SEC</span>
<span class="p">)</span>

<span class="n">ITL_SECONDS</span> <span class="o">=</span> <span class="n">Histogram</span><span class="p">(</span>
    <span class="s">"sentinel_itl_seconds"</span><span class="p">,</span>
    <span class="s">"Inter-token latency in seconds"</span><span class="p">,</span>
    <span class="n">labelnames</span><span class="o">=</span><span class="p">[</span><span class="s">"backend"</span><span class="p">,</span> <span class="s">"endpoint"</span><span class="p">,</span> <span class="s">"model"</span><span class="p">],</span>
    <span class="n">buckets</span><span class="o">=</span><span class="p">[</span><span class="mf">0.005</span><span class="p">,</span> <span class="mf">0.010</span><span class="p">,</span> <span class="mf">0.020</span><span class="p">,</span> <span class="mf">0.030</span><span class="p">,</span> <span class="mf">0.050</span><span class="p">,</span> <span class="mf">0.075</span><span class="p">,</span> <span class="mf">0.100</span><span class="p">,</span> <span class="mf">0.150</span><span class="p">,</span> <span class="mf">0.200</span><span class="p">,</span> <span class="mf">0.500</span><span class="p">]</span>
<span class="p">)</span>

<span class="n">TPOT_SECONDS</span> <span class="o">=</span> <span class="n">Histogram</span><span class="p">(</span>
    <span class="s">"sentinel_tpot_seconds"</span><span class="p">,</span>
    <span class="s">"Time per output token in seconds"</span><span class="p">,</span>
    <span class="n">labelnames</span><span class="o">=</span><span class="p">[</span><span class="s">"backend"</span><span class="p">,</span> <span class="s">"endpoint"</span><span class="p">,</span> <span class="s">"model"</span><span class="p">],</span>
    <span class="n">buckets</span><span class="o">=</span><span class="p">[</span><span class="mf">0.005</span><span class="p">,</span> <span class="mf">0.010</span><span class="p">,</span> <span class="mf">0.020</span><span class="p">,</span> <span class="mf">0.030</span><span class="p">,</span> <span class="mf">0.050</span><span class="p">,</span> <span class="mf">0.075</span><span class="p">,</span> <span class="mf">0.100</span><span class="p">,</span> <span class="mf">0.150</span><span class="p">,</span> <span class="mf">0.200</span><span class="p">,</span> <span class="mf">0.500</span><span class="p">]</span>
<span class="p">)</span>

<span class="n">CLASSIFICATION_LATENCY_SECONDS</span> <span class="o">=</span> <span class="n">Histogram</span><span class="p">(</span>
    <span class="s">"sentinel_classification_latency_seconds"</span><span class="p">,</span>
    <span class="s">"Privacy classification latency in seconds"</span><span class="p">,</span>
    <span class="n">labelnames</span><span class="o">=</span><span class="p">[</span><span class="s">"detection_method"</span><span class="p">],</span>
    <span class="n">buckets</span><span class="o">=</span><span class="p">[</span><span class="mf">0.0001</span><span class="p">,</span> <span class="mf">0.0005</span><span class="p">,</span> <span class="mf">0.001</span><span class="p">,</span> <span class="mf">0.005</span><span class="p">,</span> <span class="mf">0.01</span><span class="p">,</span> <span class="mf">0.05</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">]</span>
<span class="p">)</span>

<span class="c1"># =============================================================================
# Shadow Mode Metrics
# =============================================================================
</span><span class="n">SHADOW_REQUESTS_TOTAL</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">(</span>
    <span class="s">"sentinel_shadow_requests_total"</span><span class="p">,</span>
    <span class="s">"Total shadow mode comparisons"</span><span class="p">,</span>
    <span class="n">labelnames</span><span class="o">=</span><span class="p">[</span><span class="s">"status"</span><span class="p">]</span>  <span class="c1"># success, timeout, error
</span><span class="p">)</span>

<span class="n">SHADOW_SIMILARITY_SCORE</span> <span class="o">=</span> <span class="n">Histogram</span><span class="p">(</span>
    <span class="s">"sentinel_shadow_similarity_score"</span><span class="p">,</span>
    <span class="s">"Semantic similarity score between cloud and local outputs"</span><span class="p">,</span>
    <span class="n">buckets</span><span class="o">=</span><span class="p">[</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.4</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="mf">0.6</span><span class="p">,</span> <span class="mf">0.7</span><span class="p">,</span> <span class="mf">0.75</span><span class="p">,</span> <span class="mf">0.8</span><span class="p">,</span> <span class="mf">0.85</span><span class="p">,</span> <span class="mf">0.9</span><span class="p">,</span> <span class="mf">0.95</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">]</span>
<span class="p">)</span>

<span class="n">SHADOW_QUALITY_MATCH</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">(</span>
    <span class="s">"sentinel_shadow_quality_match_total"</span><span class="p">,</span>
    <span class="s">"Shadow comparisons where local quality matched cloud"</span><span class="p">,</span>
    <span class="n">labelnames</span><span class="o">=</span><span class="p">[</span><span class="s">"tier"</span><span class="p">]</span>
<span class="p">)</span>

<span class="c1"># =============================================================================
# Cost Metrics
# =============================================================================
</span><span class="n">COST_USD_TOTAL</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">(</span>
    <span class="s">"sentinel_cost_usd_total"</span><span class="p">,</span>
    <span class="s">"Total inference cost in USD"</span><span class="p">,</span>
    <span class="n">labelnames</span><span class="o">=</span><span class="p">[</span><span class="s">"backend"</span><span class="p">,</span> <span class="s">"model"</span><span class="p">]</span>
<span class="p">)</span>

<span class="n">COST_SAVINGS_USD_TOTAL</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">(</span>
    <span class="s">"sentinel_cost_savings_usd_total"</span><span class="p">,</span>
    <span class="s">"Total cost savings from local routing in USD"</span>
<span class="p">)</span>
</code></pre></div></div>
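<p>The bucket boundaries above deserve a note. Prometheus histograms store only cumulative counts per bucket, and <code class="language-plaintext highlighter-rouge">histogram_quantile</code> linearly interpolates inside the bucket that contains the target rank, so a p95 landing in a wide bucket is estimated coarsely. A simplified stdlib sketch of that estimation (real PromQL operates on rates of these counters, not raw counts):</p>

```python
import bisect


def histogram_quantile(q: float, upper_bounds: list[float],
                       cumulative_counts: list[int]) -> float:
    """Approximate a quantile from cumulative buckets via linear interpolation,
    mimicking (in simplified form) Prometheus's histogram_quantile."""
    total = cumulative_counts[-1]
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    i = bisect.bisect_left(cumulative_counts, rank)
    lo = upper_bounds[i - 1] if i > 0 else 0.0
    hi = upper_bounds[i]
    below = cumulative_counts[i - 1] if i > 0 else 0
    in_bucket = cumulative_counts[i] - below
    if in_bucket == 0:
        return hi
    return lo + (hi - lo) * (rank - below) / in_bucket


# TTFT bucket bounds (seconds) with hypothetical cumulative counts.
bounds = [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]
counts = [5, 40, 80, 95, 99, 100, 100]
p95 = histogram_quantile(0.95, bounds, counts)  # ≈ 0.1 s
```

<p>This is why the ITL and TPOT histograms above use dense buckets in the 5–200 ms range: that is where their quantiles actually live, and a quantile that falls in a wide bucket can only be as precise as the bucket is narrow.</p>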

<h3 id="helper-functions-for-clean-recording">Helper Functions for Clean Recording</h3>

<p>Rather than scattering metric updates throughout the codebase, I centralized them:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">record_request</span><span class="p">(</span>
    <span class="n">route</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="n">backend</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="n">endpoint</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="n">model</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="n">tier</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
    <span class="n">status</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"success"</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="s">"""Record a completed request."""</span>
    <span class="n">REQUESTS_TOTAL</span><span class="p">.</span><span class="n">labels</span><span class="p">(</span>
        <span class="n">route</span><span class="o">=</span><span class="n">route</span><span class="p">,</span>
        <span class="n">backend</span><span class="o">=</span><span class="n">backend</span><span class="p">,</span>
        <span class="n">endpoint</span><span class="o">=</span><span class="n">endpoint</span><span class="p">,</span>
        <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span>
        <span class="n">tier</span><span class="o">=</span><span class="nb">str</span><span class="p">(</span><span class="n">tier</span><span class="p">),</span>
        <span class="n">status</span><span class="o">=</span><span class="n">status</span>
    <span class="p">).</span><span class="n">inc</span><span class="p">()</span>


<span class="k">def</span> <span class="nf">record_latencies</span><span class="p">(</span>
    <span class="n">backend</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="n">endpoint</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="n">model</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="n">ttft_ms</span><span class="p">:</span> <span class="nb">float</span> <span class="o">|</span> <span class="bp">None</span><span class="p">,</span>
    <span class="n">itl_ms</span><span class="p">:</span> <span class="nb">float</span> <span class="o">|</span> <span class="bp">None</span><span class="p">,</span>
    <span class="n">tpot_ms</span><span class="p">:</span> <span class="nb">float</span> <span class="o">|</span> <span class="bp">None</span><span class="p">,</span>
    <span class="n">total_ms</span><span class="p">:</span> <span class="nb">float</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="s">"""Record all latency metrics for a request."""</span>
    <span class="n">labels</span> <span class="o">=</span> <span class="p">{</span><span class="s">"backend"</span><span class="p">:</span> <span class="n">backend</span><span class="p">,</span> <span class="s">"endpoint"</span><span class="p">:</span> <span class="n">endpoint</span><span class="p">,</span> <span class="s">"model"</span><span class="p">:</span> <span class="n">model</span><span class="p">}</span>
    
    <span class="k">if</span> <span class="n">ttft_ms</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">TTFT_SECONDS</span><span class="p">.</span><span class="n">labels</span><span class="p">(</span><span class="o">**</span><span class="n">labels</span><span class="p">).</span><span class="n">observe</span><span class="p">(</span><span class="n">ttft_ms</span> <span class="o">/</span> <span class="mi">1000</span><span class="p">)</span>
    
    <span class="k">if</span> <span class="n">itl_ms</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">ITL_SECONDS</span><span class="p">.</span><span class="n">labels</span><span class="p">(</span><span class="o">**</span><span class="n">labels</span><span class="p">).</span><span class="n">observe</span><span class="p">(</span><span class="n">itl_ms</span> <span class="o">/</span> <span class="mi">1000</span><span class="p">)</span>
    
    <span class="k">if</span> <span class="n">tpot_ms</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">TPOT_SECONDS</span><span class="p">.</span><span class="n">labels</span><span class="p">(</span><span class="o">**</span><span class="n">labels</span><span class="p">).</span><span class="n">observe</span><span class="p">(</span><span class="n">tpot_ms</span> <span class="o">/</span> <span class="mi">1000</span><span class="p">)</span>
    
    <span class="n">INFERENCE_LATENCY_SECONDS</span><span class="p">.</span><span class="n">labels</span><span class="p">(</span><span class="o">**</span><span class="n">labels</span><span class="p">).</span><span class="n">observe</span><span class="p">(</span><span class="n">total_ms</span> <span class="o">/</span> <span class="mi">1000</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">record_shadow_result</span><span class="p">(</span>
    <span class="n">status</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="n">tier</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
    <span class="n">similarity_score</span><span class="p">:</span> <span class="nb">float</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
    <span class="n">latency_diff_ms</span><span class="p">:</span> <span class="nb">float</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
    <span class="n">cost_savings_usd</span><span class="p">:</span> <span class="nb">float</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
    <span class="n">is_quality_match</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">False</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="s">"""Record shadow mode comparison results."""</span>
    <span class="n">SHADOW_REQUESTS_TOTAL</span><span class="p">.</span><span class="n">labels</span><span class="p">(</span><span class="n">status</span><span class="o">=</span><span class="n">status</span><span class="p">).</span><span class="n">inc</span><span class="p">()</span>
    
    <span class="k">if</span> <span class="n">status</span> <span class="o">==</span> <span class="s">"success"</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">similarity_score</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
            <span class="n">SHADOW_SIMILARITY_SCORE</span><span class="p">.</span><span class="n">observe</span><span class="p">(</span><span class="n">similarity_score</span><span class="p">)</span>
        
        <span class="k">if</span> <span class="n">is_quality_match</span><span class="p">:</span>
            <span class="n">SHADOW_QUALITY_MATCH</span><span class="p">.</span><span class="n">labels</span><span class="p">(</span><span class="n">tier</span><span class="o">=</span><span class="nb">str</span><span class="p">(</span><span class="n">tier</span><span class="p">)).</span><span class="n">inc</span><span class="p">()</span>
</code></pre></div></div>

<h3 id="traces-opentelemetry-to-tempo">Traces: OpenTelemetry to Tempo</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">opentelemetry</span> <span class="kn">import</span> <span class="n">trace</span>
<span class="kn">from</span> <span class="nn">opentelemetry.exporter.otlp.proto.grpc.trace_exporter</span> <span class="kn">import</span> <span class="n">OTLPSpanExporter</span>
<span class="kn">from</span> <span class="nn">opentelemetry.sdk.trace</span> <span class="kn">import</span> <span class="n">TracerProvider</span>
<span class="kn">from</span> <span class="nn">opentelemetry.sdk.trace.export</span> <span class="kn">import</span> <span class="n">BatchSpanProcessor</span>
<span class="kn">from</span> <span class="nn">opentelemetry.sdk.resources</span> <span class="kn">import</span> <span class="n">SERVICE_NAME</span><span class="p">,</span> <span class="n">Resource</span>

<span class="k">def</span> <span class="nf">setup_tracing</span><span class="p">(</span>
    <span class="n">service_name</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"inference-sentinel"</span><span class="p">,</span>
    <span class="n">otlp_endpoint</span><span class="p">:</span> <span class="nb">str</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="s">"""Initialize OpenTelemetry tracing."""</span>
    
    <span class="n">resource</span> <span class="o">=</span> <span class="n">Resource</span><span class="p">.</span><span class="n">create</span><span class="p">({</span>
        <span class="n">SERVICE_NAME</span><span class="p">:</span> <span class="n">service_name</span><span class="p">,</span>
        <span class="s">"service.namespace"</span><span class="p">:</span> <span class="s">"inference-sentinel"</span><span class="p">,</span>
    <span class="p">})</span>
    
    <span class="n">provider</span> <span class="o">=</span> <span class="n">TracerProvider</span><span class="p">(</span><span class="n">resource</span><span class="o">=</span><span class="n">resource</span><span class="p">)</span>
    
    <span class="c1"># Export to Tempo via OTLP
</span>    <span class="k">if</span> <span class="n">otlp_endpoint</span><span class="p">:</span>
        <span class="n">otlp_exporter</span> <span class="o">=</span> <span class="n">OTLPSpanExporter</span><span class="p">(</span><span class="n">endpoint</span><span class="o">=</span><span class="n">otlp_endpoint</span><span class="p">,</span> <span class="n">insecure</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="n">provider</span><span class="p">.</span><span class="n">add_span_processor</span><span class="p">(</span><span class="n">BatchSpanProcessor</span><span class="p">(</span><span class="n">otlp_exporter</span><span class="p">))</span>
    
    <span class="n">trace</span><span class="p">.</span><span class="n">set_tracer_provider</span><span class="p">(</span><span class="n">provider</span><span class="p">)</span>


<span class="c1"># Context manager for spans
</span><span class="o">@</span><span class="n">contextmanager</span>
<span class="k">def</span> <span class="nf">trace_span</span><span class="p">(</span>
    <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="n">attributes</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Generator</span><span class="p">[</span><span class="n">Span</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="bp">None</span><span class="p">]:</span>
    <span class="s">"""Create a traced span context."""</span>
    <span class="n">tracer</span> <span class="o">=</span> <span class="n">get_tracer</span><span class="p">()</span>
    <span class="k">with</span> <span class="n">tracer</span><span class="p">.</span><span class="n">start_as_current_span</span><span class="p">(</span><span class="n">name</span><span class="p">)</span> <span class="k">as</span> <span class="n">span</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">attributes</span><span class="p">:</span>
            <span class="k">for</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span> <span class="ow">in</span> <span class="n">attributes</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
                <span class="k">if</span> <span class="n">value</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
                    <span class="n">span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">value</span><span class="p">)</span>
        <span class="k">yield</span> <span class="n">span</span>


<span class="c1"># Decorator for easy function tracing
</span><span class="k">def</span> <span class="nf">traced</span><span class="p">(</span><span class="n">name</span><span class="p">:</span> <span class="nb">str</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span><span class="p">):</span>
    <span class="s">"""Decorator to trace a function."""</span>
    <span class="k">def</span> <span class="nf">decorator</span><span class="p">(</span><span class="n">func</span><span class="p">):</span>
        <span class="n">span_name</span> <span class="o">=</span> <span class="n">name</span> <span class="ow">or</span> <span class="n">func</span><span class="p">.</span><span class="n">__name__</span>
        
        <span class="k">async</span> <span class="k">def</span> <span class="nf">async_wrapper</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
            <span class="k">with</span> <span class="n">trace_span</span><span class="p">(</span><span class="n">span_name</span><span class="p">)</span> <span class="k">as</span> <span class="n">span</span><span class="p">:</span>
                <span class="k">try</span><span class="p">:</span>
                    <span class="n">result</span> <span class="o">=</span> <span class="k">await</span> <span class="n">func</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
                    <span class="n">span</span><span class="p">.</span><span class="n">set_status</span><span class="p">(</span><span class="n">Status</span><span class="p">(</span><span class="n">StatusCode</span><span class="p">.</span><span class="n">OK</span><span class="p">))</span>
                    <span class="k">return</span> <span class="n">result</span>
                <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
                    <span class="n">span</span><span class="p">.</span><span class="n">set_status</span><span class="p">(</span><span class="n">Status</span><span class="p">(</span><span class="n">StatusCode</span><span class="p">.</span><span class="n">ERROR</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">)))</span>
                    <span class="n">span</span><span class="p">.</span><span class="n">record_exception</span><span class="p">(</span><span class="n">e</span><span class="p">)</span>
                    <span class="k">raise</span>
        
        <span class="c1"># ... sync wrapper similar
</span>        <span class="k">return</span> <span class="n">async_wrapper</span> <span class="k">if</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">iscoroutinefunction</span><span class="p">(</span><span class="n">func</span><span class="p">)</span> <span class="k">else</span> <span class="n">sync_wrapper</span>
    <span class="k">return</span> <span class="n">decorator</span>
</code></pre></div></div>

<h3 id="logs-structured-with-structlog">Logs: Structured with structlog</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">structlog</span>

<span class="k">def</span> <span class="nf">setup_logging</span><span class="p">(</span>
    <span class="n">log_level</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"INFO"</span><span class="p">,</span>
    <span class="n">json_logs</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span><span class="p">,</span>  <span class="c1"># JSON for Loki, console for dev
</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="s">"""Configure structured logging."""</span>
    
    <span class="n">processors</span> <span class="o">=</span> <span class="p">[</span>
        <span class="n">structlog</span><span class="p">.</span><span class="n">stdlib</span><span class="p">.</span><span class="n">add_logger_name</span><span class="p">,</span>
        <span class="n">add_log_level</span><span class="p">,</span>
        <span class="n">add_timestamp</span><span class="p">,</span>
        <span class="n">structlog</span><span class="p">.</span><span class="n">processors</span><span class="p">.</span><span class="n">StackInfoRenderer</span><span class="p">(),</span>
        <span class="n">structlog</span><span class="p">.</span><span class="n">processors</span><span class="p">.</span><span class="n">format_exc_info</span><span class="p">,</span>
    <span class="p">]</span>
    
    <span class="k">if</span> <span class="n">json_logs</span><span class="p">:</span>
        <span class="n">processors</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">structlog</span><span class="p">.</span><span class="n">processors</span><span class="p">.</span><span class="n">JSONRenderer</span><span class="p">())</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">processors</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">structlog</span><span class="p">.</span><span class="n">dev</span><span class="p">.</span><span class="n">ConsoleRenderer</span><span class="p">(</span><span class="n">colors</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span>
    
    <span class="n">structlog</span><span class="p">.</span><span class="n">configure</span><span class="p">(</span>
        <span class="n">processors</span><span class="o">=</span><span class="n">processors</span><span class="p">,</span>
        <span class="n">wrapper_class</span><span class="o">=</span><span class="n">structlog</span><span class="p">.</span><span class="n">stdlib</span><span class="p">.</span><span class="n">BoundLogger</span><span class="p">,</span>
        <span class="n">logger_factory</span><span class="o">=</span><span class="n">structlog</span><span class="p">.</span><span class="n">stdlib</span><span class="p">.</span><span class="n">LoggerFactory</span><span class="p">(),</span>
    <span class="p">)</span>
</code></pre></div></div>

<h3 id="helper-logger-classes">Helper Logger Classes</h3>

<p>Domain-specific loggers for cleaner call sites:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">InferenceLogger</span><span class="p">:</span>
    <span class="s">"""Helper for logging inference-related events."""</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">logger_name</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"sentinel.inference"</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">logger</span> <span class="o">=</span> <span class="n">get_logger</span><span class="p">(</span><span class="n">logger_name</span><span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">request_started</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">request_id</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">route</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">backend</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">tier</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
        <span class="n">tier_label</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">entities</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="s">"""Log when an inference request starts."""</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span>
            <span class="s">"Inference request started"</span><span class="p">,</span>
            <span class="n">request_id</span><span class="o">=</span><span class="n">request_id</span><span class="p">,</span>
            <span class="n">route</span><span class="o">=</span><span class="n">route</span><span class="p">,</span>
            <span class="n">backend</span><span class="o">=</span><span class="n">backend</span><span class="p">,</span>
            <span class="n">privacy_tier</span><span class="o">=</span><span class="n">tier</span><span class="p">,</span>
            <span class="n">privacy_tier_label</span><span class="o">=</span><span class="n">tier_label</span><span class="p">,</span>
            <span class="n">entities_detected</span><span class="o">=</span><span class="n">entities</span>
        <span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">request_completed</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">request_id</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">backend</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">model</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">total_tokens</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
        <span class="n">latency_ms</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span>
        <span class="n">cost_usd</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="s">"""Log when an inference request completes."""</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span>
            <span class="s">"Inference request completed"</span><span class="p">,</span>
            <span class="n">request_id</span><span class="o">=</span><span class="n">request_id</span><span class="p">,</span>
            <span class="n">backend</span><span class="o">=</span><span class="n">backend</span><span class="p">,</span>
            <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span>
            <span class="n">total_tokens</span><span class="o">=</span><span class="n">total_tokens</span><span class="p">,</span>
            <span class="n">latency_ms</span><span class="o">=</span><span class="nb">round</span><span class="p">(</span><span class="n">latency_ms</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
            <span class="n">cost_usd</span><span class="o">=</span><span class="nb">round</span><span class="p">(</span><span class="n">cost_usd</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span>
        <span class="p">)</span>


<span class="k">class</span> <span class="nc">ClassificationLogger</span><span class="p">:</span>
    <span class="s">"""Helper for logging classification events."""</span>
    
    <span class="k">def</span> <span class="nf">sensitive_content_detected</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">tier</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
        <span class="n">tier_label</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">entities</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">]</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="s">"""Log when sensitive content is detected (tier &gt;= 2)."""</span>
        <span class="k">if</span> <span class="n">tier</span> <span class="o">&gt;=</span> <span class="mi">2</span><span class="p">:</span>
            <span class="c1"># Don't log the actual content, just metadata
</span>            <span class="bp">self</span><span class="p">.</span><span class="n">logger</span><span class="p">.</span><span class="n">warning</span><span class="p">(</span>
                <span class="s">"Sensitive content detected"</span><span class="p">,</span>
                <span class="n">privacy_tier</span><span class="o">=</span><span class="n">tier</span><span class="p">,</span>
                <span class="n">privacy_tier_label</span><span class="o">=</span><span class="n">tier_label</span><span class="p">,</span>
                <span class="n">entity_types</span><span class="o">=</span><span class="p">[</span><span class="n">e</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"entity_type"</span><span class="p">)</span> <span class="k">for</span> <span class="n">e</span> <span class="ow">in</span> <span class="n">entities</span><span class="p">],</span>
                <span class="n">action</span><span class="o">=</span><span class="s">"routing_to_local"</span>
            <span class="p">)</span>
</code></pre></div></div>

<h3 id="the-grafana-dashboards">The Grafana Dashboards</h3>

<p>Two Grafana dashboards split the visibility work: one for operational health, one for ML and routing quality.</p>

<p><img src="/assets/images/inference-sentinel/generic-graphana02.png" alt="Overview Dashboard" />
<em>Real-time monitoring: request rate, TTFT/ITL by model, route distribution, backend health, and cost tracking</em></p>

<p><strong>Overview Dashboard</strong> — operational health</p>
<ul>
  <li><strong>Backend Health</strong>: gemma and mistral both healthy</li>
  <li><strong>Route Distribution</strong>: 70% local, 30% cloud</li>
  <li><strong>Request Share by Model</strong>: Traffic distribution across all backends</li>
  <li>Per-model TTFT, ITL, and TPOT</li>
  <li>Cost accumulation</li>
</ul>

<p><img src="/assets/images/inference-sentinel/generic-graphana01.png" alt="Controller Dashboard" />
<em>Controller Decisions Data: Similarity Score, Shadow Comparison, Cost Savings and Routing Recommendations</em></p>

<p><strong>Controller Dashboard</strong> — ML/quality metrics</p>
<ul>
  <li>Similarity score trends</li>
  <li>Shadow comparison counts</li>
  <li>Latency differential (local - cloud)</li>
  <li>Cost savings over time</li>
  <li>Routing recommendations log</li>
</ul>

<h3 id="design-decision-metric-cardinality">Design Decision: Metric Cardinality</h3>

<p><strong>Trade-off considered:</strong> More labels = more granular data, but also more time series (cardinality explosion).</p>

<p><strong>My approach:</strong></p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">route</code>: 2 values (local, cloud)</li>
  <li><code class="language-plaintext highlighter-rouge">backend</code>: 4 values (ollama, anthropic, google, unknown)</li>
  <li><code class="language-plaintext highlighter-rouge">endpoint</code>: ~4 values (gemma, mistral, anthropic, google)</li>
  <li><code class="language-plaintext highlighter-rouge">model</code>: ~6 values (bounded by configured models)</li>
  <li><code class="language-plaintext highlighter-rouge">tier</code>: 4 values (0, 1, 2, 3)</li>
  <li><code class="language-plaintext highlighter-rouge">status</code>: 2 values (success, error)</li>
</ul>

<p>Total theoretical cardinality: 2 × 4 × 4 × 6 × 4 × 2 = 1,536 series. Acceptable for a single-instance deployment.</p>
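
<p>That product is worth recomputing whenever a label is added, since each new label multiplies the series count. A quick check in Python:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

# Label value counts from the schema above
label_values = {
    "route": 2,      # local, cloud
    "backend": 4,    # ollama, anthropic, google, unknown
    "endpoint": 4,   # gemma, mistral, anthropic, google
    "model": 6,      # bounded by configured models
    "tier": 4,       # 0-3
    "status": 2,     # success, error
}

total = math.prod(label_values.values())
print(total)  # 1536 series worst case, per metric
</code></pre></div></div>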

<p><strong>Lesson learned:</strong> I added the <code class="language-plaintext highlighter-rouge">model</code> label to <code class="language-plaintext highlighter-rouge">REQUESTS_TOTAL</code> late, which broke existing Grafana queries. Design your metrics schema before writing dashboards.</p>

<hr />

<h2 id="deployment-architecture">Deployment Architecture</h2>

<h3 id="docker-compose-stack">Docker Compose Stack</h3>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">services</span><span class="pi">:</span>
  <span class="na">sentinel</span><span class="pi">:</span>       <span class="c1"># The gateway (FastAPI)</span>
  <span class="na">prometheus</span><span class="pi">:</span>     <span class="c1"># Metrics storage</span>
  <span class="na">grafana</span><span class="pi">:</span>        <span class="c1"># Visualization</span>
  <span class="na">loki</span><span class="pi">:</span>           <span class="c1"># Log aggregation  </span>
  <span class="na">tempo</span><span class="pi">:</span>          <span class="c1"># Distributed tracing</span>
  
  <span class="c1"># Optional - only with --profile containerized</span>
  <span class="na">ollama</span><span class="pi">:</span>         <span class="c1"># Local inference (if no native Ollama)</span>
</code></pre></div></div>

<p><strong>Default setup:</strong> Ollama runs natively on the Mac Mini M4 to leverage Metal GPU acceleration. The gateway connects via <code class="language-plaintext highlighter-rouge">host.docker.internal:11434</code>.</p>

<p><strong>Containerized option:</strong> For environments without native Ollama, run with the profile flag:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker-compose <span class="nt">--profile</span> containerized up <span class="nt">-d</span>
</code></pre></div></div>

<p>This pulls and runs <code class="language-plaintext highlighter-rouge">ollama/ollama:latest</code> with gemma3:4b and mistral pre-loaded.</p>

<h3 id="design-decision-why-not-kubernetes">Design Decision: Why Not Kubernetes?</h3>

<p><strong>Trade-off considered:</strong></p>

<table>
  <thead>
    <tr>
      <th>Deployment</th>
      <th>Pros</th>
      <th>Cons</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Docker Compose</td>
      <td>Simple, local-friendly</td>
      <td>Single node only</td>
    </tr>
    <tr>
      <td>Kubernetes</td>
      <td>Scalable, production-grade</td>
      <td>Complexity overhead</td>
    </tr>
  </tbody>
</table>

<p><strong>My choice:</strong> Docker Compose for v1.</p>

<p><strong>Reasoning:</strong></p>
<ol>
  <li>Primary use case is local inference on a single Mac Mini M4</li>
  <li>K8s manifests are planned for Phase 6</li>
  <li>Compose is sufficient to demonstrate the architecture</li>
  <li>Lower barrier to entry for people trying the project</li>
</ol>

<hr />

<h2 id="configuration-philosophy">Configuration Philosophy</h2>

<h3 id="environment-variables-vs-env-file">Environment Variables vs .env File</h3>

<p>I use <a href="https://docs.pydantic.dev/latest/concepts/pydantic_settings/">pydantic-settings</a> which supports both:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Settings</span><span class="p">(</span><span class="n">BaseSettings</span><span class="p">):</span>
    <span class="n">model_config</span> <span class="o">=</span> <span class="n">SettingsConfigDict</span><span class="p">(</span>
        <span class="n">env_prefix</span><span class="o">=</span><span class="s">"SENTINEL_"</span><span class="p">,</span>
        <span class="n">env_nested_delimiter</span><span class="o">=</span><span class="s">"__"</span><span class="p">,</span>
        <span class="n">env_file</span><span class="o">=</span><span class="s">".env"</span><span class="p">,</span>
        <span class="n">env_file_encoding</span><span class="o">=</span><span class="s">"utf-8"</span><span class="p">,</span>
        <span class="n">extra</span><span class="o">=</span><span class="s">"ignore"</span><span class="p">,</span>
    <span class="p">)</span>
    
    <span class="c1"># Nested config classes
</span>    <span class="n">local</span><span class="p">:</span> <span class="n">LocalBackendsConfig</span>
    <span class="n">cloud</span><span class="p">:</span> <span class="n">CloudBackendsConfig</span>
    <span class="n">cloud_selection</span><span class="p">:</span> <span class="n">CloudSelectionConfig</span>
    <span class="n">session</span><span class="p">:</span> <span class="n">SessionConfig</span>
    <span class="n">shadow</span><span class="p">:</span> <span class="n">ShadowConfig</span>
    <span class="n">controller</span><span class="p">:</span> <span class="n">ControllerSettings</span>
    <span class="n">telemetry</span><span class="p">:</span> <span class="n">TelemetryConfig</span>
</code></pre></div></div>

<p><strong>Priority order:</strong></p>
<ol>
  <li>Environment variables (highest)</li>
  <li><code class="language-plaintext highlighter-rouge">.env</code> file</li>
  <li>Default values (lowest)</li>
</ol>

<p><strong>Lesson learned:</strong> This caused a subtle bug. <code class="language-plaintext highlighter-rouge">SENTINEL_LOCAL__SELECTION_STRATEGY=priority</code> in my shell was overriding <code class="language-plaintext highlighter-rouge">selection_strategy: round_robin</code> that I expected from defaults. The nested delimiter <code class="language-plaintext highlighter-rouge">__</code> is powerful but can be surprising.</p>
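
<p>The mapping from an environment variable to a nested field is easier to reason about once written out. This is an illustrative sketch of what the <code class="language-plaintext highlighter-rouge">env_nested_delimiter</code> behavior looks like, not pydantic-settings' actual implementation:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import os

def parse_env(prefix="SENTINEL_", delim="__", environ=None):
    """Sketch: map PREFIX_A__B=v env vars onto {'a': {'b': 'v'}}."""
    environ = environ if environ is not None else os.environ
    cfg = {}
    for key, value in environ.items():
        if not key.startswith(prefix):
            continue
        path = key[len(prefix):].lower().split(delim)
        node = cfg
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return cfg

print(parse_env(environ={"SENTINEL_LOCAL__SELECTION_STRATEGY": "priority"}))
# {'local': {'selection_strategy': 'priority'}}
</code></pre></div></div>

<p>The shell override winning is exactly the priority order above: the environment variable beats both the <code class="language-plaintext highlighter-rouge">.env</code> file and the coded default.</p>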

<h3 id="semantic-validation-with-pydantic">Semantic Validation with Pydantic</h3>

<p>After getting bitten by invalid config values, I added range constraints:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">SessionConfig</span><span class="p">(</span><span class="n">BaseSettings</span><span class="p">):</span>
    <span class="n">lock_threshold_tier</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span>
        <span class="n">default</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
        <span class="n">ge</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>       <span class="c1"># Must be &gt;= 1
</span>        <span class="n">le</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span>       <span class="c1"># Must be &lt;= 3
</span>        <span class="n">description</span><span class="o">=</span><span class="s">"Minimum tier to trigger LOCAL_LOCKED"</span>
    <span class="p">)</span>
    <span class="n">ttl_seconds</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span>
        <span class="n">default</span><span class="o">=</span><span class="mi">900</span><span class="p">,</span>
        <span class="n">ge</span><span class="o">=</span><span class="mi">60</span><span class="p">,</span>      <span class="c1"># At least 1 minute
</span>        <span class="n">le</span><span class="o">=</span><span class="mi">86400</span><span class="p">,</span>   <span class="c1"># At most 1 day
</span>    <span class="p">)</span>
</code></pre></div></div>

<p>This prevents <code class="language-plaintext highlighter-rouge">lock_threshold_tier: 5</code> from being accepted — pydantic will raise a validation error at startup.</p>
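
<p>A minimal reproduction of the fail-fast behavior, using plain <code class="language-plaintext highlighter-rouge">BaseModel</code> so it doesn't pull in pydantic-settings (assumes pydantic v2):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from pydantic import BaseModel, Field, ValidationError

class SessionConfig(BaseModel):
    lock_threshold_tier: int = Field(default=2, ge=1, le=3)

try:
    SessionConfig(lock_threshold_tier=5)
except ValidationError as exc:
    # 5 violates le=3, so the config is rejected at construction time
    print(exc.errors()[0]["loc"])  # ('lock_threshold_tier',)
</code></pre></div></div>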

<hr />

<h2 id="what-id-do-differently">What I’d Do Differently</h2>

<h3 id="1-start-with-structured-logging">1. Start with Structured Logging</h3>

<p>I added structlog late. Retrofitting structured logging is painful. Start with it from day one.</p>

<h3 id="2-integration-tests-earlier">2. Integration Tests Earlier</h3>

<p>Unit tests are great but don’t catch issues like “the regex works but the NER model isn’t loaded in Docker.” Integration tests against a real Ollama instance would have caught several bugs earlier.</p>

<h3 id="3-add-semantic-validators-early">3. Add Semantic Validators Early</h3>

<p>I initially relied on pydantic’s type validation alone. After deploying with an invalid <code class="language-plaintext highlighter-rouge">lock_threshold_tier: 5</code>, I learned to add <code class="language-plaintext highlighter-rouge">ge=</code>/<code class="language-plaintext highlighter-rouge">le=</code> constraints from the start. Now invalid configs fail fast at startup.</p>

<h3 id="4-metrics-design-up-front">4. Metrics Design Up Front</h3>

<p>I added the <code class="language-plaintext highlighter-rouge">model</code> label to <code class="language-plaintext highlighter-rouge">REQUESTS_TOTAL</code> late, which broke existing Grafana queries. Design your metrics schema before writing dashboards.</p>

<hr />

<h2 id="coming-up-part-2">Coming Up: Part 2</h2>

<p>In Part 2, I’ll share the benchmarking results:</p>

<ul>
  <li>Classification accuracy across 1000+ test cases</li>
  <li>Latency percentiles (TTFT, ITL, TPOT) by model</li>
  <li>Shadow mode similarity distributions</li>
  <li>Cost analysis: cloud vs local routing</li>
  <li>Controller recommendation accuracy</li>
</ul>

<hr />

<p><strong>GitHub:</strong> <a href="https://github.com/kraghavan/inference-sentinel">github.com/kraghavan/inference-sentinel</a></p>

<p><em>Questions or feedback? Connect with me on <a href="https://linkedin.com/in/karthikaraghavan">LinkedIn</a>.</em></p>]]></content><author><name>Karthika Raghavan</name></author><category term="llm" /><category term="infrastructure" /><category term="smart gateway" /><category term="python" /><category term="privacy" /><category term="observability" /><category term="distributed-systems" /><summary type="html"><![CDATA[Part 1 of 2: Design decisions, trade-offs, and lessons from building inference-sentinel]]></summary></entry></feed>