Recent Posts

Software Layers Before Inference At Scale

The LLM Serving Stack You Actually Need: From NGINX to llm-d to GPU

I Built a Five-Agent SRE War Room and Grafana Watched Every Token

Grafana AI Observability launched on April 21. I spent a few days building a five-agent incident response system to put it through its paces. Here’s what act...

P/D Disaggregation on a Single GPU — What the Architecture Actually Requires

I deployed llm-d’s P/D disaggregation guide — separate prefill and decode pods, NIXL sidecar, the full architecture. The pods ran. The NIXL KV transfers show...

llm-d in Action — EPP Prefix Cache Routing and What It Actually Means

The stack is deployed. Now let’s see what it actually does. EPP prefix cache routing, 81.1% KV cache hit rate, TTFT at 15ms p50, and what those numbers mean ...

Deploying llm-d on a Cloud GPU — The 10 Things Nobody Tells You

I deployed llm-d on a Lambda Labs GH200. Nothing worked first try. Here is the honest account of what broke, why, and how to fix it — so you don’t spend your...

Treating the M4 Mac Mini Like a Production Inference Server (It Tried)

I treated an M4 Mac Mini as a production-like inference environment — wired up Prometheus, Grafana, a kind cluster with nginx, and ran real load tests. Here’...

What Is LLM Inference, Really? A Deep Technical Walkthrough

An engineer’s annotated tour through what actually happens when you hit send — from bytes to tokens to embeddings to attention to the word your model finally...

Schema Travels Architecture

Translating SQL to NoSQL: Architecture Deep-Dive

Building a Privacy-Aware LLM Gateway: Benchmarking Results

Part 2 of 2: Empirical evaluation of classification accuracy, routing performance, and cost attribution — with honest analysis of failure modes

Building a Privacy-Aware LLM Gateway: Architecture Deep-Dive

Part 1 of 2: Design decisions, trade-offs, and lessons from building inference-sentinel

Karthika Raghavan
