Software Layers Before Inference At Scale
The LLM Serving Stack You Actually Need: From NGINX to llm-d to GPU
Grafana AI Observability launched on April 21. I spent a few days building a five-agent incident response system to put it through its paces. Here’s what act...
I deployed llm-d’s P/D disaggregation guide — separate prefill and decode pods, NIXL sidecar, the full architecture. The pods ran. The NIXL KV transfers show...
The stack is deployed. Now let’s see what it actually does. EPP prefix cache routing, 81.1% KV cache hit rate, TTFT at 15ms p50, and what those numbers mean ...
I deployed llm-d on a Lambda Labs GH200. Nothing worked first try. Here is the honest account of what broke, why, and how to fix it — so you don’t spend your...
I treated an M4 Mac Mini as a production-like inference environment — wired up Prometheus, Grafana, a kind cluster with nginx, and ran real load tests. Here’...
An engineer’s annotated tour through what actually happens when you hit send: from bytes to tokens to embeddings to attention to the word your model finally...
Translating SQL to NoSQL: Architecture Deep-Dive
Part 2 of 2: Empirical evaluation of classification accuracy, routing performance, and cost attribution — with honest analysis of failure modes
Part 1 of 2: Design decisions, trade-offs, and lessons from building inference-sentinel