llm-d orchestrates Kubernetes LLM inference
llm-d is a Kubernetes-native orchestration framework for distributed and disaggregated LLM inference serving, built on top of engines like vLLM and SGLang. By integrating with the Kubernetes Gateway API Inference Extension, llm-d provides prefix-cache-aware routing, tiered KV-cache offloading, disaggregated prefill/decode serving, and SLO-aware autoscaling driven by queue demand.
Scaling LLM inference across clusters is currently fragmented, and llm-d's approach of bringing standardized, Kubernetes-native orchestration to the inference layer is a major step toward productizing AI infrastructure.
- Prefix-Cache-Aware Routing: Centrally tracking KV-cache locations and routing each request to a node that already holds matching context avoids recomputing shared prefixes, for a reported 3x higher output throughput and 2x faster time to first token (TTFT).
- Disaggregated Serving: Separating compute-heavy prefill operations from latency-sensitive decode operations lets infrastructure teams improve hardware utilization by allocating separate, specialized GPU pools to each phase.
- Heavyweight Backing: Supported by Red Hat, Google Cloud, NVIDIA, IBM Research, and CoreWeave, the project has a strong foundation to establish itself as the standard for enterprise Kubernetes AI deployments.
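To make the prefix-cache-aware routing idea concrete, here is a minimal Python sketch. It is not llm-d's implementation: the names `block_hashes`, `PrefixIndex`, and `pick_pod`, the 4-token block size, and the "longest cached prefix, then shortest queue" scoring rule are all assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 4  # tokens per cache block (illustrative assumption)


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Return chained hashes, one per full block, so each hash identifies a whole prefix."""
    hashes = []
    parent = ""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        digest = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(digest)
        parent = digest  # chaining ties each block to everything before it
    return hashes


class PrefixIndex:
    """Hypothetical central index: block hash -> pods believed to hold that block."""

    def __init__(self):
        self._pods_by_block = defaultdict(set)

    def record(self, pod, tokens):
        """Note that `pod` has served (and presumably cached) this token sequence."""
        for h in block_hashes(tokens):
            self._pods_by_block[h].add(pod)

    def matched_blocks(self, pod, tokens):
        """Count leading blocks of `tokens` that `pod` is believed to have cached."""
        count = 0
        for h in block_hashes(tokens):
            if pod not in self._pods_by_block.get(h, ()):
                break
            count += 1
        return count


def pick_pod(index, queue_depth, tokens):
    """Prefer the longest cached prefix; break ties by the shortest queue."""
    return max(
        queue_depth,
        key=lambda pod: (index.matched_blocks(pod, tokens), -queue_depth[pod]),
    )


index = PrefixIndex()
index.record("pod-a", [1, 2, 3, 4, 5, 6, 7, 8])
index.record("pod-b", [1, 2, 3, 4, 9, 9, 9, 9])
queue_depth = {"pod-a": 3, "pod-b": 0, "pod-c": 0}

# pod-a holds both leading blocks of this prompt, so it wins despite a longer queue.
warm_choice = pick_pod(index, queue_depth, [1, 2, 3, 4, 5, 6, 7, 8, 10, 11])

# No pod has this prefix cached, so the choice falls back to queue depth alone.
cold_choice = pick_pod(index, queue_depth, [42, 43, 44, 45])
```

A production router would likely weigh cache hits against load rather than ranking them strictly, and would need to handle cache eviction; this sketch only shows why tracking cache locations centrally lets a router avoid repeated prefill work.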
DISCOVERED
2026-06-29
PUBLISHED
2026-06-29
AUTHOR
GithubProjects