llm-d orchestrates Kubernetes LLM inference

// 1h ago OPENSOURCE RELEASE

llm-d is a Kubernetes-native orchestration framework for distributed and disaggregated LLM inference serving on top of engines like vLLM and SGLang. By integrating with the Kubernetes Gateway API (Inference Extension), llm-d provides prefix-cache-aware routing, tiered KV-cache offloading, disaggregated prefill/decode serving, and SLO-aware autoscaling based on queue demand.
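Autoscaling on queue demand can be illustrated with the proportional rule used by the Kubernetes Horizontal Pod Autoscaler: scale the replica count by the ratio of observed to target metric. The sketch below is only an illustration of that idea, not llm-d's autoscaler; the function name, the per-replica queue-depth metric, and the default bounds are all assumptions made for the example.

```python
import math


def desired_replicas(current_replicas, queue_depth_per_replica, target_queue_depth,
                     min_replicas=1, max_replicas=8, tolerance=0.1):
    """Return a replica count sized to the observed per-replica queue depth.

    Hypothetical helper: follows the HPA-style rule
    desired = ceil(current * observed / target), skipping changes when the
    ratio is within `tolerance` of 1.0 to avoid flapping.
    """
    ratio = queue_depth_per_replica / target_queue_depth
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do nothing
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (assumed values, not llm-d defaults).
    return max(min_replicas, min(max_replicas, desired))
```

With 4 replicas each holding 12 queued requests against a target of 8, the rule asks for 6 replicas. A production SLO-aware policy would also weigh latency targets such as time to first token, which this queue-only sketch ignores.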

// ANALYSIS

Scaling LLM inference across clusters is currently fragmented. By bringing standardized, Kubernetes-native orchestration to the inference layer, llm-d takes a major step toward productizing AI infrastructure.

– Prefix-cache-aware routing: Centrally tracking KV-cache locations and routing each request to a node that already holds a matching prefix avoids recomputing shared context, for a reported 3x higher output throughput and 2x faster time to first token (TTFT).
– Disaggregated serving: Separating compute-heavy prefill from latency-sensitive decode lets infrastructure teams give each phase its own specialized GPU pool and improve hardware utilization.
– Heavyweight backing: Supported by Red Hat, Google Cloud, NVIDIA, IBM Research, and CoreWeave, the project is well positioned to become the standard for enterprise Kubernetes AI deployments.
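The prefix-cache-aware routing idea in the first point can be sketched in a few lines. This is a toy model under stated assumptions, not llm-d's scheduler: the `PrefixAwareRouter` class, the block size of 4 tokens, and the "longest cached prefix, then lowest load" scoring are all hypothetical choices made for illustration.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 4  # tokens per cache block (illustrative value only)


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chain-hash each full block so one hash identifies the whole prefix up to it."""
    hashes, prev = [], b""
    full = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for i in range(0, full, block_size):
        block = ",".join(map(str, tokens[i:i + block_size])).encode()
        prev = hashlib.sha256(prev + block).digest()
        hashes.append(prev)
    return hashes


class PrefixAwareRouter:
    """Hypothetical central index: block hash -> pods believed to cache that block."""

    def __init__(self, pods):
        self.pods = list(pods)
        self.index = defaultdict(set)
        self.load = {p: 0 for p in self.pods}  # requests routed so far

    def matched_blocks(self, pod, hashes):
        """Count leading blocks of the request that the pod already holds."""
        n = 0
        for h in hashes:
            if pod not in self.index.get(h, ()):
                break
            n += 1
        return n

    def route(self, tokens):
        hashes = block_hashes(tokens)
        # Prefer the longest cached prefix; break ties by lowest load.
        best = max(self.pods,
                   key=lambda p: (self.matched_blocks(p, hashes), -self.load[p]))
        for h in hashes:
            self.index[h].add(best)  # the chosen pod will now hold these blocks
        self.load[best] += 1
        return best
```

Two requests that share a system prompt land on the same pod even when another pod is less loaded, because the shared blocks hash identically and the second request finds them in the index. A real router would also need cache-eviction events and load that decreases as requests finish.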

// TAGS

llm-d, kubernetes, inference, vllm, sglang, kv-cache, routing, opensource

DISCOVERED

1h ago

2026-06-29

PUBLISHED

1h ago

2026-06-29

RELEVANCE

8/10

AUTHOR

GithubProjects
