DigitalOcean partners to cut inference costs
DigitalOcean has partnered with vLLM developer Inferact to bring prefix-aware routing and prefix caching to its Serverless Inference platform. By tracking vLLM instance KV cache events, the Inference Gateway routes requests to pods with matching warm cache blocks, boosting hit rates from 25% to over 75%.
Standard load balancers are the silent killers of LLM cache efficiency, making custom prefix-aware routers a mandatory architectural pattern for high-scale agentic and RAG workloads.
* Standard round-robin load balancing negates engine-level prefix caching by spreading requests across instances, resulting in a low ~25% cache hit rate.
* DigitalOcean's Inference Gateway (using Envoy's ext_proc callback and a global KV-Block Index) resolves this mismatch by routing requests based on prefix hash affinity.
* Expanding GPU memory capacities (e.g., AMD MI325X with 192GB, H200 with 141GB) combined with FP8 KV quantization increases the lifespan of cached prefixes, maximizing the effectiveness of routing heuristics.
* The introduction of tiered GPU/CPU caches and cached token pricing discounts makes serverless inference commercially competitive with dedicated GPU provisioning.
DISCOVERED
1h ago
2026-06-03
PUBLISHED
1h ago
2026-06-03
RELEVANCE
AUTHOR
digitalocean
