DigitalOcean partners to cut inference costs

// 45d ago · INFRASTRUCTURE

DigitalOcean has partnered with vLLM developer Inferact to bring prefix-aware routing and prefix caching to its Serverless Inference platform. By tracking KV cache events emitted by vLLM instances, the Inference Gateway routes each request to a pod that already holds matching warm cache blocks, raising cache hit rates from 25% to over 75%.
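
The routing idea described above can be sketched in a few lines of Python. This is a minimal illustration, not DigitalOcean's or vLLM's implementation: the names `block_hashes`, `KVBlockIndex`, `on_blocks_stored`, and `on_blocks_removed` are hypothetical, the block size of 4 tokens is chosen only to keep the example small, and the round-robin fallback for cold requests is an assumption.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 4  # tokens per KV block (illustrative; real engines use larger blocks)


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chain-hash each full block, so one hash identifies the whole prefix up to it."""
    hashes, parent = [], b""
    full = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for i in range(0, full, block_size):
        chunk = ",".join(map(str, tokens[i:i + block_size])).encode()
        parent = hashlib.sha256(parent + chunk).digest()
        hashes.append(parent)
    return hashes


class KVBlockIndex:
    """Hypothetical global index: block hash -> pods that currently cache that block."""

    def __init__(self, pods):
        self.pods = list(pods)
        self.block_to_pods = defaultdict(set)
        self._rr = 0  # round-robin cursor for requests with no warm prefix

    def on_blocks_stored(self, pod, hashes):
        # Called when a pod reports that it cached these blocks.
        for h in hashes:
            self.block_to_pods[h].add(pod)

    def on_blocks_removed(self, pod, hashes):
        # Called when a pod reports that it evicted these blocks.
        for h in hashes:
            self.block_to_pods[h].discard(pod)

    def route(self, tokens):
        """Return (pod, matched_blocks) for the pod with the longest warm prefix."""
        candidates, matched = set(self.pods), 0
        for h in block_hashes(tokens):
            holders = candidates & self.block_to_pods.get(h, set())
            if not holders:
                break
            candidates, matched = holders, matched + 1
        if matched == 0:
            pod = self.pods[self._rr % len(self.pods)]
            self._rr += 1
            return pod, 0
        return sorted(candidates)[0], matched  # deterministic tie-break


index = KVBlockIndex(["pod-a", "pod-b"])
system_prompt = list(range(8))  # shared 8-token prefix = 2 blocks

first = system_prompt + [100, 101, 102, 103]
pod, matched = index.route(first)            # cold: falls back to round-robin
index.on_blocks_stored(pod, block_hashes(first))

second = system_prompt + [200, 201, 202, 203]
print(index.route(second))                   # ('pod-a', 2): two warm prefix blocks
```

A plain round-robin balancer would have sent the second request to `pod-b`, which holds none of the shared prefix. A production router would also weigh pod load against prefix affinity; that trade-off is omitted here.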

// ANALYSIS

Standard load balancers are the silent killers of LLM cache efficiency, making custom prefix-aware routers a mandatory architectural pattern for high-scale agentic and RAG workloads.

* Standard round-robin load balancing undermines engine-level prefix caching: requests that share a prefix are spread across instances, so each pod rarely sees a prefix it has already cached and the cache hit rate stays around 25%.

* DigitalOcean's Inference Gateway (built on Envoy's ext_proc external processing filter and a global KV-Block Index) resolves this mismatch by routing requests based on prefix hash affinity.

* Expanding GPU memory capacities (e.g., AMD MI325X with 256GB, H200 with 141GB) combined with FP8 KV quantization let cached prefixes stay resident longer before eviction, which makes prefix-affinity routing more effective.

* The introduction of tiered GPU/CPU caches and cached token pricing discounts makes serverless inference commercially competitive with dedicated GPU provisioning.
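
The memory argument in the bullets above comes down to simple arithmetic: each cached token costs one key and one value vector per layer per KV head. The sketch below uses assumed model dimensions (80 layers, 8 KV heads, head dimension 128) and an assumed 40 GiB KV cache budget; none of these figures come from the article, and the function names are hypothetical.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Bytes of KV cache one token occupies: a K and a V vector per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def tokens_that_fit(kv_budget_bytes, bytes_per_token):
    """How many tokens of cached prefix fit in a given KV cache budget."""
    return kv_budget_bytes // bytes_per_token


# Assumed, illustrative dimensions and budget (not taken from the article).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
KV_BUDGET = 40 * 2**30  # 40 GiB reserved for KV cache

fp16 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_element=2)
fp8 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_element=1)

print(fp16, tokens_that_fit(KV_BUDGET, fp16))  # 327680 131072
print(fp8, tokens_that_fit(KV_BUDGET, fp8))    # 163840 262144
```

Under these assumptions, halving the bytes per element doubles the number of prefix tokens that stay resident, so a warm prefix is less likely to be evicted before the router sends the next matching request to that pod.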

// TAGS

digitalocean · vllm · prefix-caching · load-balancing · serverless-inference · inference-gateway · gpu-optimization

DISCOVERED

45d ago

2026-06-03

PUBLISHED

45d ago

2026-06-03

RELEVANCE

8/10

AUTHOR

digitalocean
