GPT-OSS-120B hits 1B tokens/day locally
OPEN_SOURCE ↗
REDDIT // 4d ago // TUTORIAL

A university hospital research lab shares a production write-up on serving GPT-OSS-120B across two H200 GPUs with vLLM and LiteLLM, sustaining more than 1B tokens per day. The post is a practical deployment guide covering throughput, routing, caching, and stability under load.
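The two-replica layout the post describes can be sketched as one vLLM server pinned to each GPU. A minimal sketch only: the model name, ports, and flag values here are illustrative assumptions, not the lab's exact settings.

```python
# Sketch: build the launch command for one vLLM replica pinned to a
# single GPU. Ports and the memory fraction are assumptions.
def replica_cmd(gpu: int, port: int, model: str = "openai/gpt-oss-120b"):
    env = {"CUDA_VISIBLE_DEVICES": str(gpu)}  # pin this replica to one H200
    cmd = [
        "vllm", "serve", model,
        "--port", str(port),
        "--enable-prefix-caching",           # reuse shared prompt prefixes
        "--gpu-memory-utilization", "0.90",  # leave VRAM headroom
    ]
    return env, cmd

# One independent replica per GPU; no tensor parallelism, no NCCL traffic.
replicas = [replica_cmd(gpu, 8000 + gpu) for gpu in (0, 1)]
```

Each command would run as its own process, with LiteLLM fronting both ports as a single endpoint.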

// ANALYSIS

This is less a launch announcement than a rare real-world ops report: the model is fast enough to matter, but the bigger story is how much tuning it takes to keep a local serving stack efficient at sustained scale.

  • Two single-GPU vLLM replicas make more sense here than tensor parallelism: the model fits on one H200, and independent replicas avoid NCCL communication overhead.
  • `simple-shuffle` routing plus prefix caching looks like the right call for mixed high-volume traffic, and the reported load split across the two GPUs is close to even.
  • The writeup surfaces a real production hazard: logprobs requests can spike memory enough to OOM the server, so VRAM headroom is not optional.
  • The weak point is LiteLLM's failover behavior, where cooldowns and retries can ping-pong traffic between replicas and temporarily cut effective capacity in half.
  • The broader takeaway is that MXFP4 on Hopper is currently the sweet spot for open-weight local inference when throughput matters more than squeezing in the largest possible model.
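The logprobs hazard above invites a simple client-side guard. This is a hypothetical helper, not something from the post; the field names follow the OpenAI-style completion API that vLLM serves.

```python
# Hypothetical request guard: cap logprobs-related fields so a single
# request can't balloon per-token data on the server.
def guard_logprobs(request: dict, max_top: int = 5) -> dict:
    safe = dict(request)
    if safe.get("logprobs"):
        # top_logprobs drives how many alternatives are materialized
        # per token, and with it the memory cost of the request.
        if safe.get("top_logprobs", 0) > max_top:
            safe["top_logprobs"] = max_top
    else:
        # logprobs disabled: top_logprobs is meaningless, drop it
        safe.pop("top_logprobs", None)
    return safe
```

A gateway like LiteLLM could apply such a cap in a pre-call hook, keeping VRAM headroom predictable regardless of what clients send.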
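The failover ping-pong is easiest to see in a toy model of shuffle routing with per-replica cooldown. This illustrates the failure mode only; it is not LiteLLM's actual implementation.

```python
import random


class ShuffleRouter:
    """Toy simple-shuffle router with per-replica cooldown."""

    def __init__(self, replicas, cooldown_s: float = 60.0):
        self.replicas = list(replicas)
        self.cooldown_until = {r: 0.0 for r in self.replicas}
        self.cooldown_s = cooldown_s

    def pick(self, now: float) -> str:
        healthy = [r for r in self.replicas if self.cooldown_until[r] <= now]
        if not healthy:
            raise RuntimeError("no healthy replicas")
        return random.choice(healthy)  # shuffle = uniform random pick

    def report_failure(self, replica: str, now: float) -> None:
        # One transient error (e.g. a single OOM) benches the replica
        # for a full cooldown window; in a two-replica pool that
        # immediately halves capacity.
        self.cooldown_until[replica] = now + self.cooldown_s
```

While one replica cools down, every request and retry lands on the survivor, which can then overload and get benched itself once the first replica returns; that alternation is the ping-pong effect the post warns about.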
// TAGS
gpt-oss-120b · llm · inference · gpu · self-hosted · open-weights · mlops

DISCOVERED

2026-04-07

PUBLISHED

2026-04-07

RELEVANCE

8 / 10

AUTHOR

SessionComplete2334