GPT-OSS-120B hits 1B tokens/day locally

// 49d ago · TUTORIAL

A university hospital research lab shares a production writeup on serving GPT-OSS-120B across two H200s with vLLM and LiteLLM, pushing more than 1B tokens/day. The post is a practical deployment guide focused on throughput, routing, caching, and stability under load.
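
To put the headline figure in perspective, the arithmetic below converts 1B tokens/day into an average sustained rate. It is a back-of-the-envelope sketch: the function name is hypothetical, the even split across two replicas is an assumption, and the summary does not say how the total divides between prompt and generated tokens.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def required_tokens_per_second(tokens_per_day: float, replicas: int = 1) -> float:
    """Average sustained throughput each replica must deliver.

    Assumes load is spread evenly over the replicas and over the day,
    which real traffic rarely is; peak rates will be higher.
    """
    return tokens_per_day / SECONDS_PER_DAY / replicas


aggregate = required_tokens_per_second(1e9)
per_gpu = required_tokens_per_second(1e9, replicas=2)

print(f"aggregate: {aggregate:,.0f} tokens/s")
print(f"per GPU:   {per_gpu:,.0f} tokens/s")
```

That works out to roughly 11,600 tokens/s in aggregate, or about 5,800 tokens/s per H200 if the two replicas share the load evenly.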

// ANALYSIS

This is less a launch announcement than a rare real-world ops report: the model is fast enough to matter, but the bigger story is how much tuning it takes to keep a local serving stack efficient at sustained scale.

  • Two single-GPU vLLM replicas make more sense here than tensor parallelism, because the model fits on one H200 and the setup avoids cross-GPU NCCL communication overhead.
  • `simple-shuffle` plus prefix caching looks like the right call for mixed high-volume traffic, and the reported load split between the two GPUs is nearly even.
  • The writeup surfaces a real production hazard: logprobs requests can spike memory enough to OOM the server, so VRAM headroom is not optional.
  • The weak point is LiteLLM failover behavior, where cooldown and retries can create a ping-pong effect that cuts effective capacity in half.
  • The broader takeaway is that MXFP4 on Hopper is currently the sweet spot for open-weight local inference when throughput matters more than squeezing every last bit of model size.
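
The near-even split attributed to `simple-shuffle` is what one would expect from random selection at high request volume. The simulation below is only an illustration of that idea, not LiteLLM's implementation: the replica URLs, function names, and request count are all hypothetical, and it assumes two equally weighted replicas with no cooldowns or retries in play.

```python
import random
from collections import Counter

# Hypothetical endpoints: one single-GPU vLLM server per H200.
REPLICAS = ["http://localhost:8001/v1", "http://localhost:8002/v1"]


def pick_replica(replicas: list[str], rng: random.Random) -> str:
    """Pick a replica uniformly at random, ignoring current load."""
    return rng.choice(replicas)


def simulate(n_requests: int, seed: int = 0) -> Counter:
    """Count how many of n_requests land on each replica."""
    rng = random.Random(seed)
    return Counter(pick_replica(REPLICAS, rng) for _ in range(n_requests))


N_REQUESTS = 100_000
counts = simulate(N_REQUESTS)
shares = {url: n / N_REQUESTS for url, n in counts.items()}

for url, share in shares.items():
    print(f"{url}: {share:.1%}")
```

With enough requests, each replica's share converges toward 50% without the router tracking any state. The simulation does not model the failover case described above, where a cooling-down replica shifts all traffic onto the other one.
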
// TAGS
gpt-oss-120b, llm, inference, gpu, self-hosted, open-weights, mlops

DISCOVERED

49d ago

2026-04-07

PUBLISHED

49d ago

2026-04-07

RELEVANCE

8/10

AUTHOR

SessionComplete2334