GPT-OSS-120B hits 1B tokens/day locally
OPEN_SOURCE ↗
REDDIT // 4d ago // TUTORIAL

A university hospital research lab shares a production write-up on serving GPT-OSS-120B across two H200 GPUs with vLLM and LiteLLM, sustaining more than 1B tokens per day. The post is a practical deployment guide covering throughput, routing, caching, and stability under load.
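The two-replica layout the post describes can be sketched as one vLLM server pinned to each GPU. A minimal sketch only: the model name, ports, and flag values here are illustrative assumptions, not the lab's exact settings.

```python
# Sketch: build the launch command for one vLLM replica pinned to a
# single GPU. Ports and the memory fraction are assumptions.
def replica_cmd(gpu: int, port: int, model: str = "openai/gpt-oss-120b"):
    env = {"CUDA_VISIBLE_DEVICES": str(gpu)}  # pin this replica to one H200
    cmd = [
        "vllm", "serve", model,
        "--port", str(port),
        "--enable-prefix-caching",           # reuse shared prompt prefixes
        "--gpu-memory-utilization", "0.90",  # leave VRAM headroom
    ]
    return env, cmd

# One independent replica per GPU; no tensor parallelism, no NCCL traffic.
replicas = [replica_cmd(gpu, 8000 + gpu) for gpu in (0, 1)]
```

Each command would run as its own process, with LiteLLM fronting both ports as a single endpoint.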

// ANALYSIS

This is less a launch announcement than a rare real-world ops report: the model is fast enough to matter, but the bigger story is how much tuning it takes to keep a local serving stack efficient at sustained scale.

  • Two single-GPU vLLM replicas make more sense here than tensor parallelism: the model fits on one H200, and independent replicas avoid NCCL communication overhead.
  • `simple-shuffle` routing plus prefix caching looks like the right call for mixed high-volume traffic, and the reported load split across the two GPUs is close to even.
  • The writeup surfaces a real production hazard: logprobs requests can spike memory enough to OOM the server, so VRAM headroom is not optional.
  • The weak point is LiteLLM's failover behavior, where cooldowns and retries can ping-pong traffic between replicas and temporarily cut effective capacity in half.
  • The broader takeaway is that MXFP4 on Hopper is currently the sweet spot for open-weight local inference when throughput matters more than squeezing in the largest possible model.
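The logprobs hazard above invites a simple client-side guard. This is a hypothetical helper, not something from the post; the field names follow the OpenAI-style completion API that vLLM serves.

```python
# Hypothetical request guard: cap logprobs-related fields so a single
# request can't balloon per-token data on the server.
def guard_logprobs(request: dict, max_top: int = 5) -> dict:
    safe = dict(request)
    if safe.get("logprobs"):
        # top_logprobs drives how many alternatives are materialized
        # per token, and with it the memory cost of the request.
        if safe.get("top_logprobs", 0) > max_top:
            safe["top_logprobs"] = max_top
    else:
        # logprobs disabled: top_logprobs is meaningless, drop it
        safe.pop("top_logprobs", None)
    return safe
```

A gateway like LiteLLM could apply such a cap in a pre-call hook, keeping VRAM headroom predictable regardless of what clients send.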
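The failover ping-pong is easiest to see in a toy model of shuffle routing with per-replica cooldown. This illustrates the failure mode only; it is not LiteLLM's actual implementation.

```python
import random


class ShuffleRouter:
    """Toy simple-shuffle router with per-replica cooldown."""

    def __init__(self, replicas, cooldown_s: float = 60.0):
        self.replicas = list(replicas)
        self.cooldown_until = {r: 0.0 for r in self.replicas}
        self.cooldown_s = cooldown_s

    def pick(self, now: float) -> str:
        healthy = [r for r in self.replicas if self.cooldown_until[r] <= now]
        if not healthy:
            raise RuntimeError("no healthy replicas")
        return random.choice(healthy)  # shuffle = uniform random pick

    def report_failure(self, replica: str, now: float) -> None:
        # One transient error (e.g. a single OOM) benches the replica
        # for a full cooldown window; in a two-replica pool that
        # immediately halves capacity.
        self.cooldown_until[replica] = now + self.cooldown_s
```

While one replica cools down, every request and retry lands on the survivor, which can then overload and get benched itself once the first replica returns; that alternation is the ping-pong effect the post warns about.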
// TAGS
gpt-oss-120b · llm · inference · gpu · self-hosted · open-weights · mlops

DISCOVERED

2026-04-07

PUBLISHED

2026-04-07

RELEVANCE

8 / 10

AUTHOR

SessionComplete2334