OPEN_SOURCE ↗
REDDIT · 22d ago · INFRASTRUCTURE
Local GPU Rig Eyes SGLang Stack
The poster is setting up a dual RTX PRO 6000 machine for local inference, with OpenCode as the coding-agent harness and Qwen3.5 variants as the main model sweep. They also want SGLang for radix cache reuse and continuous batching, plus a local chat UI and a separate image-gen setup.
// ANALYSIS
The instinct is solid: when you’re running lots of similar agent sessions, the serving layer matters almost as much as the model choice. SGLang is a sensible center of gravity here, but the stack will only stay pleasant if inference, orchestration, and UI remain decoupled.
- RadixAttention only pays off when prompts share meaningful prefixes, so the cache wins will be best for repeated system prompts, tool schemas, and agent scaffolding.
- Continuous batching improves throughput, not single-request latency, so model selection still needs to balance quality with decode speed.
- OpenCode is a good terminal harness, but a lightweight model router or OpenAI-compatible endpoint registry will age better than shell-script sprawl.
- Keep FLUX.2 or other image workloads isolated if you can; image generation can steal VRAM and make the LLM side feel flaky.
- A dedicated chat UI is worth separate treatment from the coding harness, because one-off chat and agent workflows tend to want different defaults and histories.
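The router point is worth making concrete. A minimal sketch of an endpoint registry, assuming hypothetical local SGLang servers behind OpenAI-compatible `/v1` routes (the model names, backend labels, and ports below are illustrative, not real deployments):

```python
# Minimal model-router sketch: a registry of OpenAI-compatible backends.
# Everything named here (ports, model prefixes, labels) is illustrative.

from dataclasses import dataclass

@dataclass(frozen=True)
class Backend:
    name: str       # label for logs and debugging
    base_url: str   # OpenAI-compatible endpoint served by, e.g., SGLang

# Registry keyed by model-name prefix; agents just ask for a model string
# and the router decides which local server should handle it.
REGISTRY = {
    "qwen3.5": Backend("sglang-main", "http://localhost:30000/v1"),
    "qwen3.5-coder": Backend("sglang-coder", "http://localhost:30001/v1"),
}

def resolve(model: str) -> Backend:
    """Pick the backend with the longest matching model-name prefix."""
    matches = [prefix for prefix in REGISTRY if model.startswith(prefix)]
    if not matches:
        raise KeyError(f"no backend registered for model {model!r}")
    return REGISTRY[max(matches, key=len)]
```

With something like this, OpenCode and the chat UI both point at one router URL, and swapping in a new quant or fine-tune becomes a one-line registry edit instead of a shell-script hunt.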
// TAGS
sglang · opencode · comfyui · flux-2 · qwen3 · inference · gpu · agent
DISCOVERED
2026-03-20
PUBLISHED
2026-03-20
RELEVANCE
7/10
AUTHOR
ipcoffeepot