Local GPU Rig Eyes SGLang Stack
REDDIT · 22d ago · INFRASTRUCTURE

The poster is setting up a dual RTX PRO 6000 machine for local inference, with OpenCode as the coding-agent harness and Qwen3.5 variants as the main model sweep. They also want SGLang for radix cache reuse and continuous batching, plus a local chat UI and a separate image-gen setup.

// ANALYSIS

The instinct is solid: when you’re running lots of similar agent sessions, the serving layer matters almost as much as the model choice. SGLang is a sensible center of gravity here, but the stack will only stay pleasant if inference, orchestration, and UI remain decoupled.
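Keeping the layers decoupled can start with simple process-level isolation: pin each service to its own GPU so the image-gen side can never evict the LLM's KV cache. A sketch only; the model path, ports, and GPU split are assumptions, not details from the post:

```shell
# Pin SGLang (LLM serving) to GPU 0. Model path and port are placeholders.
CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server \
  --model-path Qwen/Qwen3-32B --port 30000 &

# Pin ComfyUI (image gen) to GPU 1 so FLUX jobs run in a separate process
# with a separate CUDA context. Path assumes a local ComfyUI checkout.
CUDA_VISIBLE_DEVICES=1 python ComfyUI/main.py --port 8188 &
```

With this split, a heavy image job can slow GPU 1 to a crawl without touching the serving process on GPU 0.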

  • RadixAttention only pays off when prompts share meaningful prefixes, so the cache wins will be best for repeated system prompts, tool schemas, and agent scaffolding.
  • Continuous batching improves aggregate throughput, not single-request latency, so model selection still needs to balance quality against decode speed.
  • OpenCode is a good terminal harness, but a lightweight model router or OpenAI-compatible endpoint registry will age better than shell-script sprawl.
  • Keep FLUX.2 or other image workloads isolated if you can; image generation can steal VRAM and make the LLM side feel flaky.
  • A dedicated chat UI is worth separate treatment from the coding harness, because one-off chat and agent workflows tend to want different defaults and histories.
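The first bullet is worth quantifying: prefix-cache wins depend entirely on how much of each prompt is shared. Below is a toy model of radix-style reuse, where a new request hits the cache for its longest token-prefix match against any previously served request. Whitespace splitting stands in for a real tokenizer, and a linear scan stands in for SGLang's actual radix tree; all names and the example prompts are hypothetical:

```python
def longest_cached_prefix(tokens, cache):
    """Length of the longest shared prefix between `tokens` and any cached sequence."""
    best = 0
    for seen in cache:
        n = 0
        for a, b in zip(tokens, seen):
            if a != b:
                break
            n += 1
        best = max(best, n)
    return best

def reuse_fraction(prompts):
    """Fraction of all prompt tokens served from the (toy) prefix cache."""
    cache, hit, total = [], 0, 0
    for p in prompts:
        toks = p.split()  # whitespace stands in for a real tokenizer
        hit += longest_cached_prefix(toks, cache)
        total += len(toks)
        cache.append(toks)
    return hit / total

# Agent sessions that share a system prompt and tool schema (hypothetical):
system = "You are a coding agent. Tools: read_file write_file run_tests ."
prompts = [
    system + " Task: fix the failing unit test in parser.py",
    system + " Task: add type hints to utils.py",
    system + " Task: refactor the CLI entry point",
]
print(round(reuse_fraction(prompts), 2))  # → 0.44
```

Even in this tiny example, nearly half the tokens are cache hits because the scaffolding repeats; with long tool schemas and short task suffixes, the reuse fraction climbs much higher.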
// TAGS
sglang · opencode · comfyui · flux-2 · qwen3 · inference · gpu · agent

DISCOVERED

2026-03-20

PUBLISHED

2026-03-20

RELEVANCE

7/10

AUTHOR

ipcoffeepot