Dual 3090 PC weighs local agents
OPEN_SOURCE
REDDIT // 7h ago // INFRASTRUCTURE

A LocalLLaMA user is planning a fully offline agentic coding setup on dual RTX 3090s with 128GB RAM, weighing Qwen models, vLLM, llama.cpp, coding agents, and speech-to-text. The thread reflects a broader shift from “can I run a local LLM?” to “can I run a useful private coding agent stack?”

// ANALYSIS

The practical answer is less about the biggest model and more about keeping latency, context, and tool-calling reliable enough for daily coding.

  • Dual 3090s give 48GB of pooled VRAM, which is strongest for 30B-35B-class coding models at modest quantization; 70B-class models fit only at heavier quantization, with more painful speed/quality tradeoffs
  • vLLM is the better default when throughput, batching, long context, and OpenAI-compatible serving matter; llama.cpp still wins for GGUF simplicity, hybrid CPU/GPU offload, and quick experimentation
  • Qwen3.5/Qwen3.6 27B-35B class models are the likely sweet spot for local coding agents, while 100B+ “orchestrator” setups risk becoming slow demos unless the workflow tolerates latency
  • Agent harness choice matters as much as model choice: OpenCode, Cline-style flows, Claude Code-compatible adapters, and local OpenAI APIs are where the stack either becomes productive or collapses into prompt fiddling
  • Adding Whisper or whisper.cpp for local STT makes sense, but it is secondary to nailing inference stability, context length, and tool-call correctness first
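The 48GB budget above is easy to sanity-check: a dense model's weight footprint is roughly parameter count times bits per weight, before KV cache and runtime overhead. A back-of-envelope sketch (the model sizes and quantization levels are illustrative, not benchmarks):

```python
# Rough VRAM estimate for dense model weights at a given quantization.
# Back-of-envelope only: real usage adds KV cache, activations, and
# runtime overhead, so leave headroom below the 48 GB ceiling.
def weight_vram_gb(params_b: float, bits_per_weight: float) -> float:
    # params (billions) * bits / 8 bits-per-byte = gigabytes of weights
    return params_b * bits_per_weight / 8

for name, params in [("32B coder", 32), ("70B", 70)]:
    for bits in (16, 8, 4):
        gb = weight_vram_gb(params, bits)
        verdict = "fits" if gb < 48 else "does not fit"
        print(f"{name} @ {bits}-bit: ~{gb:.0f} GB weights -> {verdict} in 48 GB")
```

This is why the thread's consensus lands where it does: a 32B model at 8-bit leaves room for context, while 70B only squeezes in at 4-bit with little headroom.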
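Tool-call correctness is where local agent stacks most often break down: smaller models emit malformed JSON arguments or invent tool names. A minimal validation guard in the harness, sketched here with a hypothetical `read_file` tool schema, catches both failure modes before anything executes:

```python
import json

# Hypothetical tool registry: each tool lists the argument keys it requires.
TOOLS = {"read_file": {"required": ["path"]}}

def validate_tool_call(name: str, arguments: str):
    """Return parsed arguments if the tool call is usable, else None."""
    if name not in TOOLS:
        return None  # hallucinated tool name: reject rather than guess
    try:
        args = json.loads(arguments)
    except json.JSONDecodeError:
        return None  # malformed JSON, a common local-model failure mode
    if not all(key in args for key in TOOLS[name]["required"]):
        return None  # parses, but required arguments are missing
    return args

print(validate_tool_call("read_file", '{"path": "main.py"}'))
print(validate_tool_call("read_file", "{broken"))
```

Rejected calls can be fed back to the model as an error message, which is usually enough to get a corrected retry instead of a silently wrong action.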
// TAGS
local-agentic-coding-workstation · ai-coding · agent · inference · gpu · self-hosted · open-weights · speech

DISCOVERED

7h ago

2026-04-21

PUBLISHED

11h ago

2026-04-21

RELEVANCE

7/10

AUTHOR

youcloudsofdoom