OPEN_SOURCE ↗
REDDIT // 3h ago // INFRASTRUCTURE
OpenCode user weighs RTX 3090 swap
The post describes a developer running OpenCode with a local Qwen3.6-35B-A3B model through llama.cpp on a Tesla P40 + T4 pair, and asks whether swapping the P40 for an RTX 3090 is worth the cost. The current setup already delivers about 25-30 tokens/sec with 256k context, so the upgrade question is really about lowering latency and improving compatibility, not adding capacity.
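A setup like the one described would typically be launched through llama.cpp's server, splitting the model across the two cards. A minimal sketch, assuming a quantized GGUF file and a VRAM-proportional split — the model filename, split ratio, and port are illustrative assumptions, not details from the post:

```shell
# Hypothetical llama.cpp launch for a P40 (24GB) + T4 (16GB) pair.
# Filename and split ratio are assumptions, not from the post.
./llama-server \
  -m models/qwen-35b-a3b-q4_k_m.gguf \
  -c 262144 \
  -ngl 99 \
  --tensor-split 24,16 \
  --port 8080
```

Here `-c 262144` requests the 256k context window, `-ngl 99` offloads all layers to GPU, and `--tensor-split 24,16` distributes weights roughly in proportion to each card's VRAM.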
// ANALYSIS
The 3090 is the right kind of upgrade if the goal is faster inference, but it is not a free win: the big gain comes from Ampere tensor hardware, higher memory bandwidth, and modern CUDA support, while the T4 and PCIe 3.0 host still limit how far the stack can scale.
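The bandwidth argument can be made concrete with rounded spec-sheet numbers (public specifications, not measurements from the post):

```python
# Published memory-bandwidth figures, GB/s (rounded spec-sheet values).
P40_BW = 346      # Tesla P40, GDDR5
T4_BW = 320       # Tesla T4, GDDR6
RTX3090_BW = 936  # RTX 3090, GDDR6X
PCIE3_X16 = 15.8  # PCIe 3.0 x16 host link, shared by all transfers

# Token generation is largely memory-bandwidth-bound, so this ratio
# roughly bounds the per-GPU speedup from swapping the P40 for a 3090.
print(f"3090 vs P40 bandwidth: {RTX3090_BW / P40_BW:.1f}x")
print(f"3090 VRAM vs PCIe 3.0 x16: {RTX3090_BW / PCIE3_X16:.0f}x")
```

The ~2.7x bandwidth jump over the P40 is where the real gain lives; the last line shows why anything forced over the PCIe 3.0 link (e.g. layers split to the T4) dilutes it.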
- The RTX 3090 brings Ampere, 3rd-gen Tensor Cores, and 24GB of GDDR6X, which is far better suited to current LLM inference than the Pascal-era P40.
- The P40 was built for INT8 throughput, but it lacks the newer acceleration paths and software headroom that make today's local coding setups smoother to run and maintain.
- For a long-context workload like Qwen3.6-35B-A3B at 256k tokens, KV-cache size and layer placement matter as much as raw VRAM, so real-world gains may be smaller than benchmarks suggest.
- In a 2U DL380 G9 with a hard budget cap, a blower-style 3090 is a pragmatic upgrade path, but the best value still comes from the fastest single GPU the chassis can physically and thermally tolerate.
- The broader signal is strong: local coding agents are now good enough that users are optimizing home inference rigs the way others tune gaming PCs.
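The KV-cache point above is easy to quantify. A sketch of the arithmetic, assuming an illustrative GQA configuration (48 layers, 4 KV heads, head dim 128 — placeholder values, not the actual Qwen config) with an f16 cache:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_tokens, bytes_per_elem=2):
    """KV-cache size: K and V each store one head_dim vector per KV head,
    per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx_tokens

# Placeholder GQA config (NOT the real model's numbers) at the post's 256k context.
size = kv_cache_bytes(n_layers=48, n_kv_heads=4, head_dim=128, ctx_tokens=256 * 1024)
print(f"{size / 2**30:.0f} GiB")  # 24 GiB at f16; a q8_0 KV cache would roughly halve it
```

Even under modest GQA assumptions, a 256k cache eats a large share of a 24GB card, which is why quantized KV caches and careful layer placement matter as much as the GPU swap itself.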
// TAGS
opencode · llama-cpp · qwen · ai-coding · inference · gpu · self-hosted · llm
DISCOVERED
3h ago
2026-04-25
PUBLISHED
4h ago
2026-04-24
RELEVANCE
7/10
AUTHOR
RoroTitiFR