OPEN_SOURCE
REDDIT · INFRASTRUCTURE
Tesla P40 tempts budget LLM builders
Users say the Tesla P40 can still handle modern Qwen, Mixtral, and Llama 30B-class models, but only with aggressive quantization and a lot of patience once context grows. It can work for chat and light coding, but it remains a cheap route to local LLM inference rather than a polished everyday rig.
// ANALYSIS
The P40 is a VRAM bargain, not a speed bargain. It can make local 30B tinkering viable, but the minute you ask for long context or smoother coding loops, Pascal-era bottlenecks show up fast.
- Community reports range from about 8-9 tokens/sec on a 30B GPTQ model to ~18 tokens/sec on a lighter 30B quant; a newer 32B coding benchmark put a single P40 around 10 tokens/sec, so the practical ceiling depends heavily on quantization and loader choice.
- The real pain point is prompt processing: one user saw 13B performance fall from ~22 tokens/sec with almost no context to 2-4 tokens/sec at 7-8k context, and another noted that 30B runs stay around 10-20 tokens/sec but prompt pre-processing gets much slower.
- MoE models are the loophole: Qwen3 30B-A3B only activates about 3B of its parameters per token, so it can feel dramatically faster than a dense 30B if the whole model fits in VRAM.
- That makes the card practical for chat and light coding, but it gets much less comfortable once you need long back-and-forths or heavy context.
- The hidden cost is mechanical: P40s are passive, run hot, and need proper cooling, power adapters, and a sane driver stack; if you want a single-card, turn-key experience, a 3090 still wins.
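The quantization and MoE trade-offs above come down to simple arithmetic: weights at the quantized bit-width must fit in the P40's 24 GB, while decode speed tracks only the parameters touched per token. A rough sizing sketch (the `overhead_gb` allowance for KV cache and buffers is an assumption, not a measured figure):

```python
# Back-of-envelope VRAM sizing for a quantized model on a 24 GB Tesla P40.
# Illustrative estimates only; real loaders differ in overhead and layout.

def model_vram_gb(params_b: float, bits_per_weight: float,
                  overhead_gb: float = 1.5) -> float:
    """Approximate VRAM: weights at the quantized bit-width plus a flat
    allowance (assumed) for KV cache and runtime buffers."""
    weights_gb = params_b * bits_per_weight / 8  # 1e9 params * bits -> GB
    return weights_gb + overhead_gb

P40_VRAM_GB = 24

# A dense 30B model fits at 4-bit but not at 8-bit.
print(f"30B @ 4-bit: {model_vram_gb(30, 4):.1f} GB")   # ~16.5 GB, fits
print(f"30B @ 8-bit: {model_vram_gb(30, 8):.1f} GB")   # ~31.5 GB, does not

# MoE: Qwen3 30B-A3B stores ~30B parameters (VRAM cost) but activates
# only ~3B per token, so decode reads ~10% of the weights per step.
full_b, active_b = 30, 3
print(f"MoE reads ~{active_b / full_b:.0%} of weights per token")
```

This is why the card's 24 GB matters more than its compute for fitting 30B-class quants, and why an MoE model can decode much faster than a dense one of the same stored size.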
// TAGS
gpu · inference · self-hosted · llm · pricing · benchmark · tesla-p40
DISCOVERED
2026-03-25
PUBLISHED
2026-03-25
RELEVANCE
7/10
AUTHOR
ScarredPinguin