OPEN_SOURCE
REDDIT // 9d ago · INFRASTRUCTURE
Gemma 4 32B Trips TensorRT-LLM Setup
A Reddit user is asking for help getting Gemma 4 32B running on an RTX 6000 Pro with TensorRT-LLM after failed weight-conversion and auto-deployment attempts. The thread also compares vLLM and Modular MAX as serving options; the author later notes that Modular MAX eventually worked.
// ANALYSIS
This is less a launch story than a real-world deployment check: the fastest inference stack on paper is often the one with the sharpest setup edge cases.
- TensorRT-LLM still looks powerful, but model-conversion and deployment friction can erase the theoretical gains for newer models like Gemma 4 32B
- vLLM remains the practical baseline because it is easier to get running, even when it is not the absolute fastest
- Modular MAX eventually working in the thread is a reminder that newer serving stacks are still proving themselves in day-to-day workflows
- The RTX 6000 Pro angle matters: high-end hardware does not eliminate compatibility and tooling issues
- This is a useful signal for anyone benchmarking serving engines, because “works eventually” is not the same as “works reliably”
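The vLLM-as-baseline point above can be made concrete with a minimal sketch. This assembles a `vllm serve` command (vLLM's OpenAI-compatible server entry point); the Hugging Face model id for Gemma 4 32B is a hypothetical placeholder, not confirmed by the thread:

```python
# Hypothetical model id — the exact Hugging Face name is an assumption.
MODEL_ID = "google/gemma-4-32b-it"

def vllm_serve_cmd(model_id: str, port: int = 8000) -> list[str]:
    """Assemble a `vllm serve` invocation for an OpenAI-compatible server."""
    return [
        "vllm", "serve", model_id,
        "--port", str(port),
        # Cap the context length so the KV cache fits a single-GPU setup.
        "--max-model-len", "8192",
    ]

print(" ".join(vllm_serve_cmd(MODEL_ID)))
```

Compared with TensorRT-LLM, there is no separate checkpoint-conversion or engine-build step here, which is exactly the setup friction the thread is about.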
// TAGS
gemma-4 · tensorrt-llm · vllm · modular-max · inference · gpu
DISCOVERED
2026-04-03 (9d ago)
PUBLISHED
2026-04-03 (9d ago)
RELEVANCE
8/10
AUTHOR
kev_11_1