LocalLLaMA user seeks 5090-ready stack
OPEN_SOURCE
REDDIT // 14d ago // INFRASTRUCTURE


A LocalLLaMA user with a 32 GB RTX 5090, Windows 11, and a carefully split ComfyUI setup is asking for genuinely new open-source models across image, video, audio, multimodal, and LLM workflows. The biggest gap is serving: they want a non-GGUF local LLM runtime that can better saturate the GPU, with EXL2, AWQ, vLLM, TabbyAPI, and TensorRT-LLM all in play.
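Whether a given model even fits on the 32 GB card is the first filter before picking a runtime. A common rule of thumb is weights ≈ parameters × bits-per-weight / 8, plus headroom for KV cache and activations. A minimal sketch (the 4 GB overhead allowance is an assumption, not from the post):

```python
def fits_in_vram(params_b: float, bits_per_weight: float,
                 vram_gb: float = 32.0, overhead_gb: float = 4.0) -> bool:
    """Rough fit check: weight memory in GB is params (billions) * bpw / 8,
    plus a fixed allowance for KV cache and activations (overhead_gb is a
    guessed constant; real usage depends on context length and batch size)."""
    weight_gb = params_b * bits_per_weight / 8
    return weight_gb + overhead_gb <= vram_gb

# 70B at 4.0 bpw needs ~35 GB for weights alone: too big for a 32 GB 5090.
# 32B at 6.0 bpw needs ~24 GB: fits, with room left for KV cache.
```

By this estimate, a 5090 tops out around dense 30B-class models at comfortable quantization levels, which matches why EXL2/EXL3-style variable-bitrate formats are attractive here.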

// ANALYSIS

This reads like a high-signal power-user post: the hardware is already strong enough that the bottleneck is no longer raw compute; it's choosing the right serving layer and avoiding redundant model churn.

  • `TabbyAPI` (backed by `ExLlamaV2`/`ExLlamaV3`) is the cleanest Windows-friendly answer for a 5090: it exposes an OpenAI-style API and supports `EXL2`/`EXL3`, `GPTQ`, and `FP16` without forcing GGUF.
  • If they are willing to leave native Windows, `SGLang` is the more modern high-performance serving layer, while `vLLM` remains strong but is still not Windows-native.
  • `TensorRT-LLM` is the true "squeeze every last token/sec" play on NVIDIA hardware, but it is the most setup-heavy route and feels more like a Linux/WSL performance project than a casual desktop runtime.
  • For fresh models, the most useful adds are `Qwen3` and `Qwen3-VL` for LLM/multimodal work, `HiDream-I1` and `Qwen-Image-Edit` for image generation/editing, `UniVerse-1` for audio-video generation, and `YuE` plus `Chatterbox TTS` for music and voice.
  • Their four-environment split is already the right architecture, so the best end-to-end pipeline is an OpenAI-compatible LLM backend feeding specialized image, video, and audio workers instead of one monolithic app.
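The recommended architecture in the last bullet can be sketched as a thin router: one OpenAI-compatible chat endpoint for the LLM, with image/audio requests dispatched to their own workers. The ports and the audio endpoint below are illustrative assumptions, not details from the post:

```python
# Hypothetical backend registry: one OpenAI-compatible LLM endpoint
# (e.g. TabbyAPI's default) feeding specialized workers. URLs/ports
# are assumptions for illustration.
BACKENDS = {
    "llm":   "http://localhost:5000/v1/chat/completions",  # e.g. TabbyAPI
    "image": "http://localhost:8188/prompt",               # e.g. ComfyUI
    "audio": "http://localhost:7851/api/tts",              # hypothetical TTS worker
}

def chat_payload(model: str, prompt: str, max_tokens: int = 512) -> dict:
    """Build a standard OpenAI-style chat-completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def route(task: str) -> str:
    """Map a task type to its dedicated worker endpoint."""
    try:
        return BACKENDS[task]
    except KeyError:
        raise ValueError(f"no backend registered for task {task!r}")
```

Because every LLM candidate mentioned (TabbyAPI, vLLM, SGLang, TensorRT-LLM via its OpenAI server) can expose the same `/v1/chat/completions` shape, the serving layer can be swapped later without touching the image/video/audio workers.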
// TAGS
local-llama · llm · inference · gpu · open-source · self-hosted · multimodal · image-gen

DISCOVERED

14d ago

2026-03-28

PUBLISHED

14d ago

2026-03-28

RELEVANCE

8 / 10

AUTHOR

Elegur