Qwen3.5 Small hits Mac throughput wall
A LocalLLaMA user says Qwen3.5 Small's 9B 4-bit MLX quant is excellent for OCR, but running three or four copies on an M4 Max with 36GB RAM does not improve throughput. They are looking for a real batching or parallelization path on macOS after mlx-vlm's batch_generate failed to help.
This is a throughput problem, not a RAM problem. On Apple Silicon, extra model copies usually compete for the same unified memory and GPU time unless the serving layer batches requests. mlx-lm already points developers toward batch generation, prompt caching, and distributed inference, which suggests the right fix is workload batching rather than model duplication. The community has already asked for batch and parallel processing in mlx_lm.server, so this bottleneck is a known gap in the stack. mlx-vlm's docs focus on one-model-at-a-time serving with dynamic loading, which makes it flexible but not a high-concurrency engine. For OCR pipelines, grouping documents by prompt shape and minimizing repeated context will usually beat spinning up more standalone processes.
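The grouping strategy described above can be sketched independently of any serving stack. The code below is a minimal micro-batching sketch, not mlx-lm or mlx-vlm API: it buckets pending OCR jobs by a caller-supplied prompt-shape key, then drains each bucket in fixed-size chunks through a single `submit_batch` callback (a hypothetical hook standing in for whatever batch-capable backend is available), so the shared prompt prefix inside a chunk can be cached and reused instead of re-encoded per job.

```python
from collections import defaultdict

def group_by_prompt_shape(jobs, key):
    """Bucket OCR jobs so each bucket shares one prompt template/shape."""
    buckets = defaultdict(list)
    for job in jobs:
        buckets[key(job)].append(job)
    return buckets

def run_batched(jobs, key, submit_batch, max_batch=8):
    """Drain each bucket in micro-batches instead of spawning one model copy per job.

    submit_batch(shape, chunk) is a placeholder for a batch-capable backend
    call; it must return one output string per job in the chunk.
    """
    results = {}
    for shape, bucket in group_by_prompt_shape(jobs, key).items():
        for i in range(0, len(bucket), max_batch):
            chunk = bucket[i:i + max_batch]
            # One backend call per micro-batch; because every job in the
            # chunk shares a prompt shape, a prompt cache can be reused
            # across the whole chunk rather than rebuilt per document.
            for job, text in zip(chunk, submit_batch(shape, chunk)):
                results[job["id"]] = text
    return results
```

This is the batching-over-duplication trade in miniature: the same unified memory holds one model, and concurrency comes from grouping requests rather than from extra processes contending for the GPU.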
DISCOVERED
2026-03-28
PUBLISHED
2026-03-28
AUTHOR
ZhopaRazzi