Qwen3.5 Small hits Mac throughput wall
A LocalLLaMA user says Qwen3.5 Small's 9B 4-bit MLX quant is excellent for OCR, but running three or four copies on an M4 Max with 36GB RAM does not improve throughput. They are looking for a real batching or parallelization path on macOS after mlx-vlm's batch_generate failed to help.
This is a throughput problem, not a RAM problem. On Apple Silicon, extra model copies usually compete for the same unified memory and GPU time unless the serving layer batches requests. mlx-lm already points developers toward batch generation, prompt caching, and distributed inference, which suggests the right fix is workload batching rather than model duplication. The community has already asked for batch and parallel processing in mlx_lm.server, so this bottleneck is a known gap in the stack. mlx-vlm's docs focus on one-model-at-a-time serving with dynamic loading, which makes it flexible but not a high-concurrency engine. For OCR pipelines, grouping documents by prompt shape and minimizing repeated context will usually beat spinning up more standalone processes.
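The grouping strategy described above can be sketched independently of any serving stack. The code below is a minimal micro-batching sketch, not mlx-lm or mlx-vlm API: it buckets pending OCR jobs by a caller-supplied prompt-shape key, then drains each bucket in fixed-size chunks through a single `submit_batch` callback (a hypothetical hook standing in for whatever batch-capable backend is available), so the shared prompt prefix inside a chunk can be cached and reused instead of re-encoded per job.

```python
from collections import defaultdict

def group_by_prompt_shape(jobs, key):
    """Bucket OCR jobs so each bucket shares one prompt template/shape."""
    buckets = defaultdict(list)
    for job in jobs:
        buckets[key(job)].append(job)
    return buckets

def run_batched(jobs, key, submit_batch, max_batch=8):
    """Drain each bucket in micro-batches instead of spawning one model copy per job.

    submit_batch(shape, chunk) is a placeholder for a batch-capable backend
    call; it must return one output string per job in the chunk.
    """
    results = {}
    for shape, bucket in group_by_prompt_shape(jobs, key).items():
        for i in range(0, len(bucket), max_batch):
            chunk = bucket[i:i + max_batch]
            # One backend call per micro-batch; because every job in the
            # chunk shares a prompt shape, a prompt cache can be reused
            # across the whole chunk rather than rebuilt per document.
            for job, text in zip(chunk, submit_batch(shape, chunk)):
                results[job["id"]] = text
    return results
```

This is the batching-over-duplication trade in miniature: the same unified memory holds one model, and concurrency comes from grouping requests rather than from extra processes contending for the GPU.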
DISCOVERED
2026-03-28
PUBLISHED
2026-03-28
AUTHOR
ZhopaRazzi