OPEN_SOURCE
REDDIT // 5h ago // INFRASTRUCTURE
MLX beats Ollama for complex MoE agent stacks
A developer reports that switching a 12-agent Qwen 35B stack on an M1 Max from Ollama to raw MLX fixed "word-salad" repetition in long-form generation and removed throughput bottlenecks. The move trades Ollama's convenience for in-process sectional generation and fine-grained sampler control.
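The "sectional generation" approach can be sketched as follows. This is a hypothetical illustration, not the developer's actual code: `generate_fn` is a stand-in for a real in-process backend call (e.g. `mlx_lm.generate`), and the idea is that each section gets its own short decoding call with fresh sampling settings, so one long generation never runs far enough to collapse into repetition.

```python
# Hypothetical sketch of sectional generation: instead of one long decode
# that can degrade into repetition, each outline section is produced by a
# separate in-process call. `generate_fn` stands in for a real backend
# such as mlx_lm.generate (assumption, not the author's code).
from typing import Callable

def generate_sections(
    generate_fn: Callable[[str, float], str],
    outline: list[str],
    temp: float = 0.7,
) -> str:
    """Generate a long-form document one section at a time.

    Each call sees the outline heading plus the text produced so far,
    so context carries over without a single runaway decoding loop.
    """
    sections: list[str] = []
    for heading in outline:
        context = "\n\n".join(sections)
        prompt = f"{context}\n\n## {heading}\n"
        sections.append(f"## {heading}\n" + generate_fn(prompt, temp))
    return "\n\n".join(sections)

# Usage with a stub backend; a real stack would call the model here.
stub = lambda prompt, temp: f"[{len(prompt)} chars of context, temp={temp}]"
doc = generate_sections(stub, ["Intro", "Results", "Conclusion"])
```

Because each section is a separate in-process call, sampler settings could also vary per section (e.g. lower temperature for code-heavy sections), which is the kind of control an HTTP wrapper makes awkward.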
// ANALYSIS
Power users are outgrowing the "Ollama wrapper" layer as agentic workflows demand deeper model control.
- Sectional generation is the MoE savior, preventing repetition collapse in Qwen 35B without the latency of HTTP round-trips.
- Raw MLX throughput is essential for "thinking" models, often doubling tokens-per-second compared to legacy local backends.
- Direct Python bindings enable sophisticated in-process orchestration like priority queues and per-call sampler adjustments.
- The tradeoff highlights a split between "it just works" users and developers building high-uptime, specialized agent stacks.
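The in-process orchestration mentioned above can be sketched with a plain priority queue. This is an illustrative assumption, not MLX API: `AgentRequest` and `run_queue` are invented names, and the stub stands in for a direct backend call (e.g. `mlx_lm.generate`) that accepts per-call sampler knobs.

```python
# Hypothetical sketch of in-process agent orchestration: requests carry a
# priority and their own sampler settings, and a single worker drains them
# from a heap. Names here are illustrative, not part of the MLX API.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class AgentRequest:
    priority: int  # lower value = served first; only field used for ordering
    prompt: str = field(compare=False)
    temp: float = field(compare=False, default=0.7)
    top_p: float = field(compare=False, default=0.95)

def run_queue(requests, generate_fn):
    """Serve requests in priority order, passing per-call sampler knobs."""
    heap = list(requests)
    heapq.heapify(heap)
    results = []
    while heap:
        req = heapq.heappop(heap)
        results.append(generate_fn(req.prompt, req.temp, req.top_p))
    return results

# Stub backend; a real stack would make a direct in-process model call here.
stub = lambda prompt, temp, top_p: f"{prompt} (temp={temp}, top_p={top_p})"
out = run_queue(
    [AgentRequest(2, "summarize"), AgentRequest(0, "plan"), AgentRequest(1, "code")],
    stub,
)
```

Because everything stays in one Python process, reprioritizing or retuning a call costs nothing, whereas an HTTP wrapper would pay a round-trip and serialization on every adjustment.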
// TAGS
llm · inference · agent · apple-silicon · mlx · ollama · qwen · open-source
DISCOVERED
5h ago
2026-04-24
PUBLISHED
8h ago
2026-04-24
RELEVANCE
8/10
AUTHOR
sleepy_quant