OPEN SOURCE
REDDIT · 17d ago · OPEN-SOURCE RELEASE
MOLA debuts multi-LoRA serving on Apple Silicon
MOLA is an alpha MLX-native multi-LoRA inference server for Apple Silicon that keeps one base model loaded and routes adapters per request. Its published benchmark on Qwen3.5-9B-MLX-4bit with 8 resident adapters shows mixed-adapter traffic stays usable, even as throughput drops under load.
// ANALYSIS
The interesting part here is less the feature than the portability gap it closes: CUDA stacks already made multi-LoRA serving feel normal, and MOLA makes that workflow plausible on Apple Silicon. The project still reads like serious infrastructure in progress, not a polished runtime, but the benchmark is strong enough to justify the experiment.
- On an Apple M5 Max 64GB, same-adapter and mixed-adapter throughput are identical at concurrency 1 and only diverge once requests overlap, which is exactly the point where adapter routing starts to matter.
- The mixed-workload penalty is real, about 22% at concurrency 16 and 24% at 64, but that is a reasonable trade if it avoids reloading full fine-tuned checkpoints.
- The OpenAI-compatible API, per-request `model` selector, and runtime adapter hot-load/unload make it practical for local specialist workflows like Rust, SQL, and ops.
- The main blockers are also clear: a local `mlx-lm` patch is still required, KV cache reuse breaks when adapters switch mid-conversation, and the whole stack is Apple Silicon-only for now.
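Because the API is OpenAI-compatible with adapter selection riding on the standard `model` field, no special client is needed. A minimal sketch of what a per-request adapter call would look like, assuming a local server at `localhost:8080` and a hypothetical resident adapter named `rust-expert` (the endpoint path, port, and adapter name are illustrative, not taken from the project):

```python
import json
import urllib.request

# OpenAI-compatible chat-completions payload. The "model" field names a
# resident LoRA adapter rather than the base model, so switching
# specialists is a per-request decision with no checkpoint reload.
payload = {
    "model": "rust-expert",  # hypothetical adapter name
    "messages": [
        {"role": "user", "content": "Explain lifetimes in Rust."},
    ],
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",  # assumed local endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With a running server: response = urllib.request.urlopen(req)
```

The same pattern works from any OpenAI SDK by pointing its base URL at the local server, which is what makes the "one base model, many specialists" workflow cheap to adopt.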
// TAGS
mola · inference · open-source · self-hosted · llm · api · benchmark
DISCOVERED
2026-03-25
PUBLISHED
2026-03-25
RELEVANCE
8/10
AUTHOR
No_Shift_4543