MOLA debuts multi-LoRA serving on Apple Silicon
REDDIT // 17d ago // OPEN-SOURCE RELEASE

MOLA is an alpha MLX-native multi-LoRA inference server for Apple Silicon that keeps one base model loaded and routes LoRA adapters per request. Its published benchmark on Qwen3.5-9B-MLX-4bit with 8 resident adapters shows that mixed-adapter traffic stays usable even as throughput drops under concurrent load.
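The core idea, one resident base model plus a pool of per-request adapters, can be sketched in a few lines. This is a hypothetical illustration of the routing pattern, not MOLA's actual code; the class, method names, and the 8-adapter cap mirror the benchmark setup but are otherwise invented here.

```python
class MultiLoRAServer:
    """Toy sketch: one shared base model, adapters selected per request."""

    def __init__(self, base_model, max_resident=8):
        self.base = base_model      # loaded once, shared by all requests
        self.adapters = {}          # adapter name -> loaded LoRA weights
        self.max_resident = max_resident

    def load_adapter(self, name, weights):
        # Hot-load: refuse new adapters once the resident pool is full.
        if len(self.adapters) >= self.max_resident:
            raise RuntimeError("adapter pool full; unload one first")
        self.adapters[name] = weights

    def unload_adapter(self, name):
        self.adapters.pop(name, None)

    def generate(self, prompt, model):
        # The request's "model" field doubles as the adapter selector;
        # unknown names fall through to the plain base model.
        adapter = self.adapters.get(model)
        tag = model if adapter is not None else "base"
        return f"[{tag}] completion for: {prompt}"


server = MultiLoRAServer(base_model="qwen3.5-9b-4bit")
server.load_adapter("rust-helper", weights=object())
server.load_adapter("sql-helper", weights=object())

print(server.generate("write a join", model="sql-helper"))
print(server.generate("hello", model="qwen3.5-9b-4bit"))
```

The point of the pattern is that switching specialists costs a dictionary lookup instead of a multi-gigabyte checkpoint reload.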

// ANALYSIS

The interesting part here is less the feature than the portability gap it closes: CUDA stacks already made multi-LoRA serving feel normal, and MOLA makes that workflow plausible on Apple Silicon. The project still reads like serious infrastructure in progress, not a polished runtime, but the benchmark is strong enough to justify the experiment.

  • On an Apple M5 Max 64GB, same-adapter vs mixed-adapter throughput is identical at concurrency 1 and only diverges once requests overlap, which is exactly the point where adapter routing starts to matter.
  • The mixed-workload penalty is real, about 22% at concurrency 16 and 24% at 64, but that is a reasonable trade if it avoids reloading full fine-tuned checkpoints.
  • The OpenAI-compatible API, per-request `model` selector, and runtime adapter hot-load/unload make it practical for local specialist workflows like Rust, SQL, and ops.
  • The main blockers are also clear: a local `mlx-lm` patch is still required, KV cache reuse breaks when adapters switch mid-conversation, and the whole stack is Apple Silicon-only for now.
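Because the API is OpenAI-compatible, selecting an adapter should be as simple as naming it in the request's `model` field. A minimal sketch of what such a request body would look like; the endpoint shape follows the standard chat-completions format, and the adapter name `rust-helper` is purely illustrative:

```python
import json


def chat_payload(adapter, prompt):
    # In an OpenAI-compatible API, the per-request "model" field is the
    # natural place to name the LoRA adapter to route to.
    return {
        "model": adapter,
        "messages": [{"role": "user", "content": prompt}],
    }


req = chat_payload("rust-helper", "Explain lifetimes in Rust.")
print(json.dumps(req, indent=2))
```

Any off-the-shelf OpenAI client should work unchanged, which is what makes the per-request selector practical for local specialist workflows.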
// TAGS
mola · inference · open-source · self-hosted · llm · api · benchmark

DISCOVERED

2026-03-25 (17d ago)

PUBLISHED

2026-03-25 (17d ago)

RELEVANCE

8/10

AUTHOR

No_Shift_4543