Local Legal Stack Goes MoE
A lawyer updated his self-hosted legal drafting system, built around 12 V100s, a second GPU box, and llama.cpp, after moving away from vLLM for the models he actually wants to run. The stack now routes drafting, reasoning, review, and cite verification across pinned local models to keep hallucinations out of final documents.
The real story is not the hardware flex; it’s that MoE finally made the local setup usable for real drafting work, while dense models on Volta stayed too slow to justify their footprint.
- llama.cpp won because the target workload is MoE GGUFs on V100, where the relevant bottleneck is kernel support and memory behavior, not just raw GPU count
- The throughput gap is stark: the author reports MoE models in the 50 to 113 tok/s range, while dense 27B-32B models fall below a practical speed floor
- The pipeline does the important work: a router, a gate model, an adversarial reviewer, and a verifier for cites, dates, and Bates numbers matter more than any single model choice
- The self-poisoning bug is the cautionary lesson here: if your RAG context includes prior outputs, the system will confidently ground on its own slop
- Keeping the 122B model around is defensible as a high-stakes quality tier, but the 35B MoE looks like the sensible default for routine work
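The self-poisoning point can be illustrated with a minimal sketch. It is not the author's implementation; it only assumes that each retrieved chunk carries a provenance tag and that anything tagged as a prior model output is excluded before the context is built. The `Chunk` type, the `source` tag values, and `build_context` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str  # hypothetical provenance tag, e.g. "exhibit" or "model_output"


# Assumption: only these provenance tags count as trusted ground truth.
TRUSTED_SOURCES = frozenset({"filing", "statute", "case_law", "exhibit"})


def build_context(chunks, trusted=TRUSTED_SOURCES):
    """Keep chunks with trusted provenance; drop prior model outputs and unknowns."""
    return [c for c in chunks if c.source in trusted]


# Placeholder retrieval results; the texts are illustrative, not real documents.
retrieved = [
    Chunk("Placeholder text from an exhibit.", "exhibit"),
    Chunk("Placeholder draft produced by an earlier run.", "model_output"),
    Chunk("Placeholder text of a statute.", "statute"),
]

context = build_context(retrieved)
print([c.source for c in context])
```

An allowlist is used here rather than a blocklist so that an untagged or mistagged chunk is dropped by default instead of silently entering the context.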
DISCOVERED
2026-05-26
PUBLISHED
2026-05-25
AUTHOR
TumbleweedNew6515