Local Legal Stack Goes MoE

// 4h ago · INFRASTRUCTURE

A lawyer updated a self-hosted legal drafting system built around 12 V100s, a second GPU box, and llama.cpp after moving away from vLLM for the models he actually wants to run. The stack now routes drafting, reasoning, review, and cite verification across pinned local models to keep hallucinations out of final documents.
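As a rough illustration of what routing stages to pinned local models could look like, here is a minimal sketch. The stage names come from the summary above; the model identifiers and the `high_stakes` flag are hypothetical placeholders, not the author's actual configuration.

```python
# Hypothetical stage-to-model pinning. Model names are placeholders,
# not the models the author actually runs.
PINNED_MODELS = {
    "draft": "moe-35b-default",
    "reason": "moe-35b-default",
    "review": "moe-35b-reviewer",
    "verify": "moe-35b-verifier",
}

# Assumed name for a larger model reserved for high-stakes drafting.
HIGH_STAKES_MODEL = "moe-122b-high-stakes"


def route(stage: str, high_stakes: bool = False) -> str:
    """Return the pinned model name for a pipeline stage."""
    if stage not in PINNED_MODELS:
        raise ValueError(f"unknown stage: {stage!r}")
    # Only drafting is escalated to the larger tier in this sketch.
    if high_stakes and stage == "draft":
        return HIGH_STAKES_MODEL
    return PINNED_MODELS[stage]
```

Pinning each stage to a fixed model keeps behavior reproducible: a change in output can be traced to a prompt or data change rather than to a silently swapped model.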

// ANALYSIS

The real story is not the hardware flex; it’s that MoE finally made the local setup usable for real drafting work, while dense models on Volta stayed too slow to justify their footprint.

  • llama.cpp won because the target workload is MoE GGUFs on V100, and the relevant bottleneck is kernel support plus memory behavior, not just raw GPU count
  • The throughput gap is stark: the author reports MoE models in the 50-113 tok/s range, while dense 27B-32B models land below the practical floor
  • The pipeline is doing the important work: a router, a gate model, an adversarial reviewer, and a verifier for cites/dates/Bates numbers matter more than any single model choice
  • The self-poisoning bug is the cautionary lesson here; if your RAG context includes prior outputs, the system will confidently ground on its own slop
  • Keeping the 122B model around is defensible as a high-stakes quality tier, but the 35B MoE looks like the sensible default for routine work
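
The self-poisoning and verification points above can be sketched together. This is a minimal example under stated assumptions, not the author's implementation: the `Chunk` type, the `origin` label, and the Bates number pattern (an uppercase prefix followed by four or more digits) are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    origin: str  # assumed labels: "source" (ingested record) or "generated" (model output)


def retrievable(chunks):
    """Keep only chunks from ingested source material, so prior model
    outputs can never be retrieved as grounding context."""
    return [c for c in chunks if c.origin == "source"]


# Assumed Bates format, e.g. "ABC-000123"; real productions vary.
BATES = re.compile(r"\b[A-Z]{2,}[-_]?\d{4,}\b")


def unsupported_bates(draft: str, chunks) -> set:
    """Return Bates numbers cited in the draft that appear in no source chunk."""
    known = set()
    for chunk in retrievable(chunks):
        known.update(BATES.findall(chunk.text))
    return set(BATES.findall(draft)) - known
```

For example, a draft citing a number that exists only in an earlier generated draft is flagged:

```python
chunks = [
    Chunk("Invoice produced at ABC-000123.", "source"),
    Chunk("Prior draft cites ABC-000999.", "generated"),
]
print(unsupported_bates("See ABC-000123 and ABC-000999.", chunks))
```

Tagging provenance at ingestion time and filtering before retrieval is one way to make the failure mode structurally impossible instead of relying on the model to notice it.
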
// TAGS
llm, moe, long-context, inference, gpu, rag, self-hosted, local-legal-drafting-stack

DISCOVERED: 4h ago (2026-05-26)

PUBLISHED: 14h ago (2026-05-25)

RELEVANCE: 8/10

AUTHOR: TumbleweedNew6515