Cascaded Local Agent splits routing from synthesis
This is a personal local-LLM agent project that splits inference across two devices to keep the main GPU free for final synthesis. A Lenovo Legion Go runs the lightweight routing, embeddings, semantic search, and knowledge-graph extraction models, while an RTX 4060 laptop invokes Qwen 3.5 9B only once per query to produce the final answer. The post claims this architecture cuts a three-step research flow from roughly two minutes to about 35 seconds, while also reducing fan noise and thermal load.
The core idea is solid: keep cheap, repetitive control-flow on a small model and reserve the bigger model for the one step that actually benefits from higher-quality synthesis.
- The split is pragmatic, not flashy: ReAct dispatch is mostly classification and pattern matching, so it can run well on a small edge model.
- Offloading embeddings and fact extraction to the handheld device makes the laptop’s discrete GPU available only when it matters.
- The reported speedup is plausible if the old setup was serializing every step through the 9B model.
- The thermal benefit is as important as latency here; a cold, uncontended GPU is a better user experience than raw peak throughput.
- The next obvious experiment is moving more of the reasoning loop to the small device and comparing quality/latency against a larger MoE option.
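The cascade described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the function names (`route`, `run_tool`, `synthesize`) and the keyword-based dispatch rules are hypothetical stand-ins, and the model calls are stubbed. The point is the shape of the control flow: cheap classification and tool steps run first (on the small device), and the large model is invoked exactly once at the end.

```python
def route(query: str) -> list[str]:
    """ReAct-style dispatch as cheap pattern matching.

    In the real setup this would run on the small edge model; here
    it is simulated with hypothetical keyword rules.
    """
    steps = []
    if "compare" in query or "vs" in query:
        steps.append("search")
    steps.extend(["embed", "retrieve"])  # always gather context
    return steps


def run_tool(step: str, query: str) -> str:
    # Stub for embeddings / semantic search / KG extraction,
    # i.e. the work offloaded to the handheld device.
    return f"{step}({query})"


def synthesize(query: str, evidence: list[str]) -> str:
    # Stub for the single large-model (Qwen 3.5 9B) invocation.
    return f"answer to {query!r} from {len(evidence)} evidence items"


def answer(query: str) -> str:
    # All cheap steps run first; the big model is called exactly once.
    evidence = [run_tool(s, query) for s in route(query)]
    return synthesize(query, evidence)
```

Because only `synthesize` ever touches the discrete GPU, the GPU stays idle (and cool) for the entire routing and retrieval phase, which is the latency and thermal win the post reports.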
DISCOVERED: 2026-04-09
PUBLISHED: 2026-04-09
AUTHOR: lightcaptainguy3364