TurboQuant fits 70B Llama on Mac

// 90d agoRESEARCH PAPER

TurboQuant fits 70B Llama on Mac

TurboQuant for Llama 3.1 70B is a new research/code release showing Llama 3.1 70B running at 128K context on a 64GB Apple Silicon Mac by using fused int4 KV-cache attention in Metal. The project reports 40GB to 12.5GB KV-cache reduction and a 48x attention-kernel speedup at 128K versus dequantize-then-attend.

// ANALYSIS

This is a sharp local-inference result because it attacks the real long-context bottleneck: KV memory, not just model weights.

–The headline win is practical: a 4-bit 70B model plus 128K context reportedly drops from about 79GB to 53.6GB, making it fit on 64GB Macs
–The fused Metal kernel matters because it computes attention directly over compressed int4 keys/values instead of rebuilding large temporary matrices
–The 48x figure is attention-kernel speedup, not full end-to-end model speedup; the repo still reports around 6 tok/s decode
–The paper’s negative results are useful too: QJL and PolarQuant ideas did not cleanly transfer to this 70B Apple Silicon setup, while simpler int4 won in practice
–This is early, low-star research code, but it points toward consumer long-context inference becoming less absurdly hardware-gated

// TAGS

turboquant-for-llama-3.1-70bturboquantllama-3.1-70bllminferenceedge-aiopen-sourceresearch

DISCOVERED

90d ago

2026-04-22

PUBLISHED

90d ago

2026-04-21

RELEVANCE

9/ 10

AUTHOR

Zealousideal_Cat1508

// KEEP READING

More AI developer news from the feed

EXPLORE FULL FEED

MODEL27m ago

Google launches Gemini 3.5 Flash-Lite

Google launched Gemini 3.5 Flash-Lite, its fastest and most cost-effective 3.5 model delivering speeds up to 350 output tokens per second. Optimized for low-latency agentic search and bulk document processing, it is available via Gemini API and Google AI Studio starting at $0.30 per million input tokens.

NEWS35m ago

ElevenLabs Skills crosses 40,000 installations

ElevenLabs Skills has crossed 40,000 installations, offering open-source building blocks to integrate voice, music, and sound capabilities into AI agents. Developers can install the library directly using the npx skills command.

UPDATE39m ago

Vercel AI Gateway adds Gemini Flash models

Vercel has integrated Google's Gemini 3.6 Flash and Gemini 3.5 Flash-Lite into its AI Gateway, enabling developers to query these models via the Vercel AI SDK. This update brings improved agentic workflows, coding capabilities, and cost efficiency to the gateway with unified routing, tracking, and retries.