Spiral drops INT3 compression, Metal kernels for Mac

// 90d ago OPENSOURCE RELEASE

Spiral is a model compression framework from ReinforceAI for local LLM inference on Apple Silicon. It combines a novel INT3 quantization scheme with custom fused Metal kernels and a 2-bit KV cache so that large models such as Qwen 7B can run on consumer-grade Mac hardware with minimal accuracy loss.
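
For context on what 3-bit quantization involves, here is a minimal sketch of naive round-to-nearest quantization of a single weight group, the kind of baseline that INT3 schemes are usually compared against. It is not Spiral's algorithm: the function names, the group of 32 values, and the min/max scaling are all assumptions made for illustration.

```python
import random

def quantize_rtn(values, bits=3):
    """Naive round-to-nearest quantization of one group (illustrative baseline)."""
    levels = 2 ** bits - 1                  # 3 bits -> integer codes 0..7
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0       # fall back to 1.0 if the group is constant
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo

def dequantize_rtn(codes, scale, lo):
    """Map integer codes back to approximate float values."""
    return [c * scale + lo for c in codes]

# Hypothetical group of 32 weights drawn from a standard normal distribution.
rng = random.Random(0)
group = [rng.gauss(0.0, 1.0) for _ in range(32)]

codes, scale, lo = quantize_rtn(group, bits=3)
restored = dequantize_rtn(codes, scale, lo)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(f"distinct codes: {len(set(codes))}, max abs error: {max_err:.4f}")
```

In this baseline the reconstruction error of each weight is bounded by half a quantization step, and with only eight levels per group a single outlier widens that step for every other value in the group. That is the general weakness more elaborate 3-bit schemes try to address.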

// ANALYSIS

Spiral pushes local inference on Apple Silicon into INT3 territory, with better memory efficiency than standard 4-bit quantization for long-horizon tasks.

–The geometric compression approach reportedly reaches state-of-the-art INT3 quality at a perplexity cost of only +0.14 nats, a claimed 101x improvement over naive 3-bit methods.
–Custom fused Metal kernels combine PQ decode, RoPE, and attention into a single pass to speed up both prefill and decode on Mac GPUs.
–The 2-bit KV cache increases context capacity by 1.75x, allowing an 8 GB Mac to handle context windows of up to 113K tokens.
–The calibration-free workflow needs no fine-tuning or gradient updates, which simplifies deploying compressed models for researchers and developers.
–Spiral is currently in preview with Qwen 7B support; planned updates aim to support models of up to 100B parameters and to add Triton kernels for GPU support.
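
The memory claims above rest on simple bit-width arithmetic, which can be sketched as follows. The parameter count and the layer, KV-head, and head-dimension values are placeholders chosen for illustration, not figures from the Spiral release, and the estimate ignores scales, codebooks, activations, and OS overhead. The 1.75x and 113K figures depend on details not stated here, so this sketch does not try to reproduce them.

```python
def weight_bytes(n_params, bits_per_weight):
    """Approximate weight storage, ignoring scales, codebooks, and other metadata."""
    return n_params * bits_per_weight / 8

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bits):
    """KV cache bytes per token: one K and one V vector per layer and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bits / 8

# Placeholder model shape for illustration only; not taken from the Spiral release.
N_PARAMS = 7e9
N_LAYERS, N_KV_HEADS, HEAD_DIM = 28, 4, 128

for bits in (16, 4, 3):
    print(f"weights @ {bits:>2}-bit: {weight_bytes(N_PARAMS, bits) / 1e9:.3f} GB")

for bits in (16, 2):
    per_token = kv_bytes_per_token(N_LAYERS, N_KV_HEADS, HEAD_DIM, bits)
    print(f"KV cache @ {bits:>2}-bit: {per_token:.0f} bytes per token")
```

Under these assumptions, moving weights from 4 to 3 bits saves a quarter of the weight memory, and a 2-bit KV cache stores each token in one eighth of the space of a 16-bit cache. Real savings are smaller once per-group metadata is counted.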

// TAGS

spiral, llm, quantization, edge-ai, gpu, open-source, macos, apple-silicon

DISCOVERED

90d ago

2026-04-22

PUBLISHED

90d ago

2026-04-22

RELEVANCE

8/10

AUTHOR

Financial_Buy_2287
