Qwen3.5-24B REAP squeezes agentic coding into 16GB
A LocalLLaMA contributor released a 32% expert-pruned GGUF variant of Qwen3.5-35B-A3B aimed at coding and agentic workflows on lower-VRAM hardware. The release includes quantized checkpoints, pruning/quantization scripts, and a reproducible Modal pipeline.
This is a practical community optimization drop, not a new base model, but it materially lowers the barrier to running strong MoE coding models locally.
- The model trims experts from 256 to 175 while keeping ~3B active parameters per token, targeting better memory efficiency.
- The recommended IQ4_K_S GGUF is positioned for 16GB-class GPUs, which is the core value proposition here.
- The author shares full replication assets (REAP fork + Modal scripts), making this useful for other quantizers and pruning experiments.
- Calibration limits (1024 context and memory pressure during profiling) suggest further quality/perf gains are still possible.
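The figures above can be sanity-checked with a little arithmetic. The sketch below confirms that going from 256 to 175 experts is roughly a 32% cut, then gives a rough weight-only size estimate. The 24B total parameter count comes from the headline. The 4.25 bits per weight is an assumed placeholder for a 4-bit-class quant, not a published figure for IQ4_K_S. The helper names are hypothetical, and KV cache and runtime overhead are ignored.

```python
# Back-of-envelope check of the pruning ratio and a rough weight-size estimate.
# Assumptions (not from the release): 4.25 bits/weight is a placeholder for a
# 4-bit-class quant; 24e9 total parameters is taken from the headline name.

def pruned_fraction(experts_before: int, experts_after: int) -> float:
    """Fraction of experts removed by pruning."""
    return (experts_before - experts_after) / experts_before


def approx_weight_gib(total_params: float, bits_per_weight: float) -> float:
    """Weight-only size in GiB; ignores KV cache and runtime overhead."""
    return total_params * bits_per_weight / 8 / 2**30


frac = pruned_fraction(256, 175)
print(f"experts removed: {frac:.1%}")  # 31.6%, i.e. the quoted ~32%

size = approx_weight_gib(24e9, 4.25)  # assumed bits/weight, see note above
print(f"approx weights: {size:.1f} GiB")
```

Under these assumptions the weights alone come in below 16GB, which is consistent with the 16GB-class positioning. Actual headroom depends on context length and the real quantization mix.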
DISCOVERED
2026-03-05
PUBLISHED
2026-03-04
AUTHOR
tubuntu2