PTD cuts VRAM, speeds long context
Physical Token Dropping is a new open-source sparse transformer proof-of-concept that physically drops low-scored token segments during block execution, shipping with code and a Hugging Face Qwen2.5-0.5B keep-70 variant. The reported tradeoff is notable for long-context inference: up to 72.11% lower latency and 85.56% lower peak VRAM at 8K context, with modest quality loss on this small model.
This is the kind of scrappy inference optimization work AI developers actually care about: not a bigger model, but a concrete attempt to make long-context generation cheaper on commodity hardware. The catch is that it is still an early proof-of-concept on a 0.5B Qwen base, so the real question is whether the gains survive at larger scales and broader evals.
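The core mechanism described above can be sketched in a few lines. This is a hypothetical illustration, not PTD's actual code: it assumes each block scores its tokens (e.g. via a learned router), keeps only the top `keep_ratio` fraction (the "keep-70" variant would use 0.7), and physically shrinks the sequence so later blocks and the KV cache never see the dropped tokens. The function name and data layout are illustrative.

```python
# Hypothetical sketch of physical token dropping inside one transformer
# block: score each token, keep the top `keep_ratio` fraction, and return
# a physically shorter sequence (in original order). Downstream blocks
# and the KV cache then only pay for the surviving tokens.

def drop_tokens(hidden, scores, keep_ratio=0.7):
    """Keep the highest-scored tokens, preserving sequence order.

    hidden: per-token hidden states (any per-token payload works here).
    scores: one importance score per token, e.g. from a learned router.
    """
    n_keep = max(1, int(len(hidden) * keep_ratio))
    # Pick the n_keep best-scoring positions, then restore sequence order.
    best = sorted(range(len(hidden)), key=lambda i: scores[i], reverse=True)[:n_keep]
    return [hidden[i] for i in sorted(best)]

# Example: 10 tokens with a keep-70 policy -> 7 tokens survive.
hidden = [f"tok{i}" for i in range(10)]
scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.6, 0.15, 0.5]
kept = drop_tokens(hidden, scores, keep_ratio=0.7)
print(len(kept))  # 7
```

The savings compound because dropping is physical rather than masked: a masked token still occupies KV-cache memory and attention FLOPs, whereas a dropped one is simply absent from every later block.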
- PTD attacks one of the most painful bottlenecks in local LLM work: KV-cache growth and long-context memory pressure
- Shipping both the GitHub implementation and a Hugging Face model makes it easier for developers to inspect the method instead of treating it as a vague benchmark claim
- The reported 4K and 8K results look strong enough to earn attention, especially the VRAM reductions, but the accuracy tradeoff at 8K shows this is not a free win
- Because it relies on custom routing and remote code, adoption will depend on how cleanly it integrates into existing Transformers and inference workflows
DISCOVERED: 2026-03-10
PUBLISHED: 2026-03-10
AUTHOR: Repulsive_Ad_94