OPEN_SOURCE ↗
REDDIT · 1d ago · BENCHMARK RESULT

PFlash tops llama.cpp, 10x on 3090

PFlash is an in-process speculative prefill stack for long-context LLM inference that cuts time-to-first-token (TTFT) for Qwen3.6-27B at 128K context on an RTX 3090 to 24.8 seconds, roughly 10x faster than vanilla llama.cpp. It stays C++/CUDA-only at runtime and claims end-to-end retrieval preservation on NIAH single-needle tests.
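
// SKETCH: SPECULATIVE PREFILL TOKEN SELECTION

A minimal C++ sketch of the core idea, not PFlash's actual API: a cheap drafter assigns each prompt token an attention-derived importance score, and only the top keep_ratio fraction is forwarded to the target model's prefill. The names select_prefill_tokens, drafter_scores, and keep_ratio are assumptions for illustration.

    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Keep the top keep_ratio fraction of prompt tokens by drafter score, then
    // restore prompt order so positions stay consistent for the target's prefill.
    std::vector<size_t> select_prefill_tokens(const std::vector<float>& drafter_scores,
                                              float keep_ratio) {
        const size_t n = drafter_scores.size();
        const size_t keep = std::max<size_t>(1, static_cast<size_t>(n * keep_ratio));

        std::vector<size_t> idx(n);
        for (size_t i = 0; i < n; ++i) idx[i] = i;

        // Partial sort: the `keep` highest-scoring token indices end up first.
        std::nth_element(idx.begin(), idx.begin() + keep, idx.end(),
                         [&](size_t a, size_t b) { return drafter_scores[a] > drafter_scores[b]; });

        std::vector<size_t> kept(idx.begin(), idx.begin() + keep);
        std::sort(kept.begin(), kept.end());
        return kept;
    }

    int main() {
        // Toy drafter scores for an 8-token prompt; keep half of them.
        const std::vector<float> scores = {0.02f, 0.91f, 0.05f, 0.40f, 0.88f, 0.01f, 0.33f, 0.76f};
        for (size_t i : select_prefill_tokens(scores, 0.5f)) std::printf("%zu ", i);
        std::printf("\n");  // prints "1 3 4 7": the four highest-scoring positions, in prompt order
        return 0;
    }

PFlash scores spans rather than single tokens (per the analysis below), but the trade-off has the same shape: a lower keep_ratio means less target prefill work and more risk of dropping the needle.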

// ANALYSIS

This is a legit systems engineering win, but the headline number is tightly scoped: long-context, single-GPU, consumer-VRAM, and a benchmark setup that favors attention-based token selection. The interesting part is not just the speedup, but that they composed two separate papers into one in-process runtime.

  • The result is stronger than a flash-attention tweak: FlashPrefill makes the drafter cheap enough at 128K, while speculative prefill trims the target’s work by skipping low-value spans.
  • The benchmark is directionally credible because it reports both cold-cache and warmed numbers, plus an end-to-end retrieval check, but NIAH single-needle is still an easy case for this kind of selector.
  • The main practical constraint is VRAM choreography: parking and unparking weights is part of the pipeline, so real-world latency depends on whether you keep the engine resident (see the residency sketch after this list).
  • This is useful for people shipping 27B-class local inference on 24 GB cards, but it is not a drop-in llama.cpp replacement and tuning keep_ratio / alpha will matter a lot.
  • The repo matters as much as the benchmark: an open C++/CUDA implementation narrows the gap between research ideas and something practitioners can actually test.
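
// SKETCH: WEIGHT PARKING AND ENGINE RESIDENCY

A toy C++ illustration of the residency point above, not PFlash code: an engine that parks weights in pinned host memory and uploads them to VRAM only on first use pays the host-to-device copy once, so cold requests see a very different TTFT than warm ones. ParkedEngine and ensure_resident are invented names; only the CUDA runtime calls are real.

    #include <chrono>
    #include <cstdio>
    #include <cuda_runtime.h>

    struct ParkedEngine {
        size_t bytes = 0;
        void* host_weights = nullptr;    // pinned staging copy ("parked")
        void* device_weights = nullptr;  // filled lazily on first request ("unparked")

        explicit ParkedEngine(size_t n) : bytes(n) {
            cudaMallocHost(&host_weights, bytes);  // pinned, so the upload runs at full PCIe speed
        }
        ~ParkedEngine() {
            cudaFree(device_weights);
            cudaFreeHost(host_weights);
        }

        // Make weights resident in VRAM; a no-op once the engine is warm.
        void ensure_resident() {
            if (device_weights != nullptr) return;
            cudaMalloc(&device_weights, bytes);
            cudaMemcpy(device_weights, host_weights, bytes, cudaMemcpyHostToDevice);
        }
    };

    int main() {
        ParkedEngine engine(1ull << 30);  // pretend the weights are 1 GiB

        for (int request = 0; request < 2; ++request) {
            const auto t0 = std::chrono::steady_clock::now();
            engine.ensure_resident();     // cold on request 0, warm on request 1
            cudaDeviceSynchronize();
            const auto t1 = std::chrono::steady_clock::now();
            std::printf("request %d: residency cost %.1f ms\n", request,
                        std::chrono::duration<double, std::milli>(t1 - t0).count());
        }
        return 0;
    }

Whether that copy is amortized across many requests or paid per invocation is part of why the warmed and cold-cache numbers the benchmark reports can diverge.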
// TAGS
benchmark · inference · long-context · quantization · gpu · open-source · llm · pflash

DISCOVERED

1d ago (2026-05-01)

PUBLISHED

1d ago (2026-05-01)

RELEVANCE

9/10

AUTHOR

sandropuppo