PFlash tops llama.cpp, 10x on 3090

// 50d ago · BENCHMARK RESULT

PFlash is an in-process speculative prefill stack for long-context LLM inference. It cuts time to first token (TTFT) for Qwen3.6-27B at 128K context on an RTX 3090 to 24.8 seconds, roughly 10x faster than vanilla llama.cpp. At runtime it is C++/CUDA-only, and it claims that end-to-end retrieval is preserved on NIAH (needle-in-a-haystack) single-needle tests.
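To put the headline figures in perspective, here is a back-of-envelope sketch. It assumes "128K" means 131,072 tokens and treats "roughly 10x" as exactly 10x; the implied baseline and throughput values are derived from those assumptions, not reported measurements.

```python
# Back-of-envelope arithmetic from the reported figures.
# Assumptions (not stated in the source): "128K" = 128 * 1024 tokens,
# and "roughly 10x" is taken as exactly 10x.

def effective_prefill_throughput(context_tokens: int, ttft_seconds: float) -> float:
    """Tokens of prompt processed per second of time to first token."""
    return context_tokens / ttft_seconds

CONTEXT_TOKENS = 128 * 1024   # assumed meaning of "128K"
PFLASH_TTFT_S = 24.8          # reported TTFT
SPEEDUP = 10.0                # "roughly 10x", taken at face value

# Baseline TTFT implied by the claimed speedup (derived, not measured).
implied_baseline_ttft_s = PFLASH_TTFT_S * SPEEDUP

pflash_tps = effective_prefill_throughput(CONTEXT_TOKENS, PFLASH_TTFT_S)
baseline_tps = effective_prefill_throughput(CONTEXT_TOKENS, implied_baseline_ttft_s)

print(f"implied baseline TTFT: {implied_baseline_ttft_s:.0f} s")
print(f"effective throughput:  {pflash_tps:.0f} vs {baseline_tps:.0f} tokens/s")
```

Under these assumptions the implied llama.cpp baseline is about 248 seconds, and the effective prompt throughput moves from roughly 530 to roughly 5,300 tokens per second. TTFT also includes work other than prefill, so these are effective rates rather than kernel throughput.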

// ANALYSIS

This is a legit systems engineering win, but the headline number is tightly scoped: long-context, single-GPU, consumer-VRAM, and a benchmark setup that favors attention-based token selection. The interesting part is not just the speedup, but that the authors composed techniques from two separate papers into one in-process runtime.

  • The result is stronger than a flash-attention tweak: FlashPrefill makes the drafter cheap enough at 128K, while speculative prefill trims the target’s work by skipping low-value spans.
  • The benchmark is directionally credible because it reports both cold-cache and warmed numbers, plus an end-to-end retrieval check, but NIAH single-needle is still an easy case for this kind of selector.
  • The main practical constraint is VRAM choreography: parking and unparking weights is part of the pipeline, so real-world latency depends on whether you keep the engine resident.
  • This is useful for people shipping 27B-class local inference on 24 GB cards, but it is not a drop-in llama.cpp replacement, and tuning keep_ratio and alpha will matter a lot.
  • The repo matters as much as the benchmark: an open C++/CUDA implementation lowers the gap between research ideas and something practitioners can actually test.
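The span-skipping idea in the bullets above can be sketched as follows. This is a minimal illustration, not PFlash's implementation: `select_spans`, `span_len`, and the mean-score aggregation are hypothetical choices, the per-token scores stand in for whatever importance signal the drafter produces, and `alpha` is left out because the summary does not define it.

```python
import math
from typing import List


def select_spans(scores: List[float], span_len: int, keep_ratio: float) -> List[int]:
    """Return sorted token indices to keep, chosen one span at a time.

    scores     -- per-token importance scores (assumed to come from a drafter)
    span_len   -- hypothetical fixed span size in tokens
    keep_ratio -- fraction of spans the target model still prefills
    """
    if span_len <= 0:
        raise ValueError("span_len must be positive")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    n = len(scores)
    # Split the prompt into contiguous spans; the last one may be shorter.
    spans = [(start, min(start + span_len, n)) for start in range(0, n, span_len)]
    # Score each span by its mean token score (an assumed aggregation).
    span_scores = [sum(scores[s:e]) / (e - s) for s, e in spans]

    # Keep the highest-scoring spans, then restore original token order.
    n_keep = max(1, math.ceil(keep_ratio * len(spans)))
    ranked = sorted(range(len(spans)), key=lambda i: span_scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [t for i in kept for t in range(*spans[i])]


# Toy prompt: 16 tokens in 4 spans; spans 1 and 3 carry the high scores.
toy_scores = [0.1] * 4 + [0.9] * 4 + [0.2] * 4 + [0.8] * 4
kept_tokens = select_spans(toy_scores, span_len=4, keep_ratio=0.5)
print(kept_tokens)  # -> [4, 5, 6, 7, 12, 13, 14, 15]
```

With `keep_ratio=0.5` the target would prefill only half of the prompt tokens in this toy case, which is where the savings come from. It also shows why a single-needle test is forgiving: one high-scoring span is easy to retain, while evidence spread thinly across many spans is more likely to be dropped.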
// TAGS
benchmark, inference, long-context, quantization, gpu, open-source, llm, pflash

DISCOVERED

50d ago

2026-05-01

PUBLISHED

50d ago

2026-05-01

RELEVANCE

9/10

AUTHOR

sandropuppo