OPEN_SOURCE ↗
REDDIT · 14h ago · BENCHMARK RESULT
PFlash claims 10x 128K prefill speedup
PFlash is an open-source speculative-prefill stack for long-context llama.cpp workloads on consumer GPUs. In the thread, “warmed steady-state” means the model is already loaded and the benchmark is running after warmup, so the 169.3 s figure reflects a hot, resident engine rather than a first-run cold start.
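The cold-vs-warm distinction can be made concrete with a minimal timing sketch. This is not PFlash's actual harness; the `run_prefill` function and its sleep durations are placeholders standing in for a real llama.cpp request, illustrating only the measurement pattern (pay init once, discard a warmup run, then time steady-state repeats).

```python
import time
import statistics

def run_prefill(simulated_init_s=0.0):
    # Stand-in for a real inference request; the sleeps are
    # placeholder costs, not PFlash's measured numbers.
    time.sleep(simulated_init_s)  # one-time model/kernel init
    time.sleep(0.01)              # per-request prefill work

# Cold run: pays initialization on top of the prefill itself.
t0 = time.perf_counter()
run_prefill(simulated_init_s=0.05)
cold = time.perf_counter() - t0

# Warmed steady-state: one unmeasured warmup, then time repeats.
run_prefill()  # warmup, discarded
warm_samples = []
for _ in range(3):
    t0 = time.perf_counter()
    run_prefill()
    warm_samples.append(time.perf_counter() - t0)
warm = statistics.median(warm_samples)

print(f"cold ≈ {cold:.3f}s, warm ≈ {warm:.3f}s")
```

The gap between `cold` and `warm` is exactly the startup overhead that "warmed steady-state" excludes; the per-request prefill cost appears in both numbers.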
// ANALYSIS
Hot take: “warmed steady-state” is benchmark shorthand for “the process is already hot,” not “KV cache makes the whole prompt free.” It strips out startup overhead, but the 128K prefill tax is still there.
- In llama.cpp terms, warm usually means the model, GPU kernels, allocator state, and OS/page caches are already initialized, so the next run avoids first-request penalties.
- KV caching helps when the same prefix is reused across turns in a live chat session; that is why multi-turn conversations often feel much faster after the initial exchange.
- The 248 s cold number is first-request pain; the 169 s warmed number is the more realistic steady-state cost for a server that stays up.
- Even warmed, the runtime is still dominated by long-context prefill, so the quadratic scaling problem remains the real bottleneck.
- If you are evaluating UX, compare cold numbers for "first load" experience and warmed numbers for "always-on service" experience.
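The quadratic-scaling point above can be sketched with a back-of-the-envelope FLOP count. This is a rough model, not a profile of PFlash: it counts only the n²-scaling attention score/value matmuls per layer, with illustrative (assumed) model dimensions, and omits the linear-scaling MLP and projection terms.

```python
def attention_prefill_flops(n_tokens, d_model, n_layers):
    # Rough per-layer attention cost: the QK^T and PV matmuls are
    # each ~ n_tokens^2 * d_model multiply-accumulates.
    # MLP/projection terms (which scale linearly in n) are omitted.
    return n_layers * 2 * (n_tokens ** 2) * d_model

# Illustrative dimensions (assumed, roughly 7B-class): not PFlash's model.
base = attention_prefill_flops(8_192, d_model=4096, n_layers=32)
long = attention_prefill_flops(131_072, d_model=4096, n_layers=32)

# 16x more tokens -> 256x more attention work.
print(f"128K / 8K attention cost ratio: {long / base:.0f}x")
```

Going from an 8K to a 128K prompt multiplies the attention term by 256, which is why warming the process shaves tens of seconds but cannot make a 128K prefill cheap.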
// TAGS
pflash · llm · long-context · inference · benchmark · gpu · open-source
DISCOVERED
14h ago
2026-05-02
PUBLISHED
15h ago
2026-05-02
RELEVANCE
8 / 10
AUTHOR
alex20_202020