REDDIT // 14h ago // BENCHMARK RESULT

PFlash claims 10x 128K prefill speedup

PFlash is an open-source speculative-prefill stack for long-context llama.cpp workloads on consumer GPUs. In the thread, “warmed steady-state” means the model is already loaded and the benchmark is running after warmup, so the 169.3 s figure reflects a hot, resident engine rather than a first-run cold start.

// ANALYSIS

Hot take: “warmed steady-state” is benchmark shorthand for “the process is already hot,” not “KV cache makes the whole prompt free.” It strips out startup overhead, but the 128K prefill tax is still there.
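A minimal sketch of how a cold figure and a warmed figure can come out of the same workload. Everything here is hypothetical: `FakeEngine`, `cold_and_warm`, and the simulated clock are illustration-only names, not part of PFlash or llama.cpp. The sketch also assumes that the whole gap between the thread's 248 s cold number and its 169.3 s warmed number is one-time startup cost, which the thread does not break down.

```python
import time
from statistics import mean
from typing import Callable, Tuple


def cold_and_warm(run: Callable[[], None],
                  warm_runs: int = 3,
                  clock: Callable[[], float] = time.perf_counter) -> Tuple[float, float]:
    """Time the first call (cold) separately from the mean of later calls (warm)."""
    start = clock()
    run()
    cold = clock() - start

    warm_samples = []
    for _ in range(warm_runs):
        start = clock()
        run()
        warm_samples.append(clock() - start)
    return cold, mean(warm_samples)


class FakeEngine:
    """Hypothetical stand-in for an inference process.

    It pays a one-time init cost on the first request, then a fixed
    per-request prefill cost. Time is simulated, so nothing actually sleeps.
    """

    def __init__(self, init_cost: float, request_cost: float) -> None:
        self.init_cost = init_cost
        self.request_cost = request_cost
        self.loaded = False
        self.now = 0.0  # simulated clock, in seconds

    def clock(self) -> float:
        return self.now

    def run(self) -> None:
        if not self.loaded:
            self.now += self.init_cost  # model load, kernel warmup, etc.
            self.loaded = True
        self.now += self.request_cost   # the prefill itself, paid every time


# Assumption: 248 s cold minus 169.3 s warm is treated as pure startup overhead.
engine = FakeEngine(init_cost=248.0 - 169.3, request_cost=169.3)
cold, warm = cold_and_warm(engine.run, clock=engine.clock)
overhead = cold - warm
print(f"cold={cold:.1f}s warm={warm:.1f}s overhead={overhead:.1f}s "
      f"({overhead / cold:.1%} of cold)")
```

Under that assumption, warming removes about 78.7 s, roughly 32% of the cold figure. The remaining 169.3 s is paid on every request that has to prefill the full prompt.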

– In llama.cpp terms, warm usually means the model, GPU kernels, allocator state, and OS/page caches are already initialized, so the next run avoids first-request penalties.
– KV caching is a separate effect: it helps only when the same prefix is reused across turns in a live chat session, which is why multi-turn conversations often feel much faster after the initial exchange.
– The 248 s cold number is first-request pain; the 169.3 s warmed number is the more realistic steady-state cost for a server that stays up.
– Even warmed, the runtime is still dominated by long-context prefill, because attention cost grows roughly quadratically with context length; that scaling remains the real bottleneck.
– If you are evaluating UX, compare cold numbers for the “first load” experience and warmed numbers for the “always-on service” experience.
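The difference between prefix reuse and a warm process can be sketched in a few lines. This is a toy model, not how llama.cpp or PFlash implement caching: `common_prefix_len`, `tokens_to_prefill`, and `relative_attention_cost` are hypothetical helpers, token IDs are made up, and the cost function counts only the quadratic attention term while ignoring everything that scales linearly.

```python
from typing import Sequence


def common_prefix_len(cached: Sequence[int], prompt: Sequence[int]) -> int:
    """Number of leading tokens shared by the cached sequence and the new prompt."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


def tokens_to_prefill(cached: Sequence[int], prompt: Sequence[int]) -> int:
    """Tokens that still need prefill once the shared prefix is served from the KV cache."""
    return len(prompt) - common_prefix_len(cached, prompt)


def relative_attention_cost(n_tokens: int, baseline_tokens: int) -> float:
    """Attention-only cost ratio under a pure O(n^2) model (a simplification)."""
    return (n_tokens / baseline_tokens) ** 2


turn1 = list(range(1000))            # hypothetical token IDs for the first turn
turn2 = turn1 + [5000, 5001, 5002]   # same prefix plus three new tokens

print(tokens_to_prefill(turn1, turn2))  # follow-up turn: only the new tail
print(tokens_to_prefill([], turn2))     # no cached prefix: the whole prompt
print(relative_attention_cost(128 * 1024, 32 * 1024))
```

A one-shot 128K benchmark prompt has no cached prefix to reuse, so it sits in the second case: the whole prompt is prefilled even when the process is warm. Under the quadratic model, a 128K prompt costs 16 times the attention work of a 32K prompt.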

// TAGS

pflash · llm · long-context · inference · benchmark · gpu · open-source

DISCOVERED

14h ago

2026-05-02

PUBLISHED

15h ago

2026-05-02

RELEVANCE

8/10

AUTHOR

alex20_202020