PFlash claims 10x 128K prefill speedup
PFlash is an open-source speculative-prefill stack for long-context llama.cpp workloads on consumer GPUs. In the thread, “warmed steady-state” means the model is already loaded and the benchmark is running after warmup, so the 169.3 s figure reflects a hot, resident engine rather than a first-run cold start.
Hot take: “warmed steady-state” is benchmark shorthand for “the process is already hot,” not “KV cache makes the whole prompt free.” It strips out startup overhead, but the 128K prefill tax is still there.
- In llama.cpp terms, warm usually means the model, GPU kernels, allocator state, and OS/page caches are already initialized, so the next run avoids first-request penalties.
- KV caching helps when the same prefix is reused across turns in a live chat session; that is why multi-turn conversations often feel much faster after the initial exchange.
- The 248 s cold number is first-request pain; the 169 s warmed number is the more realistic steady-state cost for a server that stays up.
- Even warmed, the runtime is still dominated by long-context prefill, so the quadratic scaling problem remains the real bottleneck.
- If you are evaluating UX, compare cold numbers for “first load” experience and warmed numbers for “always-on service” experience.
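To put the cold and warmed figures side by side, here is a minimal Python sketch of the arithmetic. It takes the 248 s and 169.3 s numbers from the thread and assumes, for illustration only, that the whole gap is startup overhead and that attention cost is purely quadratic in context length. The function names and the 32K comparison point are hypothetical and not part of PFlash or llama.cpp.

```python
# Figures quoted in the thread: seconds for one long-context run.
COLD_S = 248.0   # first request, nothing initialized
WARM_S = 169.3   # warmed steady-state, model already resident


def startup_overhead(cold_s, warm_s):
    """Return (seconds, fraction of the cold run) not present in the warm run.

    Assumption: the entire cold/warm gap is one-time startup cost.
    """
    overhead = cold_s - warm_s
    return overhead, overhead / cold_s


def quadratic_cost_ratio(tokens_a, tokens_b):
    """Relative attention cost of tokens_b vs tokens_a.

    Assumption: cost is purely O(n^2); real prefill also has linear terms,
    so this overstates the ratio for the full forward pass.
    """
    return (tokens_b / tokens_a) ** 2


overhead_s, overhead_frac = startup_overhead(COLD_S, WARM_S)
print(f"startup overhead: {overhead_s:.1f} s ({overhead_frac:.1%} of the cold run)")
print(f"128K vs 32K attention cost: {quadratic_cost_ratio(32_768, 131_072):.0f}x")
```

Under these assumptions, warming removes about 78.7 s, roughly 32% of the cold run, while the remaining 169.3 s is the prefill cost that warmup cannot touch.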
Discovered: 2026-05-02
Published: 2026-05-02
Author: alex20_202020