llama.cpp adds PDL support for Blackwell inference boost
llama.cpp now supports NVIDIA's Programmatic Dependent Launch (PDL) on Hopper and Blackwell architectures. The opt-in compile flag overlaps CUDA kernel scheduling to reduce idle time, yielding up to a 10% boost in token generation throughput.
Because PDL relies on hardware support for overlapping dependent kernel launches, the gain requires only a rebuild, which makes it a practical win for local inference users and deployments running on recent NVIDIA GPUs.
- PDL allows a secondary kernel to begin scheduling before the primary kernel it depends on has finished, which raises GPU utilization during memory-bound LLM token generation.
- The feature is currently opt-in and requires compiling with the `-D GGML_CUDA_PDL=ON` flag, so users who rely on default pre-built binaries may miss the performance gain.
- Prefill speeds remain largely unaffected, but the 5-10% generation speedup is a "free lunch" that is especially beneficial for large-batch inference or continuous serving.
- Users deploying on Blackwell should ensure they are using CUDA 12.8, as some earlier versions show regressions with specific kernels.
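As a rough sketch of the opt-in build described in the list above, the commands below configure and compile llama.cpp with the PDL flag. They assume you are in the root of a llama.cpp checkout with CMake and the CUDA toolkit installed. `GGML_CUDA=ON` is the usual switch for the CUDA backend, the `build` directory name is arbitrary, and the PDL option name is taken from the flag quoted above rather than verified against a particular release.

```shell
# Configure with the CUDA backend and the opt-in PDL option enabled.
# Assumption: run from the root of a llama.cpp checkout.
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_PDL=ON

# Compile in Release mode.
cmake --build build --config Release
```

Binaries built without the option would not include the speedup, so comparing token generation throughput before and after the rebuild is a reasonable way to confirm the effect on your own hardware.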
DISCOVERED
2026-05-23
PUBLISHED
2026-05-22
AUTHOR
UncleRedz
