TileRT optimizes Large Language Model execution on GPUs by using persistent kernels to minimize microsecond-scale execution gaps and enable ultra-low-latency serving.
TileRT is a tile-level runtime engine, developed in collaboration with Xiaomi, that optimizes LLM execution on GPUs by replacing traditional per-operator kernel launches with persistent kernels. Eliminating the microsecond-scale gaps between launches lets it sustain high token throughput and ultra-low latency on commodity hardware.
Eliminating operator launch overhead with persistent kernels shifts the LLM serving bottleneck on commodity hardware from raw compute to memory and execution efficiency.
- Persistent kernels reduce CPU-GPU communication overhead, which is critical for low-batch, high-speed interactive generation.
- Co-development with Xiaomi indicates a high priority on optimizing LLMs for edge-adjacent or consumer-grade hardware architectures.
- Moving from per-operator optimization to tile-level runtime orchestration represents a mature step in deep learning compilation.
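
A toy cost model shows why microsecond-scale launch gaps matter so much for low-batch generation. The sketch below is not TileRT code and does not use its API: the `DecodeStep` class and the operator count, per-operator compute time, and launch gap are all hypothetical values chosen only to make the arithmetic concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodeStep:
    """Hypothetical cost model for generating one token (all values assumed)."""

    num_ops: int           # operators executed per token
    compute_us: float      # GPU compute time per operator, in microseconds
    launch_gap_us: float   # idle gap paid for each kernel launch, in microseconds

    def per_operator_latency_us(self) -> float:
        # Each operator is its own kernel launch, so each one pays the gap.
        return self.num_ops * (self.compute_us + self.launch_gap_us)

    def persistent_latency_us(self) -> float:
        # One long-lived kernel is launched once; operators then run
        # back to back inside it without returning to the host.
        return self.launch_gap_us + self.num_ops * self.compute_us


# Illustrative numbers only, not measurements of TileRT or any real model.
step = DecodeStep(num_ops=500, compute_us=8.0, launch_gap_us=4.0)
per_op = step.per_operator_latency_us()
persistent = step.persistent_latency_us()

print(f"per-operator launches: {per_op:.0f} us/token")
print(f"persistent kernel:     {persistent:.0f} us/token")
print(f"speedup:               {per_op / persistent:.2f}x")
```

In this model the gap term grows with the number of operators under per-operator launching but is paid only once with a persistent kernel, so the saving is largest when each operator does little work, as in small-batch decoding.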
Discovered: 2026-06-14
Published: 2026-06-14
Author: Better Stack