OPEN_SOURCE
REDDIT · 5d ago · TUTORIAL
Unity Android LLM hits 16.6 tok/s
This reference project shows how to run an offline LLM inside Unity on Android using llama.cpp’s Adreno OpenCL backend. The setup cuts generation time from 523 seconds to 9 seconds on Snapdragon 8 Gen 3, with Qwen3-1.7B Q8_0 as the final model choice.
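For context, a minimal sketch of what the native side of this kind of setup can look like: loading a GGUF model through llama.cpp's C API with layers offloaded to the GPU. The function names below track a recent llama.cpp release and have shifted between versions; the context size and layer count are illustrative assumptions, not values taken from the project.

```c
// Minimal llama.cpp model-loading sketch (C API). Assumes llama.cpp was
// built with the OpenCL backend enabled (-DGGML_OPENCL=ON) and
// cross-compiled for Android; names follow a recent release.
#include <stdio.h>
#include "llama.h"

int load_model(const char *gguf_path) {
    llama_backend_init();

    struct llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;  // offload every layer to the GPU backend

    struct llama_model *model = llama_model_load_from_file(gguf_path, mparams);
    if (!model) {
        fprintf(stderr, "failed to load %s\n", gguf_path);
        llama_backend_free();
        return 1;
    }

    struct llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 2048;  // illustrative context size, not the project's value

    struct llama_context *ctx = llama_init_from_model(model, cparams);
    if (!ctx) {
        llama_model_free(model);
        llama_backend_free();
        return 1;
    }

    // ... tokenize, decode, and sample here ...

    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```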
// ANALYSIS
This is a deployment win more than a model win: the key lesson is that on mobile GPUs, the fastest quantization is not always the smallest one.
- The speedup is driven by backend choice, not a new model architecture, which makes this useful for anyone trying to ship edge inference in real apps
- The Q8_0 vs Q4_0 result is the important takeaway; dequantization overhead can erase the theoretical gains of lower-bit quantization on Adreno
- The Unity bridge detail matters because native LLM structs are brittle across P/Invoke boundaries; the C wrapper is the difference between a demo and a shippable integration (see the sketch after this list)
- The benchmarks also show how bad the alternatives were in practice: CPU inference was unusable, QNN barely hit the NPU, and Unity’s renderer broke the GPU path for LiteRT-LM
- This is strong evidence that Android game and app teams should treat llama.cpp + OpenCL as a serious fallback path for offline generation on Snapdragon devices
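On the wrapper point, here is a sketch of the flat C ABI pattern such a bridge typically uses. The names (llm_create, llm_generate, llm_destroy) are hypothetical and not taken from the project; the point is that only an opaque pointer, primitives, and caller-owned buffers cross the P/Invoke boundary, so llama.cpp's struct layouts never leak into C#.

```c
// Hypothetical flat C ABI for a Unity P/Invoke bridge. None of these names
// come from the project; they illustrate hiding llama.cpp state behind an
// opaque handle and passing only primitives and caller-owned buffers.
#include <stdint.h>

#ifdef __cplusplus
extern "C" {
#endif

typedef struct llm_session llm_session;  // opaque: layout never crosses P/Invoke

// Returns NULL on failure; model_path is a UTF-8 path to a GGUF file.
llm_session *llm_create(const char *model_path, int32_t n_gpu_layers);

// Writes up to out_cap - 1 bytes of UTF-8 output plus a NUL terminator.
// Returns the number of bytes written, or a negative error code.
int32_t llm_generate(llm_session *s, const char *prompt,
                     char *out, int32_t out_cap);

void llm_destroy(llm_session *s);

#ifdef __cplusplus
}
#endif
```

On the Unity side, each function maps to a [DllImport] declaration, and because the ABI is just a pointer plus int32 values and byte buffers, it stays stable even as llama.cpp's internal structs change between releases.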
// TAGS
unity-android · ondevice-llm · llm · edge-ai · gpu · inference · open-source
DISCOVERED
2026-04-06
PUBLISHED
2026-04-06
RELEVANCE
8/10
AUTHOR
Vivid-Usual237