Unity Android LLM hits 16.6 tok/s
OPEN_SOURCE
REDDIT // 5d ago · TUTORIAL

This reference project shows how to run an offline LLM inside Unity on Android using llama.cpp’s Adreno OpenCL backend. The setup cuts generation time from 523 seconds to 9 seconds on Snapdragon 8 Gen 3, with Qwen3-1.7B Q8_0 as the final model choice.
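Getting the Adreno path is a build-time switch in llama.cpp rather than a code change. A minimal cross-compile sketch, assuming a recent Android NDK at $ANDROID_NDK and llama.cpp's documented OpenCL flag (verify flag names against your checkout; paths are illustrative):

```shell
# Cross-compile llama.cpp for Android with the OpenCL (Adreno) backend.
# BUILD_SHARED_LIBS=ON so Unity can load the result via P/Invoke;
# GGML_OPENCL=ON selects the OpenCL backend instead of the CPU path.
cmake -B build-android -G Ninja \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-28 \
  -DBUILD_SHARED_LIBS=ON \
  -DGGML_OPENCL=ON
cmake --build build-android --config Release
```

The resulting shared library is what the Unity project loads at runtime; everything else in the speedup comes from which backend this build step enables.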

// ANALYSIS

This is a deployment win more than a model win: the key lesson is that on mobile GPUs, the fastest quantization is not always the smallest one.

  • The speedup is driven by backend choice, not a new model architecture, which makes this useful for anyone trying to ship edge inference in real apps
  • The Q8_0 vs Q4_0 result is the important takeaway; dequantization overhead can erase the theoretical gains of lower-bit quantization on Adreno
  • The Unity bridge detail matters because native LLM structs are brittle across P/Invoke boundaries; the C wrapper is the difference between a demo and a shippable integration
  • The benchmarks also show how bad the alternatives were in practice: CPU inference was unusable, QNN barely hit the NPU, and Unity’s renderer broke the GPU path for LiteRT-LM
  • This is strong evidence that Android game and app teams should treat llama.cpp + OpenCL as a serious fallback path for offline generation on Snapdragon devices
// TAGS
unity-android-ondevice-llm · llm · edge-ai · gpu · inference · open-source

DISCOVERED

2026-04-06

PUBLISHED

2026-04-06

RELEVANCE

8/10

AUTHOR

Vivid-Usual237