OPEN_SOURCE
REDDIT · 5d ago · TUTORIAL
Unity Android LLM hits 16.6 tok/s
This reference project shows how to run an offline LLM inside Unity on Android using llama.cpp’s Adreno OpenCL backend. The setup cuts generation time from 523 seconds to 9 seconds on Snapdragon 8 Gen 3, with Qwen3-1.7B Q8_0 as the final model choice.
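For context, a minimal sketch of what the native side of this kind of setup can look like: loading a GGUF model through llama.cpp's C API with layers offloaded to the GPU. The function names below track a recent llama.cpp release and have shifted between versions; the context size and layer count are illustrative assumptions, not values taken from the project.

```c
// Minimal llama.cpp model-loading sketch (C API). Assumes llama.cpp was
// built with the OpenCL backend enabled (-DGGML_OPENCL=ON) and
// cross-compiled for Android; names follow a recent release.
#include <stdio.h>
#include "llama.h"

int load_model(const char *gguf_path) {
    llama_backend_init();

    struct llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;  // offload every layer to the GPU backend

    struct llama_model *model = llama_model_load_from_file(gguf_path, mparams);
    if (!model) {
        fprintf(stderr, "failed to load %s\n", gguf_path);
        llama_backend_free();
        return 1;
    }

    struct llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 2048;  // illustrative context size, not the project's value

    struct llama_context *ctx = llama_init_from_model(model, cparams);
    if (!ctx) {
        llama_model_free(model);
        llama_backend_free();
        return 1;
    }

    // ... tokenize, decode, and sample here ...

    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```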
// ANALYSIS
This is a deployment win more than a model win: the key lesson is that on mobile GPUs, the fastest quantization is not always the smallest one.
- The speedup is driven by backend choice, not a new model architecture, which makes this useful for anyone trying to ship edge inference in real apps
- The Q8_0 vs Q4_0 result is the important takeaway; dequantization overhead can erase the theoretical gains of lower-bit quantization on Adreno
- The Unity bridge detail matters because native LLM structs are brittle across P/Invoke boundaries; the C wrapper is the difference between a demo and a shippable integration (see the sketch after this list)
- The benchmarks also show how bad the alternatives were in practice: CPU inference was unusable, QNN barely hit the NPU, and Unity’s renderer broke the GPU path for LiteRT-LM
- This is strong evidence that Android game and app teams should treat llama.cpp + OpenCL as a serious fallback path for offline generation on Snapdragon devices
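On the wrapper point, here is a sketch of the flat C ABI pattern such a bridge typically uses. The names (llm_create, llm_generate, llm_destroy) are hypothetical and not taken from the project; the point is that only an opaque pointer, primitives, and caller-owned buffers cross the P/Invoke boundary, so llama.cpp's struct layouts never leak into C#.

```c
// Hypothetical flat C ABI for a Unity P/Invoke bridge. None of these names
// come from the project; they illustrate hiding llama.cpp state behind an
// opaque handle and passing only primitives and caller-owned buffers.
#include <stdint.h>

#ifdef __cplusplus
extern "C" {
#endif

typedef struct llm_session llm_session;  // opaque: layout never crosses P/Invoke

// Returns NULL on failure; model_path is a UTF-8 path to a GGUF file.
llm_session *llm_create(const char *model_path, int32_t n_gpu_layers);

// Writes up to out_cap - 1 bytes of UTF-8 output plus a NUL terminator.
// Returns the number of bytes written, or a negative error code.
int32_t llm_generate(llm_session *s, const char *prompt,
                     char *out, int32_t out_cap);

void llm_destroy(llm_session *s);

#ifdef __cplusplus
}
#endif
```

On the Unity side, each function maps to a [DllImport] declaration, and because the ABI is just a pointer plus int32 values and byte buffers, it stays stable even as llama.cpp's internal structs change between releases.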
// TAGS
unity-android · ondevice-llm · llm · edge-ai · gpu · inference · open-source
DISCOVERED
2026-04-06
PUBLISHED
2026-04-06
RELEVANCE
8/10
AUTHOR
Vivid-Usual237