Gemma 4 MLX setup hits 19 tok/s

// 54d agoBENCHMARK RESULT

Gemma 4 MLX setup hits 19 tok/s

A Reddit post describes a minimal local Gemma 4 chat UI built with MLX and Flask for Apple Silicon Macs. The author reports about 19 tokens per second on an M4 MacBook with 16GB RAM and asks whether a 4-bit version can hold up in longer contexts.

// ANALYSIS

Strong niche utility for people who want a no-frills local LLM setup on Apple Silicon, but it reads more like a practical benchmark note than a polished product launch.

–The main signal is performance: ~19 tok/s on an M4 MacBook with 16GB is credible and useful for local-model shoppers.
–The setup choice matters: Flask + plain HTML lowers complexity and makes the workflow easier to reproduce than a full desktop app stack.
–Passing full conversation history each turn is good for narrative work, but it will pressure memory and context efficiency as chats grow.
–The 4-bit question is the real open item; long-context behavior is where these lightweight local setups usually start to trade quality for speed.

// TAGS

gemmamlxlocal-llmapple-siliconmacbook-m4flaskllm-inferencequantization

DISCOVERED

54d ago

2026-04-04

PUBLISHED

54d ago

2026-04-04

RELEVANCE

7/ 10

AUTHOR

Polstick1971

// KEEP READING

More AI developer news from the feed

EXPLORE FULL FEED

UPDATE27m ago

Claude Code 2.1.154 teases CLI fixes

The Claude Code X account says version 2.1.154 is about to be released, signaling another small maintenance update in Anthropic’s fast-moving CLI cadence. Recent Claude Code releases have focused on reliability, model-picker fixes, MCP handling, background-session polish, and other workflow rough edges, so this looks like a refinement patch rather than a major feature milestone.

MODEL31m ago

ElevenLabs Dubbing v2 keeps emotion intact

ElevenLabs says Dubbing v2 carries over the original performance, not just the transcript, across 90+ languages. The pitch is sync-aware phrasing and delivery that sounds acted, not machine-translated, for creators, marketers, and production teams.

MODEL53m ago

Gemini 3.5 Flash powers Archon UI design

Google's latest 3.5 Flash model integrates with the Archon coding harness to deliver high-fidelity frontend designs via specialized agentic workflows. The model features a 1M context window and optimized reasoning for autonomous, multi-step development tasks.