llama-server tweaks boost local LLM serving speed
A LocalLLaMA user shared optimization tips for running llama-server on systems with limited VRAM. Running with the -np 1 flag and disabling browser hardware acceleration free up VRAM, significantly boosting tokens-per-second output.
Simple configuration tweaks can dramatically improve local LLM inference speeds on consumer hardware by managing limited VRAM more efficiently. llama-server allocates context per parallel slot, so passing the -np 1 flag limits it to a single slot and reserves the full context budget for one user. In addition, disabling hardware acceleration in background applications such as Firefox releases VRAM that the browser otherwise holds for page rendering. Combined, these adjustments reportedly yield over a 20% increase in tokens per second, letting users run larger quantizations or achieve higher throughput on mainstream GPUs.
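As a sketch, a single-user launch might look like the following. The model filename, context size, and port are placeholders; -m, -c, -ngl, -np, and --port are standard llama-server options, but check your build's --help since defaults change between releases.

```shell
# Serve one model for a single client:
#   -np 1    -> one parallel slot, so all allocated context goes to one user
#   -c 8192  -> total context size (placeholder; size to fit your VRAM)
#   -ngl 99  -> offload as many layers as possible to the GPU
llama-server \
  -m ./models/example-7b-q4_k_m.gguf \
  -c 8192 \
  -ngl 99 \
  -np 1 \
  --port 8080
```

With more parallel slots (e.g. -np 4), the context is divided among slots, which is why dropping to a single slot frees VRAM for a longer context or a larger quantization.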
DISCOVERED
16d ago
2026-03-26
PUBLISHED
16d ago
2026-03-26
RELEVANCE
AUTHOR
ea_man