OPEN_SOURCE
REDDIT · 3h ago · INFRASTRUCTURE
llama.cpp cold-starts A770 servers on Windows 11
A Reddit user reports that a llama.cpp server on Windows 11, running on an Intel Arc A770 in a dual-Xeon system on the C612 chipset, repeatedly unloads the model after idle periods, even with BIOS and Windows power-saving options disabled. When the next API request arrives, the model can take a long time to reload in chunks, suggesting a cold-start path rather than normal steady-state inference. The only reliable workaround they found was a polling script that sends a tiny completion every 30 seconds to keep the server active.
// ANALYSIS
This looks more like an idle-path or driver/runtime issue than a simple display-output problem.
- The slow "reload in gigabyte increments" points to model state being paged out or reinitialized, not just the GPU entering a light sleep state.
- Intel Arc on Windows plus an older workstation chipset is a plausible edge-case combination for PCIe power management or driver behavior.
- The polling workaround is practical, but it is a keepalive hack, not a fix.
- For operators, the real question is whether llama.cpp is unloading on its own, or whether Windows/driver behavior is forcing the backend to rehydrate VRAM and host memory after idleness.
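The keepalive workaround described in the post can be sketched as a small polling loop against the llama.cpp server's `/completion` endpoint. This is an illustrative sketch, not the poster's actual script: the server address, the one-token `n_predict` payload, and the error handling are assumptions; only the 30-second cadence comes from the post.

```python
# Keepalive sketch for a llama.cpp HTTP server (hypothetical; the
# address and payload are assumptions, only the 30 s interval is
# taken from the Reddit post).
import json
import time
import urllib.request

SERVER = "http://127.0.0.1:8080"  # assumed default llama.cpp server address
INTERVAL_S = 30                    # cadence the poster reported using

def build_ping() -> bytes:
    # A tiny completion request: asking for a single token should be
    # enough to keep the model resident without meaningful load.
    return json.dumps({"prompt": "ping", "n_predict": 1}).encode()

def keepalive_once(url: str = SERVER) -> int:
    """Send one minimal completion request; return the HTTP status."""
    req = urllib.request.Request(
        url + "/completion",
        data=build_ping(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status

if __name__ == "__main__":
    while True:
        try:
            keepalive_once()
        except OSError:
            pass  # server busy or restarting; retry on the next tick
        time.sleep(INTERVAL_S)
```

A curl loop in Task Scheduler would do the same job; the point is simply that any periodic tiny request defeats whatever idle path is evicting the model.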
// TAGS
llama-cpp · intel arc · a770 · windows · gpu · llm serving · api · idle · keepalive
DISCOVERED
3h ago
2026-04-18
PUBLISHED
6h ago
2026-04-18
RELEVANCE
5/10
AUTHOR
Turbulent-Attorney65