Qwen3.5 397B tests DGX Spark limits
OPEN_SOURCE ↗
REDDIT // 32d ago // INFRASTRUCTURE

A LocalLLaMA thread asks whether Unsloth's roughly 115GB 2-bit quant of Qwen3.5-397B-A17B can realistically run on NVIDIA's 128GB DGX Spark and what token throughput users should expect. It is less a product announcement than a practical sizing debate about how far aggressive quantization can push frontier-scale local inference.

// ANALYSIS

This is a useful reality check for the "desktop supercomputer" pitch around local AI hardware: fitting a giant model on paper is not the same as running it well. The interesting question is not just model file size, but how much headroom remains for KV cache, runtime overhead, and usable context.

  • DGX Spark's 128GB unified memory puts a 115GB 2-bit quant right on the edge, leaving very little safety margin for real inference workloads
  • Unsloth's own Qwen3.5 guidance lists much higher total memory needs for 3-bit and 4-bit variants, which suggests 2-bit is a survival tactic, not a comfort zone
  • Token speed will vary sharply with context length, offloading strategy, and cache precision, so "it loads" and "it is usable" are two different outcomes
  • The thread matters because local AI builders increasingly care more about deployment economics than raw benchmark bragging rights
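The sizing debate above comes down to simple arithmetic: weights plus KV cache plus runtime overhead must fit under 128GB. A minimal back-of-envelope sketch, with all model-shape numbers (layer count, KV heads, head dimension, overhead) as illustrative assumptions rather than Qwen3.5-397B-A17B's published config:

```python
# Back-of-envelope headroom check: does a quantized model plus KV cache
# fit in unified memory? Shape parameters below are hypothetical.

GIB = 1024**3

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_val=2):
    # 2x for keys and values; fp16 cache assumed (bytes_per_val=2).
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_val

total_mem        = 128 * GIB   # DGX Spark unified memory
weights          = 115 * GIB   # ~115GB 2-bit quant from the thread
runtime_overhead = 4 * GIB     # assumed OS + inference-runtime buffers

# Assumed transformer shape, for illustration only.
cache = kv_cache_bytes(n_layers=60, n_kv_heads=8, head_dim=128, ctx_len=32_768)

headroom = total_mem - weights - runtime_overhead - cache
print(f"KV cache: {cache / GIB:.1f} GiB, headroom: {headroom / GIB:.1f} GiB")
# → KV cache: 7.5 GiB, headroom: 1.5 GiB
```

Even under these generous assumptions the margin is about 1.5 GiB at 32K context, which is why "it loads" and "it is usable" diverge: longer contexts or a higher-precision cache erase the headroom entirely.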
// TAGS
qwen3.5-397b-a17b · llm · inference · gpu · self-hosted

DISCOVERED

32d ago

2026-03-11

PUBLISHED

33d ago

2026-03-09

RELEVANCE

6/10

AUTHOR

aiko929