OPEN_SOURCE
REDDIT // 25d ago // INFRASTRUCTURE
LightOnOCR-2 memory spike sparks GLM-OCR caution
A LocalLLaMA user reports an OOM on a 16GB M4 MacBook Air when running LightOnOCR-2 in Transformers, and roughly 40GB of total allocation (11GB VRAM + 30GB RAM) in vLLM once a prompt is issued, then asks whether the GLM-OCR SDK will behave the same way. The post highlights a practical deployment gap between small parameter counts and the real memory footprint of multimodal OCR inference.
// ANALYSIS
Hot take: this looks more like expected multimodal inference behavior than a setup mistake, especially once generation starts and caches explode.
- Post-prompt spikes are common in OCR VLMs because image tokens, long outputs, and the KV cache can dominate memory far more than raw model weights.
- LightOnOCR-2's own guidance emphasizes rendering constraints and vLLM runtime flags, which suggests serving/config choices strongly affect peak memory.
- GLM-OCR is marketed as a compact 0.9B model with efficiency-focused decoding, but it is still in the same document-VLM class and can spike on large pages or long outputs.
- On 16-18GB unified-memory laptops, reliable local runs usually require stricter page batching, lower pixel budgets, and tighter token limits.
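The pixel-budget point above can be sketched with back-of-envelope arithmetic. This assumes a ViT-style encoder where each visual token covers a 28x28 pixel area (a common patch-plus-merge layout, not a confirmed config for either model), and typical small-model layer/head counts; the numbers are illustrative only.

```python
# Rough sketch: why page resolution, not parameter count, drives memory
# for document VLMs. All model dimensions below are hypothetical.

PATCH = 28  # pixels per side covered by one visual token (assumed)

def visual_tokens(width: int, height: int) -> int:
    """Visual tokens produced by rendering a page at the given resolution."""
    return (width // PATCH) * (height // PATCH)

def kv_cache_gib(tokens: int, layers: int = 24, kv_heads: int = 8,
                 head_dim: int = 128, dtype_bytes: int = 2) -> float:
    """KV cache size: K and V each hold layers * kv_heads * head_dim
    values per token, here in fp16/bf16."""
    return 2 * layers * kv_heads * head_dim * tokens * dtype_bytes / 2**30

full_res = visual_tokens(2480, 3508)   # A4 scan at ~300 DPI
capped   = visual_tokens(1240, 1754)   # same page, half resolution

print(full_res, capped)                # capping pixels cuts tokens ~4x
print(round(kv_cache_gib(full_res), 2), round(kv_cache_gib(capped), 2))
```

Weights for a 0.9B model are under 2GB in bf16, so a spike to tens of GB is dominated by per-token state like this, plus activations and framework overhead, which is why lower pixel budgets and tighter token limits shrink peaks so effectively.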
// TAGS
lightonocr-2 · glm-ocr · multimodal · inference · gpu · sdk · vllm
DISCOVERED
2026-03-17
PUBLISHED
2026-03-17
RELEVANCE
7 / 10
AUTHOR
ShOkerpop