REDDIT // 4h ago · BENCHMARK RESULT

Document Redaction App benchmarks local agents

The Document Redaction App team benchmarked agent workflows on a seven-page redaction-and-review task using OCR and PII detection, comparing Sonnet 4.6, Composer 2.0, Qwen 3.6, and Kimi 2.5. The key result: the workflow is automatable end to end, but output quality still varies too much for unsupervised use.

// ANALYSIS

The real takeaway is not that local agents are “good enough” yet, but that the entire redaction workflow is now machine-executable on consumer hardware. That is a meaningful threshold, even if human review remains non-negotiable.

– Sonnet 4.6 was the most reliable, which fits the pattern that redaction depends less on raw intelligence than on disciplined tool use and visual accuracy.
– Qwen 3.6 completing the workflow locally on 24 GB of VRAM is the important systems signal: private redaction pipelines are becoming practical, even if output quality is still rough.
– Signature handling was the weakest point across models, because combining OCR with spatial placement is where sloppy agents fail fastest.
– Composer 2.0 beating Kimi 2.5 suggests that fine-tuning and instruction-following matter as much as base-model scale in agentic document work.
– This is a benchmark of a workflow, not of a finished product: the app and its skill stack matter as much as the model choice.
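
To make the "OCR plus spatial placement" step concrete, here is a minimal sketch of how detected PII might be mapped back to page coordinates. It is not the Document Redaction App's implementation: the `Word` type, the two regex patterns, the `find_redactions` helper, and the `needs_review` flag are all hypothetical, and the sketch assumes the OCR words sit on a single text line.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR token with its bounding box (hypothetical structure)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


# Illustrative patterns only; a real PII detector would cover far more cases.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def find_redactions(words, pad=1.0):
    """Return one padded box per PII match, each flagged for human review."""
    # Join the words with single spaces and remember each word's character span.
    spans, pos = [], 0
    for w in words:
        spans.append((pos, pos + len(w.text)))
        pos += len(w.text) + 1
    text = " ".join(w.text for w in words)

    boxes = []
    for label, pattern in PII_PATTERNS.items():
        for m in pattern.finditer(text):
            # Keep every word whose character span overlaps the match.
            hit = [w for w, (s, e) in zip(words, spans)
                   if s < m.end() and e > m.start()]
            boxes.append({
                "label": label,
                "box": (min(w.x0 for w in hit) - pad,
                        min(w.y0 for w in hit) - pad,
                        max(w.x1 for w in hit) + pad,
                        max(w.y1 for w in hit) + pad),
                "needs_review": True,  # assumption: nothing is auto-approved
            })
    return boxes
```

Every box is marked for review, which mirrors the point above that output quality is not yet consistent enough for unsupervised use. Signatures would not be caught by text patterns like these at all, which is one plausible reason that step is fragile.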

// TAGS

document-redaction-app · agent · multimodal · open-source · self-hosted · automation · llm

DISCOVERED

4h ago

2026-04-27

PUBLISHED

7h ago

2026-04-27

RELEVANCE

8/10

AUTHOR

Sonnyjimmy