Document Redaction App benchmarks local agents
The Document Redaction App team benchmarked agent workflows on a seven-page redaction-and-review task using OCR and PII detection, comparing Sonnet 4.6, Composer 2.0, Qwen 3.6, and Kimi 2.5. The key result: the workflow is automatable end to end, but output quality still varies too much for unsupervised use.
The real takeaway is not that local agents are “good enough” yet, but that the entire redaction workflow is now machine-executable on consumer hardware. That is a meaningful threshold, even if human review remains non-negotiable.
- Sonnet 4.6 was the most reliable, which matches the pattern that redaction is less about raw intelligence than disciplined tool use and visual accuracy
- Qwen 3.6 completing the workflow locally on 24 GB VRAM is the important systems signal: private redaction pipelines are becoming practical, even if output quality is still rough
- Signature handling exposed the weakest point across models, because OCR plus spatial placement is where sloppy agents fail fastest
- Composer 2.0 beating Kimi 2.5 shows that fine-tuning and instruction-following matter as much as base-model scale in agentic document work
- This is a benchmark for a workflow, not a finished product: the app plus skill stack matters as much as the model choice
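
To make the "OCR plus spatial placement" point concrete, here is a minimal sketch of the step where detected PII becomes redaction rectangles. It is not the Document Redaction App's implementation: the `Word` type, the two regex patterns, and the `redaction_boxes` helper are all hypothetical, and the OCR output is assumed to arrive as one bounding box per word.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One OCR token with its pixel bounding box (hypothetical shape)."""
    text: str
    x: int
    y: int
    w: int
    h: int


# Deliberately simple illustrative patterns, not production PII detection.
EMAIL = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")
PHONE = re.compile(r"^\+?\d[\d\-]{6,}\d$")


def redaction_boxes(words, pad=2):
    """Return padded (left, top, right, bottom) boxes for words that look like PII."""
    boxes = []
    for word in words:
        if EMAIL.match(word.text) or PHONE.match(word.text):
            boxes.append((
                max(word.x - pad, 0),       # clamp so padding never leaves the page
                max(word.y - pad, 0),
                word.x + word.w + pad,
                word.y + word.h + pad,
            ))
    return boxes


words = [
    Word("Contact", 10, 10, 60, 12),
    Word("jane@example.com", 80, 10, 120, 12),
    Word("555-123-4567", 10, 30, 90, 12),
]
print(redaction_boxes(words))
```

A text-pattern pass like this only finds what OCR can read as text, so a handwritten signature would produce no match at all. That is one plausible reason signature handling needs a separate visual step and a human check.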
DISCOVERED: 2026-04-27
PUBLISHED: 2026-04-27
AUTHOR: Sonnyjimmy