Human review finds errors in clean TranslateGemma subtitles
This follow-up benchmark argues that TranslateGemma’s strong automatic scores may overstate real subtitle quality in certain cases. The authors re-checked 84 translation segments that both MetricX-24 and COMETKiwi had marked as clean, then had professional linguists annotate them with MQM. Even in this high-confidence zone, most segments were flagged by humans, including many accuracy errors that the metrics completely missed. The post is careful about scope, but it raises a real question about whether reference-free QE metrics are too forgiving for subtitle translation.
Strong signal that the dashboard’s clean threshold is not enough to trust without human QA.
- The headline finding is the metric blind spot: 71% of auto-clean segments had some human-found error in this sample.
- Accuracy is the main concern, since all 25 accuracy-class errors landed in the blind quadrant.
- Japanese stands out as the weakest point in the sample despite having the highest mean COMETKiwi score.
- The sample is small and narrow, so this should be read as a warning about calibration, not a universal indictment of the model or metrics.
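The "blind quadrant" the bullets describe is just a cross-tabulation of metric verdicts against human MQM annotations: segments both QE metrics call clean, but that linguists still flag. A minimal sketch of that calculation, assuming a hypothetical per-segment record format (the field names `metricx_clean`, `cometkiwi_clean`, and `mqm_errors` are illustrative, not the post's actual data schema):

```python
def blind_spot_rate(segments):
    """Share of metric-clean segments that humans flagged with any MQM error.

    A segment counts as auto-clean only if BOTH metrics marked it clean,
    mirroring the post's high-confidence zone.
    """
    auto_clean = [s for s in segments if s["metricx_clean"] and s["cometkiwi_clean"]]
    flagged = [s for s in auto_clean if s["mqm_errors"]]
    return len(flagged) / len(auto_clean) if auto_clean else 0.0

# Toy example: 3 of 4 auto-clean segments carry human-found errors.
sample = [
    {"metricx_clean": True, "cometkiwi_clean": True, "mqm_errors": ["accuracy/mistranslation"]},
    {"metricx_clean": True, "cometkiwi_clean": True, "mqm_errors": []},
    {"metricx_clean": True, "cometkiwi_clean": True, "mqm_errors": ["fluency/grammar"]},
    {"metricx_clean": True, "cometkiwi_clean": True, "mqm_errors": ["accuracy/omission"]},
]
print(blind_spot_rate(sample))  # 0.75
```

On the post's sample, this rate is the reported 71% over the 84 auto-clean segments.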
DISCOVERED: 2026-05-12
PUBLISHED: 2026-05-12
AUTHOR: ritis88