OPEN_SOURCE
REDDIT // 4h ago · BENCHMARK RESULT
LaBSE Tops Armenian Retrieval, OpenAI Wins EN-RU
A benchmark of 19 embedding runs across 18 checkpoints on 245 trilingual EPG/title triplets and 783 abbreviation pairs found that LaBSE beat paid APIs on Armenian cross-lingual retrieval. OpenAI text-embedding-3-large led EN↔RU but dropped sharply on Armenian, suggesting retrieval metrics matter more than cosine alignment.
// ANALYSIS
The blunt takeaway: for low-resource, non-Latin scripts, the “best” embedding model is the one trained for retrieval on multilingual parallel data, not the newest or most expensive API.
- LaBSE ranked #1 on retrieval with R@1 0.834 and MRR 0.864, ahead of all paid APIs in the benchmark.
- OpenAI `text-embedding-3-large` did best on EN↔RU but fell to R@1 0.210 on EN↔HY and RU↔HY, showing poor transfer to Armenian.
- The post’s strongest point is the alignment-vs-retrieval split: some models look good on mean cosine yet fail at actual nearest-neighbor selection.
- `e5-large` and `e5-large-v2` are presented as “monolingual traps” for this use case, with inflated cosine and weak retrieval.
- Cohere `embed-v4.0` is reported to regress versus `embed-multilingual-v3.0` on this task, which is a useful warning against blind model upgrades.
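The alignment-vs-retrieval distinction is easy to make concrete. Mean cosine scores only the matched pairs, while R@1 and MRR ask whether the matched target actually outranks every other candidate. A minimal sketch (function and variable names are my own, not from the post; the post's exact evaluation code is not shown):

```python
import numpy as np

def retrieval_metrics(src: np.ndarray, tgt: np.ndarray):
    """Given row-aligned source/target embedding matrices (src[i] should
    retrieve tgt[i]), return (R@1, MRR, mean cosine of true pairs)."""
    # L2-normalize rows so a dot product equals cosine similarity
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = src @ tgt.T  # (n, n) cosine-similarity matrix

    # Rank of the true target among all candidates, per query:
    # 1 + number of candidates scoring strictly higher than the true pair
    ranks = (sims > sims.diagonal()[:, None]).sum(axis=1) + 1

    r_at_1 = float((ranks == 1).mean())          # retrieval accuracy
    mrr = float((1.0 / ranks).mean())            # mean reciprocal rank
    mean_cos = float(sims.diagonal().mean())     # "alignment" score
    return r_at_1, mrr, mean_cos
```

A model can report a high `mean_cos` while `r_at_1` collapses: if all embeddings crowd into a narrow cone (common for monolingual models fed an unfamiliar script), every true pair has a high cosine, but so do the distractors, and nearest-neighbor selection fails. That is why the post argues retrieval metrics, not cosine alignment, should drive model choice here.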
// TAGS
embeddings · multilingual · cross-lingual-retrieval · low-resource-language · armenian · epg · open-source · openai · cohere · sentence-transformers
DISCOVERED
4h ago
2026-04-24
PUBLISHED
7h ago
2026-04-24
RELEVANCE
9/10
AUTHOR
FigAltruistic2086