LaBSE Tops Armenian Retrieval, OpenAI Wins EN-RU
OPEN_SOURCE ↗
REDDIT // 4h ago · BENCHMARK RESULT

A benchmark of 19 embedding runs across 18 checkpoints, on 245 trilingual EPG/title triplets and 783 abbreviation pairs, found that LaBSE beat paid APIs on Armenian cross-lingual retrieval. OpenAI text-embedding-3-large led EN↔RU but dropped sharply on Armenian, suggesting that retrieval metrics, not cosine alignment, are the numbers to trust.

// ANALYSIS

The blunt takeaway: for low-resource, non-Latin scripts, the “best” embedding model is the one trained for retrieval on multilingual parallel data, not the newest or most expensive API.

  • LaBSE ranked #1 on retrieval with R@1 0.834 and MRR 0.864, ahead of all paid APIs in the benchmark.
  • OpenAI `text-embedding-3-large` did best on EN↔RU but fell to R@1 0.210 on EN↔HY and RU↔HY, showing poor transfer to Armenian.
  • The post’s strongest point is the alignment-vs-retrieval split: some models look good on mean cosine yet fail at actual nearest-neighbor selection.
  • `e5-large` and `e5-large-v2` are presented as “monolingual traps” for this use case, with inflated cosine and weak retrieval.
  • Cohere `embed-v4.0` is reported to regress versus `embed-multilingual-v3.0` on this task, which is a useful warning against blind model upgrades.
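The alignment-vs-retrieval split in the bullets can be made concrete: mean cosine over gold pairs only measures how close each source sits to its own translation, while R@1/MRR measure whether that translation beats every other candidate. A minimal sketch with toy vectors (not the benchmark's data; `retrieval_metrics` and the collapsed-space example are illustrative, not from the post):

```python
import numpy as np

def retrieval_metrics(src: np.ndarray, tgt: np.ndarray) -> dict:
    """Score src[i] -> tgt[i] matching by cosine nearest neighbor."""
    sn = src / np.linalg.norm(src, axis=1, keepdims=True)
    tn = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = sn @ tn.T  # full cosine similarity matrix, queries x candidates
    order = np.argsort(-sims, axis=1, kind="stable")  # best candidate first
    # 1-based rank of the gold target for each query
    rank = np.array([int(np.where(order[i] == i)[0][0]) + 1
                     for i in range(len(src))])
    return {
        "mean_cosine": float(np.diag(sims).mean()),  # alignment of gold pairs
        "R@1": float((rank == 1).mean()),
        "MRR": float((1.0 / rank).mean()),
    }

# Healthy space: distinct, well-separated directions -> perfect retrieval.
good = np.eye(4)
print(retrieval_metrics(good, good))  # R@1 = 1.0, MRR = 1.0

# Collapsed space: every "translation" lands near one shared direction.
# Gold-pair cosine stays ~0.997, yet nearest-neighbor search can no longer
# tell the candidates apart.
v = np.full(4, 0.5)
collapsed_src = v + 0.1 * np.eye(4)
collapsed_tgt = np.tile(v, (4, 1))
print(retrieval_metrics(collapsed_src, collapsed_tgt))
```

The collapsed case is the "monolingual trap" pattern in miniature: mean cosine stays above 0.99 while R@1 falls to chance, so a model can look well-aligned on average and still fail at actual nearest-neighbor selection.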
// TAGS
embeddings · multilingual · cross-lingual-retrieval · low-resource-language · armenian · epg · open-source · openai · cohere · sentence-transformers

DISCOVERED

4h ago

2026-04-24

PUBLISHED

7h ago

2026-04-24

RELEVANCE

9 / 10

AUTHOR

FigAltruistic2086