OPEN_SOURCE
PH · PRODUCT_HUNT // BENCHMARK RESULT
Sup AI tops Humanity's Last Exam
Sup AI is a multi-model AI ensemble that says it reached 52.15% on Humanity's Last Exam by running 337 models in parallel and scoring confidence at the chunk level. The company frames it as a hallucination-resistant assistant for research, search, and high-stakes answers.
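Sup AI has not published its orchestration code, so the following is a minimal Python sketch of what "parallel fan-out plus chunk-level confidence scoring" could look like. The model names, the canned ask() stub, and the 0.6 agreement threshold are all assumptions for illustration, not Sup AI internals.

// SKETCH: chunk-level confidence voting (hypothetical)
import asyncio
from collections import Counter

async def ask(model: str, question: str) -> list[str]:
    # Stand-in for a real model call; returns an answer split into chunks.
    canned = {
        "model-a": ["Paris is the capital of France.", "It lies on the Seine."],
        "model-b": ["Paris is the capital of France.", "It lies on the Loire."],
        "model-c": ["Paris is the capital of France.", "It lies on the Seine."],
    }
    return canned[model]

async def ensemble_answer(question: str, models: list[str], threshold: float = 0.6):
    # Fan out to every model in parallel, as the 337-model claim implies.
    answers = await asyncio.gather(*(ask(m, question) for m in models))
    # Score each chunk by how many models independently produced it.
    counts = Counter(chunk for chunks in answers for chunk in chunks)
    n = len(models)
    # Keep only chunks whose cross-model agreement clears the threshold.
    return [(c, counts[c] / n) for c in counts if counts[c] / n >= threshold]

if __name__ == "__main__":
    models = ["model-a", "model-b", "model-c"]
    print(asyncio.run(ensemble_answer("capital of France?", models)))
    # -> keeps the unanimous chunk (1.0) and the 2-of-3 chunk (~0.67); drops the outlier

The point of the sketch is the shape of the system: accuracy comes from agreement across independent answers, not from any single model being right.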
// ANALYSIS
This is a strong benchmark signal, but it is also a product-positioning move: Sup AI is selling orchestration quality, not a single magic model. The result is interesting because it leans on ensemble diversity and confidence filtering, which is a more defensible story than “our model is smarter.”
- The headline number matters: 52.15% on HLE is positioned as 7.41 points ahead of the next best model in its setup.
- The benchmark run used web search and custom prompts, so it is not a clean apples-to-apples comparison with raw model scores.
- The product itself looks closer to an accuracy-first research assistant than a general chatbot, with source transparency, file search, and context compaction as core features.
- If the claim holds up outside the benchmark, the real moat is routing and verification logic, not model ownership (a gating sketch follows this list).
- The risk is obvious: ensemble systems can look great on curated evals while still being hard to trust on messy real-world workflows.
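To make the "verification logic as moat" point concrete, here is a hypothetical confidence gate that abstains rather than guesses. It builds on the voting sketch above; the floor value and abstention behavior are assumptions, not documented Sup AI features.

// SKETCH: confidence gate (hypothetical)
def gated_answer(scored_chunks: list[tuple[str, float]], floor: float = 0.5) -> str:
    # Abstain rather than guess when any surviving chunk is weakly supported.
    # scored_chunks: (chunk, agreement) pairs, e.g. from the voting sketch above.
    if not scored_chunks or min(conf for _, conf in scored_chunks) < floor:
        return "ABSTAIN: cross-model agreement below floor."
    return " ".join(chunk for chunk, _ in scored_chunks)

print(gated_answer([("Paris is the capital of France.", 1.0)]))  # answers
print(gated_answer([("It lies on the Loire.", 0.33)]))           # abstains

On messy real-world queries, more chunks fail the vote and the gate abstains more often, which is exactly the behavior a curated benchmark may under-measure.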
// TAGS
llm · reasoning · search · agent · benchmark · sup-ai
DISCOVERED
2026-04-07
PUBLISHED
2026-04-07
RELEVANCE
9/10
AUTHOR
[REDACTED]