OPEN_SOURCE
PH · PRODUCT_HUNT // BENCHMARK RESULT
Sup AI tops Humanity's Last Exam
Sup AI is a multi-model AI ensemble that says it reached 52.15% on Humanity's Last Exam by running 337 models in parallel and scoring confidence at the chunk level. The company frames it as a hallucination-resistant assistant for research, search, and high-stakes answers.
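Sup AI has not published its orchestration code, so the following is a minimal Python sketch of what "parallel fan-out plus chunk-level confidence scoring" could look like. The model names, the canned ask() stub, and the 0.6 agreement threshold are all assumptions for illustration, not Sup AI internals.

// SKETCH: chunk-level confidence voting (hypothetical)
import asyncio
from collections import Counter

async def ask(model: str, question: str) -> list[str]:
    # Stand-in for a real model call; returns an answer split into chunks.
    canned = {
        "model-a": ["Paris is the capital of France.", "It lies on the Seine."],
        "model-b": ["Paris is the capital of France.", "It lies on the Loire."],
        "model-c": ["Paris is the capital of France.", "It lies on the Seine."],
    }
    return canned[model]

async def ensemble_answer(question: str, models: list[str], threshold: float = 0.6):
    # Fan out to every model in parallel, as the 337-model claim implies.
    answers = await asyncio.gather(*(ask(m, question) for m in models))
    # Score each chunk by how many models independently produced it.
    counts = Counter(chunk for chunks in answers for chunk in chunks)
    n = len(models)
    # Keep only chunks whose cross-model agreement clears the threshold.
    return [(c, counts[c] / n) for c in counts if counts[c] / n >= threshold]

if __name__ == "__main__":
    models = ["model-a", "model-b", "model-c"]
    print(asyncio.run(ensemble_answer("capital of France?", models)))
    # -> keeps the unanimous chunk (1.0) and the 2-of-3 chunk (~0.67); drops the outlier

The point of the sketch is the shape of the system: accuracy comes from agreement across independent answers, not from any single model being right.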
// ANALYSIS
This is a strong benchmark signal, but it is also a product-positioning move: Sup AI is selling orchestration quality, not a single magic model. The result is interesting because it leans on ensemble diversity and confidence filtering, which is a more defensible story than “our model is smarter.”
- The headline number matters: 52.15% on HLE is positioned as 7.41 points ahead of the next best model in its setup.
- The benchmark run used web search and custom prompts, so it is not a clean apples-to-apples comparison with raw model scores.
- The product itself looks closer to an accuracy-first research assistant than a general chatbot, with source transparency, file search, and context compaction as core features.
- If the claim holds up outside the benchmark, the real moat is routing and verification logic, not model ownership (a gating sketch follows this list).
- The risk is obvious: ensemble systems can look great on curated evals while still being hard to trust on messy real-world workflows.
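To make the "verification logic as moat" point concrete, here is a hypothetical confidence gate that abstains rather than guesses. It builds on the voting sketch above; the floor value and abstention behavior are assumptions, not documented Sup AI features.

// SKETCH: confidence gate (hypothetical)
def gated_answer(scored_chunks: list[tuple[str, float]], floor: float = 0.5) -> str:
    # Abstain rather than guess when any surviving chunk is weakly supported.
    # scored_chunks: (chunk, agreement) pairs, e.g. from the voting sketch above.
    if not scored_chunks or min(conf for _, conf in scored_chunks) < floor:
        return "ABSTAIN: cross-model agreement below floor."
    return " ".join(chunk for chunk, _ in scored_chunks)

print(gated_answer([("Paris is the capital of France.", 1.0)]))  # answers
print(gated_answer([("It lies on the Loire.", 0.33)]))           # abstains

On messy real-world queries, more chunks fail the vote and the gate abstains more often, which is exactly the behavior a curated benchmark may under-measure.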
// TAGS
llm · reasoning · search · agent · benchmark · sup-ai
DISCOVERED
2026-04-07
PUBLISHED
2026-04-07
RELEVANCE
9/10
AUTHOR
[REDACTED]