ThermoQA benchmark flips thermodynamics rankings

// 131d ago BENCHMARK RESULT

ThermoQA is an open benchmark with 293 engineering thermodynamics problems spanning property lookups, component analysis, and full cycle analysis. The first leaderboard puts Claude Opus 4.6 narrowly ahead overall, but the bigger story is how much the rankings shift by tier, fluid, and model stability.

// ANALYSIS

This looks like a genuinely useful eval, not just another leaderboard screenshot: it tests exact numerical reasoning in a domain where small mistakes compound fast.

– The three-tier design is the right move, because it separates rote steam-table recall from multi-step cycle reasoning.
– Supercritical water and R-134a appear to be the hardest regimes, which suggests current models are brittle around nonlinear property regions and weaker training coverage.
– The run-to-run variance is as important as the mean score; a model that swings a lot is harder to trust in workflows that need repeatable answers.
– Open dataset plus code makes ThermoQA easy to reproduce, extend, and use as a targeted stress test for reasoning agents.
– The benchmark also hints that “knows thermodynamics” and “can solve thermodynamics” are not the same capability.
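
The point about run-to-run variance can be made concrete with a minimal sketch. The model names and scores below are hypothetical placeholders chosen for illustration, not ThermoQA results, and the benchmark's own scoring code may aggregate runs differently. The sketch only shows that two models can share a mean score while differing sharply in spread.

```python
from statistics import mean, stdev

# Hypothetical per-run accuracy scores (fraction of problems solved).
# Placeholder values for illustration only, NOT ThermoQA results.
runs = {
    "model_a": [0.90, 0.91, 0.89],
    "model_b": [0.95, 0.80, 0.95],
}


def summarize(scores):
    """Return (mean, sample standard deviation) for one model's runs."""
    return mean(scores), stdev(scores)


summary = {name: summarize(scores) for name, scores in runs.items()}

for name, (avg, spread) in summary.items():
    print(f"{name}: mean={avg:.3f} stdev={spread:.3f}")
```

Both hypothetical models average the same score, but the second swings far more between runs, which is the case a leaderboard that reports only the mean would hide.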

// TAGS

thermoqa, benchmark, llm, reasoning, research, open-source, data-tools

DISCOVERED

131d ago

2026-03-21

PUBLISHED

131d ago

2026-03-21

RELEVANCE

9/10

AUTHOR

olivenet-io
