Trial by Combat turns LLM benchmarking into duels

// 97d ago BENCHMARK RESULT

Trial by Combat is an open-source, turn-based 1v1 strategy game that lets two LLM agents face off on a 9x9 grid. It uses deterministic replays, hidden information, and spectator/admin views to make model-vs-model benchmarking easier to watch and compare.
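To make the hidden-information idea concrete, here is a minimal sketch of how a per-player view might differ from a spectator/admin view on a 9x9 grid. Everything in it is an assumption for illustration: the `Unit` type, the `visible_to` and `admin_view` helpers, and the `sight` radius are hypothetical names, not taken from the Trial by Combat repo, and the real game's visibility rules may work differently.

```python
from dataclasses import dataclass

GRID_SIZE = 9  # the write-up states a 9x9 grid


@dataclass(frozen=True)
class Unit:
    owner: str  # hypothetical player id, e.g. "a" or "b"
    x: int
    y: int


def visible_to(units, viewer, sight=2):
    """Return what `viewer` may see: its own units, plus enemy units
    within `sight` squares (Chebyshev distance) of any of its own units.
    The sight radius is an assumed rule, not the game's actual one."""
    own = [u for u in units if u.owner == viewer]
    seen = list(own)
    for u in units:
        if u.owner == viewer:
            continue
        if any(max(abs(u.x - o.x), abs(u.y - o.y)) <= sight for o in own):
            seen.append(u)
    return seen


def admin_view(units):
    """Spectator/admin view: the full board with nothing hidden."""
    return list(units)


units = [Unit("a", 0, 0), Unit("b", 1, 1), Unit("b", 8, 8)]
player_a_sees = visible_to(units, "a")  # own unit and the nearby enemy only
everyone = admin_view(units)            # all three units
```

The point of the split is that each agent is prompted only with its filtered view, while a spectator or admin can watch the complete state.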

// ANALYSIS

Hot take: this is less a classic benchmark and more an agent-performance stress test, which is exactly why it’s interesting.

– The open-source setup makes the comparison reproducible, which is stronger than a one-off demo clip.
– The match outcome suggests speed under low-reasoning settings can matter as much as raw model quality in turn-based agent tasks.
– Hidden information, traps, and simultaneous turns are a good fit for evaluating planning, not just text generation.
– The curl-native API lowers friction for running arbitrary model-vs-model duels, which is a neat systems design choice.
– If the repo keeps matches deterministic and replays exact, it could become a useful sandbox for agent benchmarking and prompt iteration.
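For readers wondering what "deterministic and exact" replays require in practice, the sketch below shows one common approach: derive all randomness from a seed, resolve simultaneous moves from the same pre-turn state, and fingerprint the result so that a seed plus an action log always reproduces the same match. This is an illustration under assumptions, not the repo's implementation. The names `new_match`, `step`, and `replay`, the move set, the three-trap setup, and the collision rule are all hypothetical, and traps are only placed here, never triggered.

```python
import hashlib
import json
import random

GRID_SIZE = 9  # the write-up states a 9x9 grid
MOVES = {"N": (0, -1), "S": (0, 1), "E": (1, 0), "W": (-1, 0), "STAY": (0, 0)}


def new_match(seed):
    """Build the initial state; all randomness comes from the seed."""
    rng = random.Random(seed)
    pos = {"a": (0, 0), "b": (GRID_SIZE - 1, GRID_SIZE - 1)}
    free = [(x, y) for x in range(GRID_SIZE) for y in range(GRID_SIZE)
            if (x, y) not in pos.values()]
    traps = sorted(rng.sample(free, 3))  # assumed: three hidden traps
    return {"pos": pos, "traps": traps}


def step(state, actions):
    """Resolve one simultaneous turn from the same pre-turn state."""
    proposed = {}
    for player, (x, y) in state["pos"].items():
        dx, dy = MOVES[actions[player]]
        nx = min(max(x + dx, 0), GRID_SIZE - 1)  # clamp to the board
        ny = min(max(y + dy, 0), GRID_SIZE - 1)
        proposed[player] = (nx, ny)
    if proposed["a"] == proposed["b"]:
        proposed = dict(state["pos"])  # assumed rule: collisions cancel both moves
    return {"pos": proposed, "traps": state["traps"]}


def replay(seed, action_log):
    """Re-run a match from its seed and action log; return state and digest."""
    state = new_match(seed)
    for actions in action_log:
        state = step(state, actions)
    blob = json.dumps(state, sort_keys=True).encode()
    return state, hashlib.sha256(blob).hexdigest()


log = [{"a": "E", "b": "W"}, {"a": "S", "b": "N"}]
state_1, digest_1 = replay(7, log)
state_2, digest_2 = replay(7, log)  # same seed and log, same digest
```

Comparing digests is a cheap way to confirm that a stored replay still matches the engine after a code change, which matters if results are meant to be compared across models over time.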

// TAGS

llm, benchmark, open-source, agents, turn-based-strategy, hidden-information, curl, gpt-5.5, opus-4.7

DISCOVERED

97d ago

2026-05-01

PUBLISHED

97d ago

2026-05-01

RELEVANCE

8/10

AUTHOR

kunchenguid
