Resurf ships reproducible browser-agent testbed
Resurf is a deterministic, open-source test framework for AI browser agents built around synthetic sites, failure injection, and auditable success checks. It aims to replace flaky live-web evals and judge-only scoring with something teams can actually reproduce.
This is the right kind of boring infrastructure: browser-agent evals need controlled environments more than they need another flashy benchmark.
- `shop_v1` gives a realistic commerce flow with auth, checkout, returns, and ambiguous UI, so agents are tested on multi-step behavior instead of toy pages.
- Failure-mode injection for latency, payment declines, 3DS challenges, 5xx errors, and session expiry is the main differentiator; that is how you measure recovery, not just happy-path navigation.
- DB-state predicates are a cleaner success signal than LLM-based judging, which should make regressions easier to reproduce and debug.
- Support for `browser-use`, `stagehand`, and a vision-only baseline makes it useful for teams already experimenting with browser agents.
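The DB-state predicate idea can be illustrated with a minimal sketch (all names here are hypothetical illustrations, not Resurf's actual API): instead of asking an LLM judge whether a checkout "looked successful," assert directly against the synthetic site's database after the agent run.

```python
# Hypothetical sketch of a DB-state success predicate, using an in-memory
# SQLite database as a stand-in for the synthetic shop's backing store.
import sqlite3

def setup_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the synthetic shop's database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, user TEXT, status TEXT)"
    )
    return conn

def checkout_succeeded(conn: sqlite3.Connection, user: str) -> bool:
    """Success predicate: the user has at least one order in the 'paid' state."""
    row = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE user = ? AND status = 'paid'",
        (user,),
    ).fetchone()
    return row[0] > 0

conn = setup_db()
# Simulate the side effect a correct agent run would leave behind.
conn.execute("INSERT INTO orders (user, status) VALUES ('alice', 'paid')")
print(checkout_succeeded(conn, "alice"))  # True: order row exists in 'paid' state
print(checkout_succeeded(conn, "bob"))    # False: no matching row
```

Because the verdict is a pure function of database state, the same agent trace always scores the same way, which is exactly what makes regressions reproducible in a way judge-only scoring is not.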
Discovered: 2026-05-07 · Published: 2026-05-07 · Author: Visual-Librarian6601