EvalMonkey debuts local agent chaos testing
EvalMonkey is a strictly local, open-source framework for benchmarking AI agents against 10 HuggingFace datasets and then stress-testing them with chaos profiles like schema errors, latency spikes, rate limits, context overflow, and prompt injection. It supports custom agent endpoints and BYO model providers including OpenAI, Ollama, Bedrock, Azure, and GCP.
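The announcement does not show EvalMonkey's actual API, so the following is a minimal sketch of what chaos injection around a custom agent endpoint could look like. Every name here (`with_latency_spikes`, `with_schema_errors`, `echo_agent`) is hypothetical and chosen for illustration, not taken from the project.

```python
import random
import time
from typing import Callable

# Hypothetical sketch, not EvalMonkey's API: wrap an agent callable with
# chaos injectors that mimic two of the listed profiles (latency spikes,
# schema errors) so the same eval can be run clean and degraded.

AgentFn = Callable[[str], str]

def with_latency_spikes(agent: AgentFn, p: float = 0.2, delay_s: float = 0.5) -> AgentFn:
    """Randomly delay a fraction p of calls to simulate latency spikes."""
    def wrapped(prompt: str) -> str:
        if random.random() < p:
            time.sleep(delay_s)
        return agent(prompt)
    return wrapped

def with_schema_errors(agent: AgentFn, p: float = 0.1) -> AgentFn:
    """Randomly truncate a fraction p of responses to simulate malformed output."""
    def wrapped(prompt: str) -> str:
        response = agent(prompt)
        if random.random() < p:
            return response[: len(response) // 2] + "{malformed"
        return response
    return wrapped

def echo_agent(prompt: str) -> str:
    """Stand-in for a real agent endpoint (e.g., an HTTP call to a local model)."""
    return f"answer to: {prompt}"

# Stack chaos profiles around the baseline agent, as a resilience run would.
chaotic_agent = with_schema_errors(with_latency_spikes(echo_agent))
print(chaotic_agent("What is 2 + 2?"))
```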
The framework is useful because it measures the failure mode most evals ignore: how much an agent degrades once the happy path stops being clean. The local-only design removes the biggest adoption blocker for teams that do not want to send eval traffic to a third-party service. Pairing a capability score with a resilience score is more actionable than a single benchmark number, especially for tool-using agents; one way to combine the two is sketched below. The chaos profiles map to real production failures, which makes the project relevant to orchestration, middleware, and agent-reliability work. The maintainer-wanted framing suggests the project is early, but the problem it targets is real and underserved.
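As a concrete illustration of the capability/resilience pairing, a resilience score can be expressed as the ratio of the pass rate under chaos to the pass rate on clean runs. This scoring is an assumption for illustration, not EvalMonkey's documented method.

```python
# Hypothetical scoring sketch, assuming pass/fail eval outcomes.

def pass_rate(results: list[bool]) -> float:
    """Fraction of eval cases the agent passed."""
    return sum(results) / len(results) if results else 0.0

def resilience_score(clean: list[bool], chaos: list[bool]) -> float:
    """Degradation-aware score: 1.0 means no drop under chaos."""
    base = pass_rate(clean)
    return pass_rate(chaos) / base if base > 0 else 0.0

# Example: 90% capability on clean runs, 60% under injected faults
# -> resilience ~0.67, which says more than either number alone.
clean_runs = [True] * 9 + [False]
chaos_runs = [True] * 6 + [False] * 4
print(f"capability={pass_rate(clean_runs):.2f}, "
      f"resilience={resilience_score(clean_runs, chaos_runs):.2f}")
```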
Published: 2026-04-17
Discovered: 2026-04-17
Author: Busy_Weather_7064