OPEN_SOURCE
REDDIT // 12d ago · OPEN-SOURCE RELEASE
Rubric launches trace-aware agent evals
Rubric is a zero-dependency Python framework for evaluating LLM outputs and agent runs locally, with pytest integration and local HTML reports. It works with any callable judge, including Ollama, so you can grade traces without cloud APIs or vendor lock-in.
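The "any callable judge" idea can be sketched in plain Python. This is a minimal illustration of the pattern, not Rubric's actual API: `simple_judge` is a hypothetical keyword-based stand-in where a real setup would prompt a local model (e.g. via Ollama), and `assert_passes` shows how a judge score can back a pytest assertion.

```python
from typing import Callable

# A judge is any callable mapping (output, criterion) -> score in [0, 1].
Judge = Callable[[str, str], float]

def simple_judge(output: str, criterion: str) -> float:
    """Toy keyword judge; a real judge would ask a local LLM to grade."""
    return 1.0 if "refund" in output.lower() else 0.0

def assert_passes(output: str, criterion: str,
                  judge: Judge, threshold: float = 0.7) -> None:
    """pytest-friendly check: fail the test when the judge score is too low."""
    score = judge(output, criterion)
    assert score >= threshold, f"score {score} < {threshold} for: {criterion}"

# Usage inside an ordinary pytest test function:
def test_refund_answer():
    assert_passes("We issued a full refund.",
                  "Addresses the refund request", simple_judge)
```

Because the judge is just a callable, swapping the toy heuristic for a local-model call changes one function, not the test suite.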
// ANALYSIS
This is the missing QA layer for agent apps: it evaluates traces, tool order, and unsafe arguments instead of trusting the model’s final answer. Rubric feels especially practical because it runs locally, plugs into pytest, and can use Ollama as the judge.
- ToolCallAccuracy and ToolCallEfficiency cover the failures most teams miss: missing tools, forbidden tools, redundant calls, wrong order, and slow or failed invocations.
- TraceQuality and ReasoningQuality surface looping agents and dead-end plans, which is exactly the spin-out behavior that looks busy but produces no progress.
- SafetyCompliance is the sleeper feature: scanning tool arguments for PII and dangerous SQL is how you catch agent mistakes before they become incidents.
- It's explicitly positioned as a neutral, MIT-licensed alternative to cloud-heavy eval stacks, with imports from LangFuse and LangSmith already baked in (https://www.reddit.com/r/LocalLLaMA/comments/1s82pwu/local_agents_lie_about_what_tools_they_called/; https://rubriceval.vercel.app/; https://github.com/Kareem-Rashed/rubric-eval).
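The kinds of checks those metrics describe are simple to picture. Below is a hedged sketch, not Rubric's implementation: it assumes a hypothetical trace format (a list of `{"tool": ..., "args": ...}` steps) and uses deliberately naive regexes to flag PII-looking values and destructive SQL in tool arguments, alongside order and forbidden-tool checks.

```python
import re

FORBIDDEN_TOOLS = {"delete_db"}                 # assumed policy, for illustration
EXPECTED_ORDER = ["search_db", "send_email"]    # assumed expected tool sequence
PII_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")           # naive email matcher
SQL_PATTERN = re.compile(r"\b(DROP|DELETE|TRUNCATE)\b", re.I)  # destructive SQL verbs

def check_trace(trace: list[dict]) -> list[str]:
    """Return issue strings for tool order, forbidden tools, and unsafe args."""
    issues = []
    called = [step["tool"] for step in trace]
    # Order check: expected tools must appear in the expected relative order.
    if [t for t in called if t in EXPECTED_ORDER] != EXPECTED_ORDER:
        issues.append("tools called out of expected order")
    for step in trace:
        if step["tool"] in FORBIDDEN_TOOLS:
            issues.append(f"forbidden tool: {step['tool']}")
        for value in map(str, step["args"].values()):
            if PII_PATTERN.search(value):
                issues.append(f"possible PII in args of {step['tool']}")
            if SQL_PATTERN.search(value):
                issues.append(f"dangerous SQL in args of {step['tool']}")
    return issues

trace = [
    {"tool": "search_db", "args": {"query": "SELECT * FROM users"}},
    {"tool": "send_email", "args": {"to": "alice@example.com"}},
]
```

Running `check_trace(trace)` flags the email address as possible PII while the read-only SQL passes, which is exactly the "grade the trace, not the final answer" posture the analysis describes.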
// TAGS
rubric · agent-testing · llm-safety · devtool · open-source
DISCOVERED
2026-03-30
PUBLISHED
2026-03-30
RELEVANCE
8/10
AUTHOR
MundaneAlternative47