OPEN_SOURCE
REDDIT // 12d ago · OPEN-SOURCE RELEASE
Rubric launches trace-aware agent evals
Rubric is a zero-dependency Python framework for evaluating LLM outputs and agent runs locally, with pytest integration and local HTML reports. It works with any callable judge, including Ollama, so you can grade traces without cloud APIs or vendor lock-in.
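The "any callable judge" idea can be sketched in plain Python. This is a minimal illustration of the pattern, not Rubric's actual API: `simple_judge` is a hypothetical keyword-based stand-in where a real setup would prompt a local model (e.g. via Ollama), and `assert_passes` shows how a judge score can back a pytest assertion.

```python
from typing import Callable

# A judge is any callable mapping (output, criterion) -> score in [0, 1].
Judge = Callable[[str, str], float]

def simple_judge(output: str, criterion: str) -> float:
    """Toy keyword judge; a real judge would ask a local LLM to grade."""
    return 1.0 if "refund" in output.lower() else 0.0

def assert_passes(output: str, criterion: str,
                  judge: Judge, threshold: float = 0.7) -> None:
    """pytest-friendly check: fail the test when the judge score is too low."""
    score = judge(output, criterion)
    assert score >= threshold, f"score {score} < {threshold} for: {criterion}"

# Usage inside an ordinary pytest test function:
def test_refund_answer():
    assert_passes("We issued a full refund.",
                  "Addresses the refund request", simple_judge)
```

Because the judge is just a callable, swapping the toy heuristic for a local-model call changes one function, not the test suite.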
// ANALYSIS
This is the missing QA layer for agent apps: it evaluates traces, tool order, and unsafe arguments instead of trusting the model’s final answer. Rubric feels especially practical because it runs locally, plugs into pytest, and can use Ollama as the judge.
- ToolCallAccuracy and ToolCallEfficiency cover the failures most teams miss: missing tools, forbidden tools, redundant calls, wrong order, and slow or failed invocations.
- TraceQuality and ReasoningQuality surface looping agents and dead-end plans, which is exactly the spin-out behavior that looks busy but produces no progress.
- SafetyCompliance is the sleeper feature: scanning tool arguments for PII and dangerous SQL is how you catch agent mistakes before they become incidents.
- It's explicitly positioned as a neutral, MIT-licensed alternative to cloud-heavy eval stacks, with imports from LangFuse and LangSmith already baked in (https://www.reddit.com/r/LocalLLaMA/comments/1s82pwu/local_agents_lie_about_what_tools_they_called/; https://rubriceval.vercel.app/; https://github.com/Kareem-Rashed/rubric-eval).
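The kinds of checks those metrics describe are simple to picture. Below is a hedged sketch, not Rubric's implementation: it assumes a hypothetical trace format (a list of `{"tool": ..., "args": ...}` steps) and uses deliberately naive regexes to flag PII-looking values and destructive SQL in tool arguments, alongside order and forbidden-tool checks.

```python
import re

FORBIDDEN_TOOLS = {"delete_db"}                 # assumed policy, for illustration
EXPECTED_ORDER = ["search_db", "send_email"]    # assumed expected tool sequence
PII_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")           # naive email matcher
SQL_PATTERN = re.compile(r"\b(DROP|DELETE|TRUNCATE)\b", re.I)  # destructive SQL verbs

def check_trace(trace: list[dict]) -> list[str]:
    """Return issue strings for tool order, forbidden tools, and unsafe args."""
    issues = []
    called = [step["tool"] for step in trace]
    # Order check: expected tools must appear in the expected relative order.
    if [t for t in called if t in EXPECTED_ORDER] != EXPECTED_ORDER:
        issues.append("tools called out of expected order")
    for step in trace:
        if step["tool"] in FORBIDDEN_TOOLS:
            issues.append(f"forbidden tool: {step['tool']}")
        for value in map(str, step["args"].values()):
            if PII_PATTERN.search(value):
                issues.append(f"possible PII in args of {step['tool']}")
            if SQL_PATTERN.search(value):
                issues.append(f"dangerous SQL in args of {step['tool']}")
    return issues

trace = [
    {"tool": "search_db", "args": {"query": "SELECT * FROM users"}},
    {"tool": "send_email", "args": {"to": "alice@example.com"}},
]
```

Running `check_trace(trace)` flags the email address as possible PII while the read-only SQL passes, which is exactly the "grade the trace, not the final answer" posture the analysis describes.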
// TAGS
rubric · agent-testing · llm-safety · devtool · open-source
DISCOVERED
2026-03-30
PUBLISHED
2026-03-30
RELEVANCE
8/10
AUTHOR
MundaneAlternative47