AgentOps Forensics · Production self-check
Five failure scenarios. Choose what actually happens in your setup. At the end you get the one gap most likely to bite first, what it breaks, and a fix you can run this week.
Why this exists
Nous Research's Hermes Agent framework launched in February 2026 and crossed 95,000 GitHub stars in its first eight weeks. Adoption is fast. The production playbook is not written yet.
Most teams stall at the same spot. Demo and staging behave; production surfaces surprises, not because Hermes Agent is buggy, but because a long-running, stateful, autonomous control loop fails in a handful of specific ways, and the controls that catch them get skipped under launch pressure.
This check does not grade you against a benchmark we made up. It asks how five things really work in your deployment today, then names the one most likely to cost you first. No email to see the result. If your answers suggest an outside read would pay for itself, the result points you there. If not, you leave with a fix list.
The taxonomy
These five scenarios are the AgentOps Forensics checks in field form. Each scenario maps to a named check: cost runaway to useful-run cost, tool and action blast radius to tool safety, observability and debugging to failure trail, and recovery and silent state loss to state integrity. Data and privacy exposure keeps its name. Migration is the sixth check, scored on its own. The free self-check covers the five; a paid Verdict applies all six to your numbers.
Cost runaway: Spend discovered on the invoice instead of traced per run. A runaway loop or an uncapped tenant turns a normal week into a five-figure surprise.
Tool and action blast radius: A wrong tool, or a right tool with a wrong argument, writing to production before a human ever sees it.
Data and privacy exposure: A prompt, tool input, or log line carrying PII or a live key into a place nobody audited.
Observability and debugging: A failed run nobody can reconstruct, so the fix is a guess and the same failure returns next week.
Recovery and silent state loss: A quiet memory or checkpoint fallback after a restart. The agent gets subtly dumber for days before anyone connects the dots.
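One way to turn a quiet checkpoint fallback into a loud failure is to verify a checkpoint before loading it and to refuse to continue when verification fails. The sketch below is a minimal illustration in plain Python, not part of Hermes Agent. The file layout, the `save_checkpoint` and `load_checkpoint` helpers, and the `CheckpointError` type are all hypothetical names introduced here.

```python
import hashlib
import json
from pathlib import Path


class CheckpointError(RuntimeError):
    """Raised when a checkpoint is missing or fails its integrity check."""


def save_checkpoint(path: Path, state: dict) -> None:
    # Store the state next to a SHA-256 digest of its canonical JSON form.
    payload = json.dumps(state, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    path.write_text(json.dumps({"sha256": digest, "payload": payload}))


def load_checkpoint(path: Path) -> dict:
    # Fail loudly instead of quietly restarting from empty memory.
    if not path.exists():
        raise CheckpointError(f"checkpoint missing: {path}")
    try:
        record = json.loads(path.read_text())
        payload = record["payload"]
        expected = record["sha256"]
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        raise CheckpointError(f"checkpoint unreadable: {path}") from exc
    actual = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    if actual != expected:
        raise CheckpointError(f"checkpoint digest mismatch: {path}")
    return json.loads(payload)
```

The design choice is that a restart either resumes from verified state or stops with an error someone will see. If a fallback to empty state is acceptable in your setup, catch `CheckpointError` explicitly and log it, so the fallback shows up in the failure trail.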
The instrument
Answer how things really work today. Your responses never leave this browser tab.
Method
A fifty-item checklist turns into a research project nobody finishes, and most of its rows correlate anyway. Five scenarios, each tied to a way production agents actually fall over, separate a real exposure from a tidy to-do list without making you book a half-day.
The five come from a decade of production incident response on stateful distributed systems, not from Hermes Agent specifically. Hermes Agent is the new surface area; the failure modes transfer. Teams that have two of the five handled tend to survive a pilot. Teams that have all five tend to ship without 3 AM incidents.
Provenance
The scenarios are not abstract. Each maps to at least one class of publicly reported failure or near-failure this year.
The recovery scenario was framed after Replit's Day-9 incident, where an AI coding agent deleted a production database during a code freeze and recovery was disputed rather than handled by an automatic rollback. The blast-radius scenario tracks agents making payment or data-deletion calls without a review queue. The cost scenario reflects postmortems where spend controls surfaced only after material cost had accrued. The observability scenario reflects the common case where a silent fallback takes down an agent platform and nobody can reconstruct why.
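The review queue mentioned for the blast-radius scenario can be sketched as a gate in front of tool dispatch. This is an illustration under stated assumptions, not Hermes Agent's API: the `IRREVERSIBLE` category names, the `PendingCall` and `ReviewQueue` types, and the `dispatch` function are hypothetical, and a real queue would persist to durable storage rather than a list in memory.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical categories treated as irreversible; adjust to your own tools.
IRREVERSIBLE = {"payment", "data_deletion"}


@dataclass
class PendingCall:
    tool: str
    category: str
    args: dict[str, Any]


@dataclass
class ReviewQueue:
    pending: list[PendingCall] = field(default_factory=list)

    def submit(self, call: PendingCall) -> None:
        # In-memory only; a real queue would write to durable storage.
        self.pending.append(call)


def dispatch(
    call: PendingCall,
    tools: dict[str, Callable[..., Any]],
    queue: ReviewQueue,
) -> dict[str, Any]:
    # Irreversible categories are parked for a human; nothing executes yet.
    if call.category in IRREVERSIBLE:
        queue.submit(call)
        return {"status": "queued_for_review", "tool": call.tool}
    # Everything else runs immediately.
    return {"status": "executed", "result": tools[call.tool](**call.args)}
```

The gate keys on the category of the action rather than on the tool name, so a new tool that issues payments is covered as soon as it is labeled, without touching the dispatch code.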
Framing here is deliberately neutral. The check is about which controls are present, not which vendor is at fault. Production safety is framework-independent; the check uses Nous Research's Hermes Agent as the surface because that is where most readers of this tool are working right now.
Reference
Yes, the scenarios are grounded in real incidents. Each maps to a publicly reported failure class: Replit's Day-9 production database deletion during a code freeze (recovery was disputed rather than handled by an automatic rollback), a reported Kiro-involved AWS incident that Amazon attributes to misconfigured access controls rather than AI autonomy, agents acting on irreversible categories without review, and postmortems where spend controls surfaced only after material cost had accrued. The check measures which controls are present, not which vendor is at fault.
A fifty-item checklist becomes a research project nobody finishes, and most of its rows correlate. Five scenarios cover the failure modes that actually take production agents down. You answer how things really work today, not yes or no.
The check takes about five minutes. There is one scenario per screen with four honest options each, plus one question about migration. You see your result before any email is asked for.
Your answers never leave your browser. Scoring runs on your device. The consultant link carries the derived result, not your individual answers: the matched offer route, your dominant gap, exposure level, and migration status, plus a SHA-256 hash of a random session id for funnel attribution.
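The attribution scheme described above can be illustrated with a short sketch. The real tool runs in the browser, so this Python version only shows the shape of the idea. The parameter names (`route`, `gap`, `exposure`, `migration`, `sid`), the example values, and the `example.com` URL are hypothetical and are not taken from the tool itself.

```python
import hashlib
import secrets
from urllib.parse import urlencode


def hashed_session_id() -> str:
    # The random id is generated locally; only its SHA-256 digest is shared.
    session_id = secrets.token_hex(16)
    return hashlib.sha256(session_id.encode("utf-8")).hexdigest()


def consultant_link(
    base_url: str,
    route: str,
    gap: str,
    exposure: str,
    migration: str,
    session_hash: str,
) -> str:
    # Only derived results go into the query string, never raw answers.
    params = {
        "route": route,
        "gap": gap,
        "exposure": exposure,
        "migration": migration,
        "sid": session_hash,
    }
    return f"{base_url}?{urlencode(params)}"


link = consultant_link(
    "https://example.com/consult",  # hypothetical URL
    route="verdict",
    gap="state_integrity",
    exposure="high",
    migration="planned",
    session_hash=hashed_session_id(),
)
```

Because the session id is random and never stored alongside the answers, the hash can tie a click to a session for funnel counting without revealing which options were chosen.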
The framing is framework-version agnostic. The five failure modes apply regardless of Hermes Agent minor version.