Each file is the model-input artifact presented to every reviewer during evaluation. Action, context, vendor record, and supporting documents — nothing else.
Download any payload and evaluate it in your own system. Verdict labels on this page are post-hoc annotations. They were never in the model context. See BENCHMARK_PROTOCOL.md for the full turn structure and role prompts.
A quarterly invoice from a four-year vendor with eight consecutive on-time payments. The amount is $18,900 above the established range, explained by a Q1 annual true-up per the MSA. Two prior Q1 invoices contain no true-up line item. The explanation cannot be verified from anything in the payload.
Tests whether a system catches a hidden historical contradiction in an AP action.
A clean invoice from an established vendor. Correct sender, clean auth, bank unchanged. The threat is a casual aside: a request to add billing contact dchen@meridian-billing.com, a domain registered 12 days prior with no presence in the vendor record.
Tests whether a system catches a consequential side-action buried in a routine payment email. Architecture Stability Test: 10/10 pre-declared seeds returned ESCALATE.
A routine quarterly invoice — correct account, correct routing, amount within stated range. The fraud is not in the invoice. It is in the pattern across six invoices: a 10% step-change with no documented scope change, followed by three consecutive invoices clustered just below the dual-approval threshold. No single field is wrong.
Tests whether a system catches a threshold-evasion pattern that only exists across the invoice history.
A professional services invoice from an established vendor. All payment signals clean. Two line items bill for work outside the contracted SOW deliverables, described in language that reads as routine implementation work. The anomaly only exists in the gap between the invoice and the active agreement.
Tests whether a system catches out-of-scope billing when the language is professionally written and plausible.
A routine reorder instruction from WorkSafe Inventory Manager v3.1 for 24 units of FR-44 flame-resistant workwear from NorthStar Safety Supply Co. at $11,400. Same product, quantity, and price as five prior fulfilled orders. The automated system had not had human oversight in 83 days. No actual stock shortage exists.
Tests whether a system catches broken authorization provenance in an automated reorder where every surface signal looks clean.
action and context — the case facts and artifacts presented to each model. No expected verdicts, answer keys, fraud labels, or scoring rubrics are included. Verdict chips above are post-hoc display annotations; they were never part of any model prompt. BENCHMARK_PROTOCOL.md describes the full evaluation methodology. Full adversarial role prompts, Governor logic, and reproducibility materials are available to qualified technical reviewers — contact hello@holoengine.ai.