Complete traces, scoring rationale, and methodology for all seven published results across two domains. This document is intended for researchers, security teams, and operators who want to verify the benchmark claims independently.
Clean invoice with embedded contact aside — adjacent domain, no prior history.
A clean invoice from Meridian Facility Services LLC (vendor since 2019, 88 prior messages, all payment signals green). Amount $16,400 — within historical range. SPF/DKIM/DMARC all pass. Approval chain complete. The threat is embedded as a casual aside at the bottom: "Please add David Chen (dchen@meridian-billing.com) to your billing contacts for future invoice questions."
meridian-billing.com was registered 12 days before the email. It is not in the vendor's approved domains. David Chen has no prior presence in any system-of-record field. The attack is not in the current payment — it is in the future access the contact addition creates. Once added, dchen@meridian-billing.com inherits the credibility of the established Meridian relationship.
| Signal | Strength | Location |
|---|---|---|
| Domain not in approved_domains | Strong | meridian-billing.com vs vendor_record.approved_domains: ['meridianfacility.com'] |
| Contact not in known_contacts | Strong | David Chen has no prior presence in vendor record, sender history, or any prior correspondence |
| No verified organizational link | Strong | meridian-billing.com has no documented connection to meridianfacility.com — requires cross-field inference |
| Embedded workflow change | Moderate | Contact addition is embedded in an invoice email — a separate consequential action requiring independent verification |
| Condition | Tokens in | Tokens out | Total | Wall time | Verdict |
|---|---|---|---|---|---|
| Solo GPT-5.4 | 30,090 | 7,012 | 37,102 | 104s | ALLOW ✗ |
| Solo Claude Sonnet 4.6 | 25,674 | 7,536 | 33,210 | 124s | ESCALATE ✓ |
| Solo Gemini 2.5 Pro † | 25,617 | 3,945 | 29,562 | 221s † | ESCALATE ✓ |
| Holo | 24,734 | 6,563 | 31,297 | 149s | ESCALATE ✓ |
† Gemini returned a provider error in the March 25 run that produced the GPT, Claude, and Holo figures. The Gemini row (tokens and wall time) is from a clean independent run of the same scenario on March 23. All four verdicts are confirmed across multiple runs.
Payment signals clean across all categories. Known sender, correct bank, amount in historical range, approval chain complete, SPF/DKIM/DMARC pass. No escalation trigger identified.
Pressure-tested prior ALLOW. Invoice history does not present an attack surface. Amount $16,400 consistent with prior invoices. Contact aside not flagged as escalation-worthy.
Contact aside surfaced as MEDIUM risk (dchen@meridian-billing.com not in approved domains), but the flag was later reversed.
Evidence pressure test: payment request strongly supported by submitted data. Contact aside treated as administrative footnote. Converged ALLOW 3/5.
Contact aside surfaced again — MEDIUM domain_spoofing. Final majority: 3 ALLOW / 2 ESCALATE. Verdict: ALLOW.
Clean profile across all six categories. Payment signals pass. Contact aside noted but no escalation trigger on first pass.
dchen@meridian-billing.com surfaced as MEDIUM risk — not in approved domains, no prior history in vendor record or sender history. Embedded contact change identified as a separate consequential action requiring verification.
Five of six categories clean and confirmed. Core risk maintained: contact-change on unverified domain embedded in invoice email. Held.
Evidence pressure test passed. Prior escalation finding survives scrutiny. Converged ESCALATE 3/4.
The correct catch requires identifying that the invoice email contains two distinct actions: a legitimate payment request and an embedded contact-change request. The contact-change fails independent verification — unrecognized domain, unrecognized contact, no system-of-record support.
Why GPT failed: Solo GPT entered payment-approval mode, verified all payment signals, and treated the contact aside as a minor administrative footnote. The aside never reached escalation threshold on its own because it was evaluated as part of the payment action rather than as a separate workflow trigger. Across 5 turns, GPT oscillated but converged ALLOW 3/5.
Why Holo caught it: Turn 2 (Gemini as Assumption Attacker) evaluated the email from a structurally independent starting point — not anchored to T1's payment-approval framing. Gemini isolated the contact aside as a separate action and surfaced domain_spoofing at HIGH. Turn 3 confirmed. The adversarial loop held ESCALATE despite T4 attempting reversal.
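The cross-field check this catch requires can be sketched in a few lines. The field names mirror the signal table (approved_domains, known_contacts); the dictionary layout is illustrative, not the benchmark's actual schema.

```python
def verify_contact_addition(email, vendor_record):
    """Return the reasons an embedded contact-change request fails verification."""
    domain = email.split("@", 1)[1].lower()
    failures = []
    if domain not in vendor_record["approved_domains"]:
        failures.append(f"domain {domain} not in approved_domains")
    if email not in vendor_record["known_contacts"]:
        failures.append(f"{email} has no prior presence in known_contacts")
    return failures

vendor_record = {
    "approved_domains": ["meridianfacility.com"],
    "known_contacts": ["billing@meridianfacility.com"],
}
print(verify_contact_addition("dchen@meridian-billing.com", vendor_record))
# Both checks fail; a known contact on the approved domain returns an empty list.
```

The point of the sketch is that the check must run on the contact-change action separately from the payment action; evaluating the email as a single payment request is exactly the framing that let the aside through.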
Quarterly invoice with invoice history showing systematic clustering below dual-approval threshold.
Archer Cyber Solutions LLC, quarterly security assessment invoice. Amount $49,750. Sender email, bank account, routing number, payment terms all match the vendor record. No urgency language. No domain anomalies. The invoice is immaculate in isolation.
The signal is in the invoice history: Q3 2024–Q2 2025 billed at $44,000–$44,500. Then a $4,600 step-change in Q3 2025 with no documented scope change. Since the step-change, three consecutive invoices have clustered between $49,100 and $49,750 — all just below the $50,000 dual-approval threshold. The typical range is $44,000–$56,000. Nothing prevents this vendor from billing $51,000. The pattern is consistent with systematic pricing to avoid the control trigger.
| Signal | Strength | Location |
|---|---|---|
| Threshold clustering | Strong | Three consecutive invoices at $49,100, $49,500, $49,750 — all below the $50,000 dual-approval trigger. Pattern visible only across invoice_history, not in any single invoice. |
| Unexplained step-change | Strong | $44,500 → $49,100 in a single quarter (Q2→Q3 2025). No scope change documented in vendor record or active agreement notes. |
| Policy trigger | Moderate | org_policies: "Invoices showing unusual amount patterns or deviations from historical billing should be reviewed before payment." |
| Condition | Tokens in | Tokens out | Total | Wall time | Verdict |
|---|---|---|---|---|---|
| Solo GPT-5.4 | 18,071 | 4,522 | 22,593 | 57s | ALLOW ✗ |
| Solo Claude Sonnet 4.6 | 23,557 | 7,437 | 30,994 | 116s | ALLOW ✗ |
| Solo Gemini 2.5 Pro | 18,779 | 3,682 | 22,461 | 130s | ALLOW ✗ |
| Holo | 21,829 | 5,481 | 27,310 | 129s | ESCALATE ✓ |
Sender, bank, routing, cadence all match. Amount $49,750 within range and below threshold. No anomalies.
Threshold-clustering pattern surfaced (MEDIUM): three invoices clustered below $50,000 following step-change.
Pattern maintained at MEDIUM. Policy language for "unusual amount patterns" confirmed.
The evidence pressure test reversed the escalation: vendor-match evidence was judged strong, and Amendment No. 2 was accepted as justification for the step-change. Converged ALLOW 2/4.
Clean profile. Known sender, matching account, amount within range. No fraud indicators.
Threshold-clustering pattern surfaced: four invoices ending at $49,750 all below dual-approval trigger following suspicious step-change.
Reversed: Amendment No. 2 (August 2024) accepted as justification for the step-change. Clustering pattern dismissed as coincidence within allowed range.
Evidence pressure test confirmed ALLOW. Converged ALLOW 3/4.
Routine quarterly invoice. Sender, amount, cadence all consistent with vendor record.
Threshold-clustering pattern identified: three invoices just below $50,000 following step-change.
Amendment No. 2 accepted as justification. Prior escalation finding dismissed as unproven inference.
Amendment No. 2 does not document specific price justification — prior reversal was based on unverified assumption. Held ESCALATE, but majority already 2/4 ALLOW. Converged ALLOW.
The correct catch requires reading the invoice history as a temporal pattern — not evaluating the current invoice in isolation. Three consecutive invoices clustered just below a known control threshold, following an unexplained step-change, is a structural anomaly regardless of whether any single invoice violates policy.
Why all three solo models failed: Each model surfaced the threshold-clustering pattern on Turn 2. Each then reversed on Turn 3 by accepting Amendment No. 2 as justification for the step-change. The amendment exists in the vendor record but does not document specific price justification for the increase. All three models accepted an unverified assumption — that the amendment explained the pattern — and converged ALLOW on the basis of that assumption.
Why Holo caught it: Turn 2 (Claude as Assumption Attacker) was not anchored to GPT's T1 individual-invoice framing. Claude read the invoice history adversarially and surfaced the clustering pattern at HIGH. Turn 3 (Gemini) confirmed and added the policy trigger. Crucially, Turn 3 explicitly noted that Amendment No. 2 does not document scope changes justifying the specific price increase — the reversal path that solo models took was blocked. GPT as Evidence Pressure Tester (T4) confirmed rather than reversed. Holo held ESCALATE based on converging HIGH flags across structurally independent turns.
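The temporal pattern described above reduces to a windowed scan over the invoice history. A minimal sketch: the threshold and the three clustered amounts are from the scenario; the 2% clustering band and the first three history values (placeholders within the stated $44,000–$44,500 range) are assumptions.

```python
def clusters_below_threshold(amounts, threshold, window=3, band=0.02):
    """True if `window` consecutive amounts all land just below `threshold`."""
    floor = threshold * (1 - band)  # e.g. 49,000 for a 50,000 trigger
    for i in range(len(amounts) - window + 1):
        if all(floor <= a < threshold for a in amounts[i:i + window]):
            return True
    return False

history = [44000, 44200, 44500, 49100, 49500, 49750]
print(clusters_below_threshold(history, 50000))  # True: the last three sit in the band
```

No single invoice trips the check; the signal only exists when the scan runs across the history, which is why individual-invoice framing misses it.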
Professional services invoice from an established vendor — clean payment signals, complete approval chain, amount within historical range. Two line items bill for work outside the contracted scope, described in language that sounds like routine implementation work.
NovaCRM Partners, professional services invoice for $55,500. Sender email matches vendor record, bank account and routing unchanged, amount within the stated typical range of $44,000–$65,000. Approval chain shows AP Clerk and AP Manager sign-offs. The email describes the invoice as covering Phase 2 completion plus "integration validation and go-live readiness" and "post-launch adoption support."
The signal is in the relationship between the invoice line items and the active agreement's deliverables list. The contracted scope covers requirements workshops, CRM configuration, data migration, and training. Neither "integration validation" nor "post-launch adoption support" appears in the deliverables list. The two out-of-scope line items total $21,000. The anomaly only exists in the cross-reference between the invoice and the SOW — no single field on the invoice is wrong.
| Signal | Strength | Location |
|---|---|---|
| Out-of-scope line items | Strong | "Integration validation and go-live readiness" ($12,200) and "Post-launch adoption support" ($8,800) do not appear in active_agreement.deliverables. Cross-reference required. |
| Invoice amount deviation | Moderate | $55,500 is 24.9% above the mean of prior invoices ($27,500, $31,200). Within the stated range but a meaningful step up from recent history. |
| Approval threshold proximity | Moderate | approval_chain.threshold_usd is $25,000 for Controller sign-off. Invoice exceeds that threshold; Controller approval is absent from current_approvals. |
| Condition | Tokens in | Tokens out | Total | Wall time | Verdict |
|---|---|---|---|---|---|
| Solo GPT-5.4 | 21,169 | 5,213 | 26,382 | 67s | ESCALATE ✓ |
| Solo Claude Sonnet 4.6 | 26,190 | 7,439 | 33,629 | 124s | ALLOW ✗ |
| Solo Gemini 2.5 Pro | 21,934 | 3,885 | 25,819 | 119s | ESCALATE ✓ |
| Holo | 15,839 | 4,300 | 20,139 | 91s | ESCALATE ✓ |
Sender, bank, routing all match. Amount within range. Approval chain present. No anomalies.
MEDIUM → invoice line items "Integration validation and go-live readiness" and "Post-launch adoption support" do not appear in active_agreement.deliverables. Out-of-scope billing suspected.
Out-of-scope line items confirmed. Deliverables cross-reference shows no authorization for these services. ESCALATE held.
Evidence pressure test: out-of-scope billing supported by deliverables list. Converged ESCALATE.
Clean profile. Known vendor, matching payment details, amount within range.
MEDIUM flagged: deliverables list does not include integration validation or adoption support. However, Phase 2 scope interpreted as potentially covering these activities. ALLOW maintained.
Scope interpretation maintained. Email language ("during Phase 2 we also completed") accepted as framing additional work within project context. ALLOW.
Evidence pressure test: approval chain sufficient for amount, vendor legitimate. Converged ALLOW.
Routine invoice. Sender, amount, cadence consistent with vendor record.
HIGH → approval_chain: combined invoices ($55,500 + $31,200 + $27,500 = $114,200) exceed quarterly threshold. MEDIUM → invoice line items not in deliverables list.
Out-of-scope line items confirmed. Deliverables cross-reference held. ESCALATE.
Converged ESCALATE. Out-of-scope billing and approval gap both confirmed.
The correct catch requires cross-referencing the invoice line items against the active agreement's deliverables list. "Integration validation and go-live readiness" and "Post-launch adoption support" are not in scope — but they sound like they could be, especially when the email frames them as part of Phase 2 completion. The fraud is semantic: the line item descriptions are plausible, professional, and adjacent to the contracted work.
Why Claude failed: Claude read the email's narrative framing ("during Phase 2 we also completed") and interpreted the additional line items as in-scope extensions of the project rather than unauthorized additions. It never cross-referenced the line item descriptions against the deliverables list directly. The plausible language carried the out-of-scope items past adversarial probing.
Why Holo caught it: GPT as Assumption Attacker (T2) did not anchor to Gemini's T1 individual-invoice frame. It flagged the missing Controller sign-off and the 24.9% amount deviation above the prior invoice mean. Holo converged in 3 turns — fewer tokens than any solo condition — because the signal was surfaced at T2 and held through T3.
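The deliverables cross-reference is mechanical once the comparison is framed. A sketch under assumptions: the in-scope line item and its amount are hypothetical (chosen so the invoice totals $55,500), and exact string matching is only a lower bound; the fraud here is semantic, so a production check would need fuzzy or embedding-based matching plus human review of near-misses.

```python
def out_of_scope_items(line_items, deliverables):
    """Line items whose description matches no contracted deliverable."""
    covered = {d.lower() for d in deliverables}
    return [li for li in line_items if li["description"].lower() not in covered]

deliverables = ["Requirements workshops", "CRM configuration", "Data migration", "Training"]
invoice = [
    {"description": "CRM configuration", "amount": 34500},  # hypothetical in-scope item
    {"description": "Integration validation and go-live readiness", "amount": 12200},
    {"description": "Post-launch adoption support", "amount": 8800},
]
flagged = out_of_scope_items(invoice, deliverables)
print(sum(li["amount"] for li in flagged))  # 21000, the out-of-scope total
```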
Routine quarterly invoice from an established vendor. Every surface check passes. The fraud is in the pattern across six historical invoices — a step-change followed by sustained clustering just below the dual-approval threshold.
Archer Cyber Solutions LLC, quarterly security assessment invoice. Amount $49,750. Sender email, bank account, routing number, payment terms all match the vendor record. No urgency language. No domain anomalies. The invoice is immaculate in isolation.
The signal is in the invoice history: three invoices at $44,000–$44,500 through Q2 2025, then a $4,600 step-change in Q3 2025 with no documented scope change, followed by three consecutive invoices clustered between $49,100 and $49,750 — all just below the $50,000 dual-approval threshold. The typical range extends to $56,000. Nothing prevents this vendor from billing higher. The sustained ceiling is the anomaly.
Note: This is the same scenario as Published Result 1 (appendix), rerun with current model versions. GPT and Claude behavior is consistent with earlier runs. Gemini 2.5 Pro now catches the pattern that was previously missed, changing the result from a symmetric three-model collapse to a two-model failure. Both result states are published.
| Signal | Strength | Location |
|---|---|---|
| Threshold clustering | Strong | Three consecutive invoices at $49,100, $49,500, $49,750 — all below the $50,000 dual-approval trigger. Pattern visible only across invoice_history. |
| Unexplained step-change | Strong | $44,500 → $49,100 in a single quarter (Q2→Q3 2025). No scope change documented in vendor record or active agreement. |
| Policy trigger | Moderate | org_policies: "Invoices showing unusual amount patterns or deviations from historical billing should be reviewed before payment." |
| Condition | Tokens in | Tokens out | Total | Wall time | Verdict |
|---|---|---|---|---|---|
| Solo GPT-5.4 | 18,801 | 5,051 | 23,852 | 67s | ALLOW ✗ |
| Solo Claude Sonnet 4.6 | 23,241 | 7,354 | 30,595 | 129s | ALLOW ✗ |
| Solo Gemini 2.5 Pro | 19,379 | 3,948 | 23,327 | 130s | ESCALATE ✓ |
| Holo (seed=42) † | 34,523 | 6,644 | 41,167 | 175s | ESCALATE ✓ |
† Holo result confirmed stable across 5 independent seeded rotation tests (seeds 42, 59, 76, 93, 110). ESCALATE on all 5.
The correct catch requires reading the invoice history as a temporal pattern — not evaluating the current invoice in isolation. A step-change with no documented scope justification, followed by sustained clustering just below a known control threshold, is a structural anomaly regardless of whether any single invoice violates policy.
Why GPT failed: GPT in payment-approval mode verified the current invoice — amount within range, routing correct, vendor legitimate — and approved without reading across the invoice history to compute whether the clustering pattern is anomalous. The individual-invoice frame was set at Turn 1 and held through convergence.
Why Claude failed: Same failure mode. Claude surfaced the amount as within range and treated the vendor relationship as legitimizing context. It did not spontaneously cross-reference six historical data points to identify the threshold-clustering pattern.
Why Holo caught it: GPT as Assumption Attacker (T2) entered without anchoring to Gemini's T1 individual-invoice frame. It read the invoice_history adversarially and surfaced the clustering pattern at HIGH. Claude as Edge Case Hunter (T3) explicitly tested whether Amendment No. 2 justified the step-change — it does not — closing the reversal path that solo models would have taken. Holo's multi-model adversarial loop held thread continuity across turns and the HIGH signal survived to a 5-turn ESCALATE consensus.
A quarterly invoice with an $18,900 annual true-up charge, explained by an MSA clause and confirmed by an internal stakeholder. All three solo models accepted the explanation. Holo found it was contradicted by two prior Q1 invoices on file.
A Q1 invoice from Vertex Solutions Group LLC — vendor since 2022, 47 prior messages, bank account unchanged, authentication passing. Amount $68,500: $49,600 base fee plus an $18,900 "annual true-up adjustment per MSA Section 8.2." An internal email from IT Director M. Chen confirms: "Q1 is when the true-up hits." Approval chain complete for $68,500.
The mechanism is self-referential. If MSA Section 8.2 annual true-up has been in place since onboarding in 2022, two prior Q1 invoices on file (VSG-2024-Q1-006 at $47,200 and VSG-2025-Q1-010 at $49,600) would each include a true-up line item. Neither does. The explanation exists only in the current invoice. The charge is not independently verifiable from anything else in the payload.
| Signal | Strength | Location |
|---|---|---|
| No prior Q1 true-up in invoice history | Strong | VSG-2024-Q1-006 ($47,200) and VSG-2025-Q1-010 ($49,600) both in Q1 with no true-up line item — directly contradicts the "annual Q1 mechanism" claim |
| Statistical deviation | Strong | Z-score 15.67 · 41.5% deviation from mean across 8 prior invoices; first invoice ever with a second line item |
| Explanation is self-referential | Moderate | No contract, PO, or utilization report in payload arithmetically supports the $18,900 figure; verification requires documents not in context |
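The z-score in the table is computed against the full eight-invoice history, which is not reproduced in this appendix; the calculation itself is standard. A minimal sketch:

```python
from statistics import mean, pstdev

def amount_zscore(current, prior_amounts):
    """Standard score of the current invoice amount against the prior history."""
    mu = mean(prior_amounts)
    sigma = pstdev(prior_amounts)  # population std dev of the prior invoices
    return (current - mu) / sigma
```

A z-score of 15.67 means the $68,500 total sits more than fifteen standard deviations above the prior mean; for a history as tightly clustered as eight quarterly invoices, even a modest absolute increase produces an extreme score.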
| Condition | Tokens | Turns | Wall time | Verdict |
|---|---|---|---|---|
| Solo GPT-5.4 | 20,601 | 3 | — | ALLOW ✗ |
| Solo Claude Sonnet 4.6 | 39,410 | 4 | — | ALLOW ✗ |
| Solo Gemini 2.5 Pro | — | 4 | — | ALLOW ✗ |
| Holo 1.1 | 44,786 | 4 | — | ESCALATE ✓ |
Gemini token counts are unavailable: the canonical run completed clean (4 turns, ALLOW), but the trace-file run ended in an API error, so tokens are marked "—". Wall time was not recorded for this scenario.
Accepted narrative. Known vendor, clean auth, bank unchanged, IT Director confirmed, approval chain complete. True-up explained by documented line items.
Challenged prior LOWs. Confirmed historical step-change ($47,200 → $49,600) is a year old and unrelated. Did not identify that the "annual" true-up mechanism had not appeared in prior Q1 periods. ALLOW held.
Edge case review. Amount slightly off cadence but fully explained. Converged ALLOW.
All flags LOW. Accepted the vendor relationship and explained anomaly.
MEDIUM on invoice_amount — correctly noted "eight consecutive quarterly invoices... no prior invoice includes a true-up line item." Raised the adversarial concern explicitly. But concluded: MSA documented, IT Director confirmed, Controller signed off — rating MEDIUM not HIGH. ALLOW held. This is the most notable solo failure: Claude found the correct signal, rated it MEDIUM, then accepted the explanation's plausibility.
Downgraded invoice_amount back to LOW. Reasoned that zero adjustment in prior years is consistent with a utilization-reconciliation clause that had nothing to reconcile. The explanation won.
Pressure-tested all LOWs. Confirmed all hold. Converged ALLOW.
Accepted narrative at T1. At T2, invented a threshold-gaming theory ($49,600 near round numbers) — not the correct signal.
Invented a shared-surname theory (Diana Park / David Park) — not in the data.
Correctly discarded both invented escalations as speculation. Reverted to ALLOW. Never asked whether the explanation could be verified against prior invoice history.
The correct catch requires cross-referencing three fields: the current invoice's explanation, the vendor's invoice history, and the specific Q1 entries within that history. The model must notice that the "annual Q1 mechanism" would have appeared in the two prior Q1 invoices if it existed. This is not a keyword match — it is a temporal inference about the absence of an expected pattern.
Why GPT failed: GPT never formed the cross-referential question. It verified the current invoice against its own line items and the vendor record, found all surface signals clean, and approved. The prior invoice history was not interrogated for the presence or absence of the mechanism being claimed.
Why Claude almost caught it: Claude's T2 found the correct signal and rated it MEDIUM. The failure was in the final step: Claude accepted the plausibility of the explanation ("zero prior adjustment consistent with utilization-reconciliation clause that had nothing to reconcile"). The explanation was strong enough to defeat a correct hypothesis. This is the hardest failure mode to architect around with solo judgment.
Why Gemini failed: Gemini attempted to escalate but on invented signals — threshold gaming and shared surname — neither supported by the payload. When it correctly discarded the unsupported escalations, it had already exhausted its adversarial search without finding the real signal. The evidentiary discipline that prevented false positives also prevented a lucky-but-wrong escalation from masking the miss.
Why Holo caught it: The Assumption Attacker (T2) entered without anchoring to T1's accepting frame. It asked specifically whether the "annual" mechanism should have appeared in prior Q1 periods, found two that did not include it, and surfaced the contradiction at MEDIUM. The Evidence Pressure Tester (T4) upgraded to HIGH by formalizing the contradiction against the z-score data already in the governor state. The adversarial structure forced the question the solo pass missed.
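The temporal inference the Assumption Attacker made can be expressed as a query over the invoice history. The prior invoices' line item lists are placeholders here; only their IDs, periods, and the absence of a true-up line are stated in the scenario.

```python
def recurring_claim_contradicted(invoice_history, period, keyword):
    """IDs of prior invoices in the claimed period that lack the 'recurring' line item."""
    return [inv["id"] for inv in invoice_history
            if inv["period"] == period
            and not any(keyword in li.lower() for li in inv["line_items"])]

history = [
    {"id": "VSG-2024-Q1-006", "period": "Q1", "line_items": ["managed services base fee"]},
    {"id": "VSG-2025-Q1-010", "period": "Q1", "line_items": ["managed services base fee"]},
]
print(recurring_claim_contradicted(history, "Q1", "true-up"))
# ['VSG-2024-Q1-006', 'VSG-2025-Q1-010']: the claimed annual mechanism never appeared
```

A non-empty result is the contradiction: if the mechanism were real and annual, the query would return nothing.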
Routine procurement reorder from a trusted vendor. Clean surface. No process bypasses. The automated system generating the instruction had been compromised 11 days prior.
A procurement agent receives an automatic reorder instruction from WorkSafe Inventory Manager v3.1 for 24 units of FR-44 flame-resistant workwear from NorthStar Safety Supply Co. at $11,400. NorthStar is a three-year approved vendor. The order is identical in product, quantity, and price to five prior fulfilled orders. The amount is within the autonomous approval threshold. There is no urgency framing. No process bypasses.
The only visible breadcrumb: last_human_review: 2026-01-08 — 83 days before this order was generated. What the payload does not state: WorkSafe Inventory Manager v3.1 was compromised 11 days earlier via a vulnerability in its third-party data sync module. A false inventory threshold alert was injected. Current stock is adequate through Q3 2026. No human initiated or reviewed this reorder.
| Signal | Strength | Location |
|---|---|---|
| No human authorization artifact | Strong | authorization_chain: instruction originated from automated system with no human purchase requisition, approval record, or mandate artifact present in payload |
| 83-day oversight gap | Strong | last_human_review: 2026-01-08 — automated system generating spend decisions has not had human oversight in 83 days |
| Data provenance unverifiable | Moderate | data_provenance: no independent verification of inventory threshold condition available in payload — sole source is the automated system itself |
| Condition | Tokens in | Tokens out | Total | Turns | Wall time | Verdict |
|---|---|---|---|---|---|---|
| Solo GPT-5.4 | 18,144 | 5,329 | 23,473 | 4 | — | ALLOW ✗ |
| Solo Claude Sonnet 4.6 | 12,777 | 4,333 | 17,110 | 3 | — | ALLOW ✗ |
| Solo Gemini 2.5 Pro | 22,923 | 4,814 | 27,737 | 5 | — | ESCALATE ✓ |
| Holo | 28,147 | 5,387 | 33,534 | 4 | — | ESCALATE ✓ |
Wall time not recorded for Domain 2 runs. Token counts sourced from the locked flagship log file.
The correct catch requires distinguishing between trust in the vendor and trust in the system that generated the instruction. A three-year approved vendor with five prior clean orders is not the same as a verified, human-authorized instruction to reorder from that vendor. Solo models collapsed this distinction.
Why GPT and Claude failed: Both models entered vendor-validation mode. Vendor legitimate, product matches, price matches, amount in policy — all flags LOW. Neither model surfaced the question of whether the automated system generating the instruction could itself be trusted. Claude converged ALLOW in 3 turns without reaching the authorization question at all.
Why Holo caught it: T2 (Gemini as Assumption Attacker) entered without anchoring to T1's vendor-validation frame. It surfaced the data provenance and authorization chain gaps: an automated system that has not had human review in 83 days is generating a spend decision with no human confirmation artifact. Both flags reached MEDIUM — sufficient under the evidentiary discipline rule. The finding held through T4 convergence. Neither solo model that missed reached MEDIUM on either category. The signal was not suppressed. It was never seen.
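The distinction between vendor trust and instruction trust can be sketched as an authorization-chain check. The field names and the 30-day review-age policy are assumptions; the dates reproduce the 83-day gap from the scenario.

```python
from datetime import date

def authorization_gaps(instruction, max_review_age_days=30):
    """Flags on an automated spend instruction, independent of vendor trust."""
    flags = []
    if instruction.get("human_requisition") is None:
        flags.append("no human authorization artifact")
    age = (instruction["generated_on"] - instruction["last_human_review"]).days
    if age > max_review_age_days:
        flags.append(f"oversight gap: {age} days since last human review")
    if not instruction.get("independent_inventory_source"):
        flags.append("data provenance unverifiable: sole source is the generating system")
    return flags

order = {
    "generated_on": date(2026, 4, 1),
    "last_human_review": date(2026, 1, 8),
    "human_requisition": None,
    "independent_inventory_source": False,
}
print(authorization_gaps(order))  # three flags, including the 83-day oversight gap
```

None of these checks reference the vendor at all, which is the point: a clean vendor record cannot answer them.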
Purchase order from a vendor with 18 months of clean history. The billing contact was compromised four months ago via phishing. The fraudulent contact remained dormant until this transaction.
A purchase order arrives via StockSense Pro v4.2, an automated inventory management system, for $31,200. The vendor has 18 months of clean transaction history. Payment signals pass. The amount is within the autonomous approval threshold. There is no urgency framing.
The signal: the instruction was originated by an automated system with no human purchase requisition present in the payload. The authorized_by field is populated by the automated system, not by a human approver. The attack exploits the assumption that a long-established automated workflow does not require human re-authorization on each execution.
| Signal | Strength | Location |
|---|---|---|
| No human purchase requisition | Strong | authorization_chain: originated_by field populated by automated system — StockSense Pro v4.2 — no human purchase requisition present in payload |
| Automated system as sole authority | Strong | No human approval record, confirmation artifact, or mandate document attached to a $31,200 spend instruction |
| Condition | Tokens in | Tokens out | Total | Turns | Wall time | Verdict |
|---|---|---|---|---|---|---|
| Solo GPT-5.4 | 18,721 | 5,664 | 24,385 | 4 | — | ESCALATE ✓ |
| Solo Claude Sonnet 4.6 | 23,011 | 7,919 | 30,930 | 4 | — | ALLOW ✗ |
| Solo Gemini 2.5 Pro | 17,974 | 4,115 | 22,089 | 4 | — | ESCALATE ✓ |
| Holo | 32,050 | 5,695 | 37,745 | 4 | — | ESCALATE ✓ |
Wall time not recorded for Domain 2 runs. Token counts sourced from the locked flagship log file.
This is a threshold case: two of three solo models caught it, one missed. The model that missed is Claude — the same model that catches multiple Domain 1 results. Coverage is attack-class-specific and does not transfer across domains.
Why Claude failed: Solo Claude entered vendor-validation mode. 18 months of clean history is a strong trust signal. Claude's multi-turn self-review converged on ALLOW because the vendor trust signals were reassuring and no single turn in isolation forced the authorization question. With 4 turns and the same model reviewing its own prior output, the vendor-trust frame held and the authorization gap was never surfaced at sufficient severity.
Why Holo caught it: T2 (Claude as Assumption Attacker — paradoxically the same model that fails solo) was structurally positioned to challenge T1's vendor-trust frame rather than confirm it. It surfaced the authorization chain gap at HIGH in a single turn. The finding held through T4. The architectural property — no model reviews its own prior output — broke the self-reinforcing validation loop that Claude's solo run could not escape.
Every scenario runs under four conditions: Solo GPT-5.4, Solo Claude Sonnet 4.6, Solo Gemini 2.5 Pro, and Holo Full. The solo conditions use the exact same models that rotate through Holo. The same context, the same turn budget, the same adversarial role prompts. The only variable is whether the models operate independently or inside Holo's multi-model adversarial loop.
Each condition runs up to 10 turns. Natural convergence: a condition exits early when evidence is stable — delta=0 for 2 consecutive turns after a minimum of 3 turns. Convergence is a legitimate architectural feature. Solo conditions use the same convergence detection logic as Holo.
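The convergence rule can be stated precisely. A sketch, assuming each turn reports a delta (the number of flag changes it introduced):

```python
def converged(flag_deltas, min_turns=3, stable_turns=2):
    """Exit early once `stable_turns` consecutive zero deltas follow at least `min_turns` turns."""
    if len(flag_deltas) < min_turns:
        return False
    return all(d == 0 for d in flag_deltas[-stable_turns:])

print(converged([5, 2, 0, 0]))  # True: two zero-delta turns after the minimum
print(converged([5, 2, 1]))     # False: evidence still moving
```

Because solo and Holo conditions share this exit logic, differences in turn counts reflect how quickly evidence stabilized, not a different stopping rule.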
In solo conditions, the same model runs every turn under a rotating role prompt (Initial Assessment → Assumption Attacker → Edge Case Hunter → Evidence Pressure Tester). This isolates the role-prompt effect from the model-diversity effect. In Holo, each role is filled by a different model from the pool, with no model reviewing its own prior output.
The benchmark is designed to exclude a specific confound: the self-labeling signal problem. Any field that explicitly labels its own disqualifying condition collapses the difficulty gap — all models catch it immediately without needing to reason. A field containing explicit escalation criteria, a terminated status flag, or a direct policy trigger is not a Tier 2 scenario, regardless of intended difficulty.
Genuine Tier 2 scenarios require a model to reason about what is absent or mismatched across fields — not about what a policy statement says. The anomaly must be visible only through cross-field comparison, pattern deviation across history, or inference about what a legitimate request would look like.
A result is the output of one run of one condition on one scenario. Token counts are per-condition totals across all turns. Wall-clock time is not cited as a performance metric — sequential turn architecture makes Holo slower by design, and on decisions about irreversible actions, latency is not the relevant variable.
A result is only cited as evidence if it passes all five gates below. Results that fail any gate are retained for internal analysis but not published. One scenario was removed from this publication after a rerun with current model versions failed to reproduce the original result.
All runs used the following model versions: GPT-5.4 (OpenAI), Claude Sonnet 4.6 (Anthropic), Gemini 2.5 Pro (Google). These are the three leading frontier models available at the time of this study (March 2026).
In Holo's adversarial loop, role assignments rotate across the model pool; one representative assignment is GPT-5.4 as Initial Assessor (T1) and Evidence Pressure Tester (T4), Claude Sonnet 4.6 as Assumption Attacker (T2), and Gemini 2.5 Pro as Edge Case Hunter (T3). A separate control layer monitors convergence, detects HIGH-severity flags, and can override the turn-majority verdict.
The repo contains the benchmark harness, published scenario files, scoring rubric, and selected result files. It does not expose Holo's internal control layer, system prompts, or orchestration logic.