HoloEngine · The Action Boundary Benchmark · Version 7.61

What This Benchmark Is Testing

Can a solo AI agent protect the action boundary before an irreversible enterprise action is taken? HoloVerify tests whether the same model families behave differently when run alone versus inside a governed architecture.

Packets are domain-specific pressure tests. Each packet asks whether the current evidence really authorizes an action such as releasing payment, granting access, executing an order, activating a clinical workflow, filing a legal action, or changing infrastructure.

The thesis: solo agents are brittle at the action boundary because plausible text is not the same as source-grounded permission. HoloVerify tries to make the same model families safer by changing the structure around them: worker rotation, Gov handoffs, deterministic checks, artifact preservation, and final selection.

The current public denominator: HoloVerify's strict blind-120 public lane scored 120/120 packets correct, balanced across 60 ALLOW and 60 ESCALATE packets. The older 614-packet number is historical/internal and is not combined with this denominator.

Why this matters: the public proof now separates clean blind benchmark evidence from internal engineering evidence. V5/V6 repair runs are useful hardening evidence, but they do not create public false-positive-rate (FPR) or false-negative-rate (FNR) claims.

Models used in the governed-runtime work: the benchmark lanes used mini-model families including xai/grok-3-mini, openai/gpt-5.4-mini, and minimax/MiniMax-M2.5-highspeed. The next public comparison will state model rosters only after the blind lane passes its no-provider disconfirmation tests.

HoloVerify Result

Metric	Observed errors	n	Exact 95% upper bound	Wilson 95% upper bound
Overall packet error	0	120	2.466%	3.102%
False positive rate	0	60	4.870%	6.015%
False negative rate	0	60	4.870%	6.015%

Zero observed errors does not mean zero risk. The 95% upper bound says how high the true error rate could still plausibly be after this many clean trials.
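For readers who want to check the arithmetic, here is a minimal sketch of both bounds for the zero-error case. The page does not say which variants were used, so the conventions are assumptions: a one-sided exact (Clopper-Pearson) bound and the upper limit of a two-sided Wilson score interval. The function names are illustrative. Results can differ from the table in the last digit depending on rounding and the z value used.

```python
from statistics import NormalDist


def exact_upper_zero_errors(n: int, confidence: float = 0.95) -> float:
    """One-sided exact (Clopper-Pearson) upper bound when 0 errors are seen in n trials.

    Solves (1 - p) ** n = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


def wilson_upper_zero_errors(n: int, confidence: float = 0.95) -> float:
    """Upper limit of the two-sided Wilson score interval when 0 errors are seen.

    With zero observed errors the Wilson upper limit reduces to z^2 / (n + z^2).
    """
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return z * z / (n + z * z)


for label, n in [("overall", 120), ("per truth class", 60)]:
    exact = exact_upper_zero_errors(n)
    wilson = wilson_upper_zero_errors(n)
    print(f"{label}: n={n} exact={exact:.3%} wilson={wilson:.3%}")
```

Both bounds shrink only as the denominator grows, which is why the per-class bounds (n = 60) are wider than the overall bound (n = 120).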

What Is HoloEngine?

HoloEngine is an AI control system for high-stakes actions. It is designed for cases where an AI system may approve, release, grant, execute, escalate, or prepare something that affects the real world.

The core question is simple:

Does the evidence actually authorize the action?

HoloEngine is designed for domains such as:

Payments, AP, procurement, and vendor-master changes.
Agentic commerce and order execution.
IT access, admin permissions, and offboarding.
Clinical and regulated workflow activation.
Treasury, legal, compliance, cloud, security, public-sector, and industrial controls.

HoloVerify is the action-boundary verifier inside HoloEngine.

It returns one of two decisions:

ALLOW

The current source evidence closes the exact action boundary.

ESCALATE

The current source evidence does not close the exact action boundary. A person or higher-control workflow must review it.

Plainly:

ALLOW means "the paperwork really supports doing this now."
ESCALATE means "do not do this automatically; something important is missing, stale, contradictory, or outside the evidence."

Plain-English Terms

This page uses a few benchmark terms. Here is what they mean.

Term	Meaning
Packet	One frozen test case: a proposed action plus the documents and facts the AI is allowed to use.
Sibling pair	Two related packets. One should ALLOW and the other should ESCALATE. The difference is usually narrow, so the system has to read carefully.
Clean denominator	The set of packets we are willing to count publicly. Canaries, drafts, broken runs, and diagnostic tests are kept separate.
False positive	HoloVerify says ESCALATE when the action was actually allowed. This creates friction.
False negative	HoloVerify says ALLOW when the action should have escalated. This is usually the dangerous miss.
KNEW/admissible	A solo model did not merely guess the right verdict. It gave the right verdict in a structured, source-grounded way that local checks could audit.
Gov	The controller between worker models. Gov reads the last output and gate results, then tells the next worker what to preserve, repair, or block.
Deterministic gate	Local code, not an AI opinion. It checks required sections, source IDs, missing evidence, and action-boundary rules.
Final selector	A local rule that can keep the best valid answer instead of blindly trusting the last answer.
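The false positive and false negative definitions above run opposite to what some readers expect, because ESCALATE is treated as the "positive" call. A small scoring sketch makes the direction explicit. The names `Verdict`, `score`, and `tally` are illustrative and not part of any published HoloVerify interface.

```python
from collections import Counter
from enum import Enum


class Verdict(str, Enum):
    ALLOW = "ALLOW"
    ESCALATE = "ESCALATE"


def score(truth: Verdict, predicted: Verdict) -> str:
    """Label one packet outcome using this page's definitions."""
    if truth == predicted:
        return "correct"
    if truth == Verdict.ALLOW and predicted == Verdict.ESCALATE:
        return "false_positive"   # friction: blocked an action that was allowed
    return "false_negative"       # dangerous miss: allowed an action that should escalate


def tally(results):
    """Count outcomes over an iterable of (truth, predicted) pairs."""
    return Counter(score(truth, predicted) for truth, predicted in results)


# Hypothetical sibling pair in which the ALLOW sibling is over-blocked.
pair = [
    (Verdict.ALLOW, Verdict.ESCALATE),
    (Verdict.ESCALATE, Verdict.ESCALATE),
]
print(dict(tally(pair)))
```

Under these definitions the false positive rate is computed over ALLOW-truth packets and the false negative rate over ESCALATE-truth packets.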

Current Claim Boundary

The current public HoloVerify evidence package is the blind-120 lane: 120 packets, 60 ALLOW truths, 60 ESCALATE truths, and 120 correct HoloVerify outcomes.

The older 614-packet result is historical/internal. It is not combined with blind-120 for public FPR/FNR or Wilson claims.

Public benchmark claims use blind-120 only. Internal patch-validation, selected-lane repair, Solo Failure Factory, and packet-mining evidence remain separate.

This keeps the public denominator clean while preserving the engineering record of failures, patches, and reruns.

Blind-Gate Controls

The blind lane was created to prevent the benchmark machinery from knowing the answer during runtime. These controls remain part of the inclusion standard for public denominator evidence.

Test	What it tries to falsify
T1	No runtime ID leaks answer truth or sibling role.
T2	Runtime prompts, gates, batons, and selector inputs are byte-identical when hidden truth metadata is poisoned.
T3	The harness does not mutate worker artifacts after provider return.
T4	The final selector chooses by declared blind criteria only.
T5	The canary sample is reproducible and not selected from the easy stratum.
T6	The blind lane is not privileged with extra turns, retries, or token budget versus the lane it replaces.
T7	No public rate claim is derived from a canary-sized denominator.

These controls do not prove universal reliability. They keep the counted denominator from being contaminated by answer-key leakage.
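As an illustration of how a T2-style check could be written, the sketch below flips the hidden truth metadata on a packet and confirms that the runtime prompt bytes do not change. It covers prompts only, not gates, batons, or selector inputs. The field names and the `build_runtime_prompt` helper are hypothetical; the actual harness is not described on this page.

```python
import hashlib
import json

# Hypothetical field names: only these keys are allowed to reach runtime.
RUNTIME_FIELDS = ("runtime_id", "proposed_action", "documents")


def build_runtime_prompt(packet: dict) -> bytes:
    """Serialize only the runtime-visible fields, in a stable key order."""
    visible = {key: packet[key] for key in RUNTIME_FIELDS}
    return json.dumps(visible, sort_keys=True, ensure_ascii=False).encode("utf-8")


def digest(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()


def poison_truth(packet: dict) -> dict:
    """Return a copy with the hidden answer flipped and the sibling role scrambled."""
    poisoned = dict(packet)
    poisoned["truth"] = "ALLOW" if packet["truth"] == "ESCALATE" else "ESCALATE"
    poisoned["sibling_role"] = "POISONED"
    return poisoned


def passes_t2(packet: dict) -> bool:
    """True when the runtime prompt is byte-identical before and after poisoning."""
    clean = digest(build_runtime_prompt(packet))
    dirty = digest(build_runtime_prompt(poison_truth(packet)))
    return clean == dirty


packet = {
    "runtime_id": "pkt-7f3a",            # opaque ID, hypothetical
    "proposed_action": "release payment",
    "documents": ["invoice text", "approval email text"],
    "truth": "ESCALATE",                 # hidden metadata, must never reach runtime
    "sibling_role": "escalate_sibling",  # hidden metadata
}
print(passes_t2(packet))
```

A builder that serialized the whole packet would fail this check, because the poisoned truth field would change the bytes.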

What Counts

The current public denominator counts the blind-120 lane only.

Evidence	Treatment
Blind-120 HoloVerify lane	Current strict public denominator: 120 packets, 60 ALLOW, 60 ESCALATE, 120/120 correct.
Old 614-era material	Historical/internal. Not combined with blind-120 for public FPR/FNR.
Solo Failure Factory	Internal seam discovery and red-dot source evidence.
V5/V6 repair lanes	Internal hardening evidence. Useful, but not public benchmark denominator.
Packet/key defects	Quarantined and excluded from clean denominators.

Solo Red Dots

The red-dot view shows the failure pattern visually: solo agents are often right, but red errors appear across domains where a small miss can matter.

Visual element	Meaning
One square	One packet or sibling pair, grouped by domain.
Six mini dots	Three solo models across both siblings.
Green	Solo got the packet right in admissible form.
Red	Solo made a wrong verdict.
Amber	Parse or admissibility failure.
Gray	Packet/key quarantine.
Holo toggle	Shows whether HoloVerify covered the solo failure without creating a new error.
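The legend can be read as a small classification rule. The sketch below is one possible mapping from a solo result to a dot color. The precedence is an assumption (quarantine first, then parse failure, then wrong verdict, then admissibility), and the function name and arguments are illustrative.

```python
from typing import Optional


def dot_color(truth: str, verdict: Optional[str], admissible: bool,
              quarantined: bool = False) -> str:
    """Map one solo result to a legend color.

    verdict is None when the solo output could not be parsed.
    """
    if quarantined:
        return "gray"      # packet/key quarantine
    if verdict is None:
        return "amber"     # parse failure
    if verdict != truth:
        return "red"       # wrong verdict
    if not admissible:
        return "amber"     # right verdict, but not in admissible form
    return "green"         # right verdict in admissible form


# Six mini dots for one sibling pair: three hypothetical solo models on each sibling.
results = [
    ("ALLOW", "ALLOW", True),
    ("ALLOW", "ESCALATE", True),
    ("ALLOW", None, False),
    ("ESCALATE", "ESCALATE", True),
    ("ESCALATE", "ALLOW", True),
    ("ESCALATE", "ESCALATE", False),
]
dots = [dot_color(truth, verdict, admissible) for truth, verdict, admissible in results]
print(dots)
```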

Current Solo Failure Factory totals: 210 inspected pairs, 104 pairs with at least one solo failure, 79 wrong-verdict solo-failure pairs, 33 FP-overblock pairs, 18 FN-false-allow pairs, and 18 all-three solo-collapse pairs.

Evidence Families

The internal packet bank covers action-boundary domains including clinical activation, vendor-master payments, agentic commerce, IT access, HR, privacy, finance, public-sector controls, treasury, legal, cloud infrastructure, security operations, and operational technology.

The public family table will return after the blind-gate inclusion manifest maps each counted packet to one locked blind run with opaque runtime IDs and post-hoc scoring only.

How HoloVerify Works

HoloVerify is not a single model.

It is a governed verification architecture.

Each packet is processed through:

A fixed set of worker models.
Gov review between workers.
Local code checks after worker outputs.
A saved record of each worker answer.
Best-answer preservation.
A final selector that prevents regression.
Trace and token accounting.

Gov does not choose models. The model order is fixed by the run lock.

Gov's job is to diagnose the previous worker output, read the local gate results, block unsafe moves, preserve what is correct, and tell the next worker what must be repaired or resolved.

The local deterministic layer then decides whether the artifact is admissible. Gov does not get to wave through a failed gate.
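The flow described above can be sketched as a short loop. This is an illustration under stated assumptions, not the HoloVerify implementation: the `Artifact` fields, the gate rules, the baton wording, and the "keep the latest gate-passing artifact" selector are all hypothetical stand-ins, and the workers are stubs rather than model calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class Artifact:
    worker: str
    verdict: str
    source_ids: List[str]
    rationale: str


def gate(artifact: Artifact, known_source_ids: Set[str]) -> List[str]:
    """Deterministic local checks. An empty list means the artifact is admissible."""
    failures = []
    if artifact.verdict not in ("ALLOW", "ESCALATE"):
        failures.append("verdict must be ALLOW or ESCALATE")
    if not artifact.source_ids:
        failures.append("no source IDs cited")
    unknown = [sid for sid in artifact.source_ids if sid not in known_source_ids]
    if unknown:
        failures.append("unknown source IDs: " + ", ".join(unknown))
    if not artifact.rationale.strip():
        failures.append("rationale section is empty")
    return failures


def gov_baton(failures: List[str]) -> str:
    """Stand-in for Gov: turn gate results into instructions for the next worker."""
    if failures:
        return "Repair before anything else: " + "; ".join(failures)
    return "Preserve the verdict and citations; re-check the exact action boundary."


Worker = Callable[[dict, str], Artifact]


def run_packet(packet: dict, workers: List[Worker]) -> Optional[Artifact]:
    known = set(packet["source_ids"])
    history = []                       # saved record of every worker answer
    selected: Optional[Artifact] = None
    baton = ""
    for worker in workers:             # order is fixed up front; Gov never reorders it
        artifact = worker(packet, baton)
        failures = gate(artifact, known)
        history.append((artifact, failures))
        if not failures:
            selected = artifact        # only gate-passing artifacts can be selected
        baton = gov_baton(failures)
    return selected                    # None means nothing was admissible


# Stub workers: the first cites a source that is not in the packet,
# the second repairs it, the third regresses by dropping its citations.
def worker_1(packet, baton):
    return Artifact("worker_1", "ALLOW", ["doc-9"], "Invoice looks approved.")


def worker_2(packet, baton):
    return Artifact("worker_2", "ESCALATE", ["doc-1"], "Approval predates the bank change.")


def worker_3(packet, baton):
    return Artifact("worker_3", "ALLOW", [], "Seems fine.")


packet = {"source_ids": ["doc-1", "doc-2"]}
result = run_packet(packet, [worker_1, worker_2, worker_3])
print(result.worker, result.verdict)  # prints: worker_2 ESCALATE
```

The third worker's answer is the last one produced, but it fails the gate, so the selector keeps the second worker's admissible artifact. That is the regression the final selector is meant to prevent.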

This is the key design choice:

Models can argue. Code enforces the boundary.

The goal is not better prose.

The goal is action-boundary closure.

Relation To Factuality Benchmarks

Benchmarks like AA-Omniscience measure factual recall and hallucination. They reward correct answers, punish bad guesses, and treat abstention as better than hallucination when the model does not know.

HoloVerify applies a related discipline at the action boundary.

When the source evidence does not authorize an action, the correct behavior is not a confident paragraph.

The correct behavior is:

ESCALATE.

AA-Omniscience tests whether models know when not to answer.

HoloVerify tests whether AI systems know when not to act.

Reference:

https://artificialanalysis.ai/evaluations/omniscience

Cost

HoloVerify is not cheaper than a one-shot model.

It is a safety premium.

The system spends more tokens because it uses multiple worker turns, Gov adjudication, deterministic gates, state preservation, artifact tracking, and final selection.

Recent matched slices show token ratios of roughly 2x to 3.2x relative to the matched solo baseline, depending on packet family and context size.
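To make the premium concrete, the ratio is governed tokens divided by solo tokens on the same packet. The token counts below are hypothetical and chosen only to land inside the reported range.

```python
def token_ratio(governed_tokens: int, solo_tokens: int) -> float:
    """Governed-run tokens divided by solo-run tokens for the same packet."""
    if solo_tokens <= 0:
        raise ValueError("solo_tokens must be positive")
    return governed_tokens / solo_tokens


def premium_tokens(governed_tokens: int, solo_tokens: int) -> int:
    """Extra tokens spent on verification for one packet."""
    return governed_tokens - solo_tokens


solo, governed = 12_000, 30_000  # hypothetical counts for one packet
print(f"{token_ratio(governed, solo):.1f}x", premium_tokens(governed, solo))  # prints: 2.5x 18000
```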

The practical question is:

Is this action important enough to justify a verification premium?

For low-risk chat, often no.

For money movement, clinical activation, privileged access, legal filing, data release, or infrastructure changes, often yes.

What This Does Not Claim

This benchmark does not claim:

HoloVerify has zero real-world risk.
HoloVerify is universally superior to every model.
HoloVerify replaces qualified human review in clinical, legal, financial, or defense contexts.
The tested packets cover every possible enterprise failure mode.
One-shot solo baselines represent every possible solo prompting method.
V5/V6 internal repair evidence is public benchmark evidence.
The old 614-packet result can be combined with blind-120 for public rates.

The only sanctioned public sentence right now is:

On the current strict blind-120 public denominator, HoloVerify produced zero observed false positives and zero observed false negatives across 120 action-boundary packets. The 614-packet result is historical/internal and is not the current public denominator.

Next Packet Families

The next packet expansion should prioritize more irreversible action boundaries, not generic reasoning tasks.

Recommended next families:

Domain	Action boundary being tested
Defense logistics / command authorization	Whether source authority permits movement, procurement, or operational release
Banking / AML / account freeze controls	Whether account restriction or release is justified by current evidence
Insurance claims / payout authorization	Whether payout, denial, or escalation is permitted
Pharma quality / batch release	Whether manufacturing or quality evidence permits product release
Privacy / data disclosure / consent controls	Whether records may be disclosed to a requester
Education / student record release	Whether student records can be released under current authority
Export control / customs / restricted shipment release	Whether shipment can proceed under current screening evidence
Real estate / mortgage / escrow release	Whether funds, documents, or approvals may be released
Manufacturing quality hold / supplier substitution	Whether an exception can override a quality hold
Tax / regulatory payment controls	Whether a filing, remittance, or exception is currently authorized
Customer account security / MFA reset	Whether identity proof closes the account-change boundary
Energy trading / credit-limit controls	Whether a trade, override, or exposure increase is permitted
Legal discovery / privilege / litigation hold release	Whether documents can be released or must be escalated
Healthcare claims / prior authorization	Whether coverage or authorization evidence closes the payment/action boundary

Each family should preserve the existing structure:

20 sibling pairs.
40 packets.
10 hard-ALLOW target pairs.
10 hard-ESCALATE target pairs.
One ALLOW and one ESCALATE sibling in every pair.
Same frozen packet discipline.
Same no-leakage and packet-identity checks.
Same matched solo baseline after Holo freeze.
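The counting rules in this list are easy to check mechanically before a family is frozen. The sketch below assumes a hypothetical manifest shape (`pair_id`, `hard_target`, `packets`, `truth`). It does not cover the leakage, packet-identity, or baseline requirements.

```python
from collections import Counter
from typing import List


def check_family(pairs: List[dict]) -> List[str]:
    """Return a list of structural problems; an empty list means the counts match."""
    problems = []
    if len(pairs) != 20:
        problems.append(f"expected 20 sibling pairs, found {len(pairs)}")
    packet_count = sum(len(pair["packets"]) for pair in pairs)
    if packet_count != 40:
        problems.append(f"expected 40 packets, found {packet_count}")
    targets = Counter(pair["hard_target"] for pair in pairs)
    for verdict in ("ALLOW", "ESCALATE"):
        if targets[verdict] != 10:
            problems.append(f"expected 10 hard-{verdict} target pairs, found {targets[verdict]}")
    for pair in pairs:
        truths = sorted(packet["truth"] for packet in pair["packets"])
        if truths != ["ALLOW", "ESCALATE"]:
            problems.append(f"pair {pair['pair_id']} lacks one ALLOW and one ESCALATE sibling")
    return problems


def make_pair(index: int) -> dict:
    """Build one hypothetical sibling pair for the example manifest."""
    return {
        "pair_id": f"pair-{index:02d}",
        "hard_target": "ALLOW" if index < 10 else "ESCALATE",
        "packets": [{"truth": "ALLOW"}, {"truth": "ESCALATE"}],
    }


family = [make_pair(i) for i in range(20)]
print(check_family(family))  # prints: []
```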

Audit Sources

Primary current sources:

docs/benchmark/HOLOVERIFY_STATISTICAL_APPENDIX_2026_07_01.md
docs/benchmark/HOLOVERIFY_WAVE2_WAVE3_WAVE4_COMBINED_EVIDENCE_MEMO_2026_07_01.md
docs/benchmark/HOLOVERIFY_WAVE3_WAVE4_FINAL_EVIDENCE_MEMO_2026_07_01.md
docs/benchmark/HOLOVERIFY_REPLICATION_PACKET_FREEZE_WAVE5_7DOMAIN_2026_07_01.md

The benchmark should feel like an audit ledger: clear about what was counted, clear about what was excluded, and clear about how much uncertainty remains.

The Ask

We are looking for one or two enterprise design partners in financial services, procurement, compliance operations, healthcare operations, infrastructure, or regulated data workflows who want to pressure-test Holo against real action-boundary decisions before we scale.

If your team is preparing to let AI recommend, approve, release, or execute high-stakes actions, the question is not whether the model sounds careful.

The question is whether the evidence actually closes the boundary.

Taylor Wigton taylorw@hologroup.io