agent-lab

Exhibit · Coverage Auditor

Coverage Auditor — SBC Evidence Audit

Read real public ACA Summary-of-Benefits-and-Coverage PDFs; extract a tier-aware benefit record and audit it for content-conditional issues.

proposes checks · never ranks plans · emits no numbers

Case file · Working prototype
Documents: 3 real public SBC PDFs
Disposer: anthropic
Run: 2026-06-12

The frame

Read on the same terms.

Every exhibit in the lab is judged against the same six questions. Here is how this one answers them.

01

Task

Read real public ACA Summary-of-Benefits-and-Coverage PDFs; extract a tier-aware benefit record and audit it for content-conditional issues.

02

Baseline

A flat field extractor / a static comparison table — no follow-up reasoning, no 'why', no verification that the numbers are internally coherent.

03

Agent decision

Which checks does this document warrant? The model proposes content-conditional checks and follow-ups; it cites field IDs only.

04

Trace

Per finding: the evidence box on the real page, the verified values, and the two-hop caused_by chain ('checked Y because X'). Rejected proposals are shown as well.

05

Score

Scored per difficulty (easy vs. hard fields) against two references that are never blended (PUF = agreement with the filing; SBC = agreement with the document), plus audit selection, finding quality, and flag false-positive rate. With n = 3, the result illustrates the method; it is not a measurement of accuracy. (Phase E; not all rendered at launch.)

06

Boundary

By architecture: no plan ranking or scoring capability at all; the model emits no numbers; a deterministic layer verifies every value; claim_type keeps 'verify' strictly distinct from 'defect'.

Comparison · not a ranking

Cross-plan comparison

A tier-aware grid keyed by (metric, member, network, tier) so values cannot be misaligned by construction. This is a comparison, not a recommendation — no “best”, no winner, no sort-by-price. Every populated cell resolves to a page and box in the source PDF; Not applicable and Not stated are shown explicitly, never blank.
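One way to picture the "cannot be misaligned by construction" property is a grid whose cells are addressed only by the full key. The sketch below is an assumption about how such a grid could look, not the exhibit's actual code: the names `CellKey`, `Cell`, and `Grid` are hypothetical, and the choice to fill a missing cell with "Not applicable" is a guess based on how the table behaves.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

NOT_APPLICABLE = "Not applicable"  # explicit marker, never a blank cell


@dataclass(frozen=True)
class CellKey:
    """Hypothetical composite key: (metric, member, network, tier)."""
    metric: str
    member: Optional[str] = None
    network: Optional[str] = None
    tier: Optional[str] = None


@dataclass(frozen=True)
class Cell:
    value: str
    page: Optional[int] = None  # source page in the PDF, when there is one


class Grid:
    """Cells are reachable only through (plan, key), so a value cannot
    drift into a neighbouring row or column."""

    def __init__(self, plans: List[str]) -> None:
        self.plans = list(plans)
        self._cells: Dict[Tuple[str, CellKey], Cell] = {}

    def put(self, plan: str, key: CellKey, value: str,
            page: Optional[int] = None) -> None:
        if plan not in self.plans:
            raise KeyError(f"unknown plan: {plan}")
        if (plan, key) in self._cells:
            raise ValueError(f"cell already populated: {plan}, {key}")
        self._cells[(plan, key)] = Cell(value, page)

    def row(self, key: CellKey) -> List[Cell]:
        # Assumption: a plan with no record for this key shows "Not applicable".
        return [self._cells.get((p, key), Cell(NOT_APPLICABLE))
                for p in self.plans]


plans = ["Premera Blue Cross Preferred Gold 1500",
         "Blue Saver Silver EPO", "Secure"]
grid = Grid(plans)
deductible = CellKey("overall deductible", "individual", "in-network")
grid.put(plans[0], deductible, "$1,500", page=1)
grid.put(plans[1], deductible, "$3,200", page=1)
grid.put(plans[2], deductible, "$9,200", page=1)

preferred = CellKey("diagnostic test", network="in-network", tier="preferred")
grid.put(plans[0], preferred, "30% coinsurance", page=2)
```

Because a row is assembled from keys rather than positions, a plan with an extra network tier simply gains rows in which the other plans read "Not applicable".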

| Benefit | Premera Blue Cross Preferred Gold 1500 | Blue Saver Silver EPO | Secure |
|---|---|---|---|
| overall deductible, individual, in-network | $1,500 p.1 | $3,200 p.1 | $9,200 p.1 |
| overall deductible, individual, out-of-network | $4,500 p.1 | Not Covered | Not Covered |
| overall deductible, family, in-network | $3,000 p.1 | $6,400 p.1 | $18,400 p.1 |
| out of pocket limit, individual, in-network | $6,300 p.1 | $9,200 p.1 | $9,200 p.1 |
| out of pocket limit, individual, out-of-network | Not Applicable | Not Covered | Not Covered |
| out of pocket limit, family, in-network | $12,600 p.1 | $18,400 p.1 | $18,400 p.1 |
| diagnostic test, in-network | Not applicable | 25% coinsurance p.2 | $0 copayment/visit subject to deductible (x-ray/lab work) p.2 |
| diagnostic test, in-network, participating | 40% coinsurance p.2 | Not applicable | Not applicable |
| diagnostic test, in-network, preferred | 30% coinsurance p.2 | Not applicable | Not applicable |
| diagnostic test, out-of-network | Not applicable | Not Covered p.2 | Not Covered p.2 |
| diagnostic test, out-of-network, non participating | Non-Participating: 60% coinsurance p.2 | Not applicable | Not applicable |
| drug tier, in-network, tier 1 | $15 copay / prescription (retail) $45 copay / prescription (mail) Deductible does not apply. $15 copay / prescription (retail) p.2 | $5 copay (retail) $12.50 copay (mail order) Deductible does not apply p.3 | $0 copayment/prescription subject to deductible (retail, Tier 1A/retail, Tier 1B) p.2 |
| drug tier, in-network, tier 2 | $45 copay / prescription (retail) $135 copay / prescription (mail) Deductible does not apply. $45 copay / prescription (retail) p.2 | $30 copay (retail) $75 copay (mail order) Deductible does not apply p.3 | $0 copayment/prescription subject to deductible (retail/mail order) p.2 |
| drug tier, in-network, tier 3 | 50% coinsurance 50% coinsurance (retail) p.3 | 25% coinsurance (retail) 25% coinsurance (mail order) p.3 | Not applicable |
| drug tier, in-network, tier 4 | 40% coinsurance p.2 | 25% coinsurance (retail) 25% coinsurance (mail order) p.3 | $0 copayment/prescription subject to deductible (retail/mail order) p.2 |
| drug tier, in-network, tier 5 | Not applicable | 25% coinsurance (retail) p.3 | Not applicable |
| drug tier, in-network, tier 6 | Not applicable | 50% coinsurance (retail) p.3 | Not applicable |
| drug tier, out-of-network, tier 1 | Not Covered p.2 | Not Covered p.2 | Not Covered p.2 |
| drug tier, out-of-network, tier 2 | Not Covered p.2 | Not Covered p.2 | Not Covered p.2 |
| drug tier, out-of-network, tier 3 | Not Covered p.2 | Not Covered p.2 | Not applicable |
| drug tier, out-of-network, tier 4 | 40% coinsurance p.2 | Not Covered p.2 | Not Covered p.2 |
| drug tier, out-of-network, tier 5 | Not applicable | Not Covered p.2 | Not applicable |
| drug tier, out-of-network, tier 6 | Not applicable | Not Covered p.2 | Not applicable |
| emergency room care, in-network | 30% coinsurance 30% coinsurance –––––––––––none––––––––––– p.3 | Accident: 40% coinsurance Medical Emergency: 40% coinsurance p.3 | Not applicable |
| emergency room care, out-of-network | Not applicable | Accident: 40% coinsurance Medical Emergency: 40% coinsurance p.3 | Not applicable |
| imaging, in-network | Not applicable | 25% coinsurance p.2 | $0 copayment/test subject to deductible (Office/Ind facility/other outpatient facility) p.2 |
| imaging, in-network, participating | 40% coinsurance p.2 | Not applicable | Not applicable |
| imaging, in-network, preferred | 30% coinsurance p.2 | Not applicable | Not applicable |
| imaging, out-of-network | Not applicable | Not Covered p.2 | Not Covered p.2 |
| imaging, out-of-network, non participating | Non-Participating: 60% coinsurance p.2 | Not applicable | Not applicable |
| primary care visit, in-network | First two visits: $1 copay / visit, deductible does not apply. Additional visits: $30 copay / visit, deductible does not apply. p.1 | $10 copay/visit Deductible does not apply p.2 | $0 copayment/visit subject to deductible p.2 |
| primary care visit, out-of-network | Not applicable | Not Covered p.2 | Not Covered p.2 |
| primary care visit, out-of-network, non participating | Non-Participating: 60% coinsurance p.2 | Not applicable | Not applicable |
| prior authorization penalty | $1,500 per occurrence p.1 | see penalty text p.2 | Not applicable |
| specialist visit, in-network | $60 copay / visit, deductible does not apply. p.2 | $90 copay/visit Deductible does not apply p.2 | $0 copayment/visit subject to deductible p.2 |
| specialist visit, out-of-network | Not applicable | Not Covered p.2 | Not Covered p.2 |
| specialist visit, out-of-network, non participating | Non-Participating: 60% coinsurance p.2 | Not applicable | Not applicable |

p.N = source page · hover a value for its field id · n = 3 documents, illustrative of method, not a measurement of accuracy.

Evidence board

The audit

The model proposes which checks a document warrants and cites field ids; a deterministic layer computes and verifies every value. The check set is content-caused — it differs by document, and unsupported proposals are rejected (shown below).
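The split between proposing and verifying can be sketched as a small gate: the model names a check and cites field ids, and a deterministic function looks up the verified values, computes the relation itself, and accepts or rejects the proposal. Everything named here (`ELIGIBLE_VERDICTS`, `relation`, `gate`, and the field ids) is hypothetical; only the check id, the verdicts `less_than` and `equal`, and the rejection wording come from the exhibit.

```python
from typing import Dict, NamedTuple, Optional, Tuple

# Hypothetical eligibility table: check id -> verdicts that justify the check.
ELIGIBLE_VERDICTS = {"deductible_oop_identity": {"equal"}}


class Decision(NamedTuple):
    accepted: bool
    reason: Optional[str]


def relation(left: int, right: int) -> str:
    """Deterministic comparison of two verified dollar amounts."""
    if left < right:
        return "less_than"
    if left > right:
        return "greater_than"
    return "equal"


def gate(check_id: str, cited: Tuple[str, str],
         fields: Dict[str, int]) -> Decision:
    """Accept a proposal only if the verified values support it.

    `cited` holds the two field ids the model pointed at; the model never
    supplies the values themselves.
    """
    if check_id not in ELIGIBLE_VERDICTS:
        return Decision(False, f"unknown check '{check_id}'")
    missing = [f for f in cited if f not in fields]
    if missing:
        return Decision(False, f"unknown field ids: {missing}")
    verdict = relation(fields[cited[0]], fields[cited[1]])
    eligible = ELIGIBLE_VERDICTS[check_id]
    if verdict not in eligible:
        names = ", ".join(sorted(eligible))
        return Decision(
            False,
            f"verdict '{verdict}' not eligible for check '{check_id}' "
            f"(eligible: {names})",
        )
    return Decision(True, None)


# Values from the comparison grid; the field ids are invented for this sketch.
cited = ("deductible_individual_in_network", "oop_limit_individual_in_network")
premera = {cited[0]: 1500, cited[1]: 6300}
secure = {cited[0]: 9200, cited[1]: 9200}

premera_decision = gate("deductible_oop_identity", cited, premera)
secure_decision = gate("deductible_oop_identity", cited, secure)
```

Under these assumptions the Premera proposal is rejected with the same wording the exhibit shows, while the Secure proposal passes because its deductible equals its out-of-pocket limit.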

Legend: Document value / check passed · Verify interpretation · Cross-plan difference · Resolved by follow-up · Inconsistency: review · Needs review / unresolved
38344AK1060001-01

Premera Blue Cross Preferred Gold 1500

checks: coverage_example_reconciles · family_embedded_deductible · family_structure
Evidence on the source page
p.1
Resolved by follow-up family_structure

The family limit ($12,600) is exactly twice the individual limit ($6,300). The doubling alone does not say whether each member's spending is embedded under one shared limit or accrues separately; review the plan narrative to confirm.

Evidence on the source page
p.1
Resolved by follow-up family_embedded_deductible

The plan narrative states that each family member meets their own individual deductible until the family total is reached, indicating an embedded family deductible. This resolves the earlier ambiguity about whether the family limit is embedded or aggregate. Verify interpretation.

Because family_structure returned two_times, the agent then ran family_embedded_deductible — a follow-up it would not otherwise run.
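The content-conditional step can be read as a lookup from (check, verdict) to a follow-up check, with the cause recorded alongside. The sketch below assumes that shape; the table `FOLLOW_UPS`, the function names, and the verdict `other` are hypothetical, while the check ids and the verdicts `two_times` and `equal` are the ones shown in this exhibit.

```python
from typing import Dict, List, Tuple

# Hypothetical trigger table: (check id, verdict) -> follow-up check id.
FOLLOW_UPS: Dict[Tuple[str, str], str] = {
    ("family_structure", "two_times"): "family_embedded_deductible",
    ("deductible_oop_identity", "equal"): "hdhp_cost_share_coherence",
}


def family_structure(individual_limit: int, family_limit: int) -> str:
    """Deterministic verdict on how the family limit relates to the
    individual limit. Only 'two_times' appears in the exhibit."""
    if family_limit == 2 * individual_limit:
        return "two_times"
    return "other"  # assumption: any other relation is left unclassified


def plan_follow_ups(results: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Return the follow-ups warranted by the verdicts, each with its cause."""
    planned = []
    for check_id, verdict in results:
        follow_up = FOLLOW_UPS.get((check_id, verdict))
        if follow_up is not None:
            planned.append({
                "check": follow_up,
                "caused_by": f"{check_id} returned {verdict}",
            })
    return planned


# Premera values from the grid: $6,300 individual, $12,600 family.
verdict = family_structure(6300, 12600)
queue = plan_follow_ups([
    ("family_structure", verdict),
    ("deductible_oop_identity", "less_than"),  # no follow-up for this verdict
])
```

A table like this keeps the 'checked Y because X' chain auditable: the follow-up exists only because a specific verdict was returned, and that verdict is stored with it.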
Evidence on the source page
p.1, p.2, p.7
Document value / check passed coverage_example_reconciles

The coverage-example components sum to the stated member responsibility ($2,100); the example is internally consistent.

Evidence on the source page
p.7
Document value / check passed coverage_example_reconciles

The coverage-example components sum to the stated member responsibility ($2,100); the example is internally consistent.

Evidence on the source page
p.1, p.7
Document value / check passed coverage_example_reconciles

The coverage-example components sum to the stated member responsibility ($2,100); the example is internally consistent.
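The reconciliation itself is plain arithmetic: add the components of a coverage example and compare the sum with the stated member responsibility. In the sketch below only the $2,100 total comes from the exhibit; the component labels and their split are invented for illustration, and the function name is hypothetical.

```python
from typing import Dict, NamedTuple


class Reconciliation(NamedTuple):
    computed_total: int
    stated_total: int
    reconciles: bool


def coverage_example_reconciles(components: Dict[str, int],
                                stated_total: int) -> Reconciliation:
    """Sum whole-dollar components and compare with the stated total."""
    computed = sum(components.values())
    return Reconciliation(computed, stated_total, computed == stated_total)


# ILLUSTRATIVE split only: these component amounts are not taken from the SBC.
example_components = {
    "deductibles": 1500,
    "copayments": 40,
    "coinsurance": 500,
    "limits_or_exclusions": 60,
}

result = coverage_example_reconciles(example_components, stated_total=2100)
```

Working in whole dollars as integers avoids floating-point drift, so the check passes or fails exactly.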

Rejected proposals

  • deductible_oop_identity — verdict 'less_than' not eligible for check 'deductible_oop_identity' (eligible: equal)
46944AL0710001-00

Blue Saver Silver EPO

checks: coverage_example_reconciles · family_embedded_deductible · family_structure
Evidence on the source page
p.1
Resolved by follow-up family_structure

The family limit ($18,400) is exactly twice the individual limit ($9,200). The doubling alone does not say whether each member's spending is embedded under one shared limit or accrues separately; review the plan narrative to confirm.

Evidence on the source page
p.1
Resolved by follow-up family_embedded_deductible

The plan narrative states that each family member meets their own individual deductible until the family total is reached, indicating an embedded family deductible. This resolves the earlier ambiguity about whether the family limit is embedded or aggregate. Verify interpretation.

Because family_structure returned two_times, the agent then ran family_embedded_deductible — a follow-up it would not otherwise run.
Evidence on the source page
p.1, p.2, p.6
Document value / check passed coverage_example_reconciles

The coverage-example components sum to the stated member responsibility ($2,700); the example is internally consistent.

Evidence on the source page
p.6
Document value / check passed coverage_example_reconciles

The coverage-example components sum to the stated member responsibility ($2,700); the example is internally consistent.

Evidence on the source page
p.6
Document value / check passed coverage_example_reconciles

The coverage-example components sum to the stated member responsibility ($2,700); the example is internally consistent.

Rejected proposals

  • deductible_oop_identity — verdict 'less_than' not eligible for check 'deductible_oop_identity' (eligible: equal)
13877AZ0070011-00

Secure

checks: coverage_example_reconciles · deductible_oop_identity · family_embedded_deductible · family_structure · hdhp_cost_share_coherence
Evidence on the source page
p.1
Resolved by follow-up family_structure

The family limit ($18,400) is exactly twice the individual limit ($9,200). The doubling alone does not say whether each member's spending is embedded under one shared limit or accrues separately; review the plan narrative to confirm.

Evidence on the source page
p.1
Resolved by follow-up family_embedded_deductible

The plan narrative states that each family member meets their own individual deductible until the family total is reached, indicating an embedded family deductible. This resolves the earlier ambiguity about whether the family limit is embedded or aggregate. Verify interpretation.

Because family_structure returned two_times, the agent then ran family_embedded_deductible — a follow-up it would not otherwise run.
Evidence on the source page
p.2
Verify interpretation hdhp_cost_share_coherence

Every in-network cost-sharing row is shown as subject to the deductible. This is consistent with the high-deductible design in which the member pays allowed charges up to the deductible — which coincides with the out-of-pocket maximum. Verify interpretation. (Not an error; not a statement about plan quality.)

Because deductible_oop_identity returned equal, the agent then ran hdhp_cost_share_coherence — a follow-up it would not otherwise run.
Evidence on the source page
p.1, p.2
Document value / check passed coverage_example_reconciles

The coverage-example components sum to the stated member responsibility ($2,800); the example is internally consistent.

Evidence on the source page
p.2, p.7
Document value / check passed coverage_example_reconciles

The coverage-example components sum to the stated member responsibility ($2,800); the example is internally consistent.

Evidence on the source page
p.2, p.7
Document value / check passed coverage_example_reconciles

The coverage-example components sum to the stated member responsibility ($2,800); the example is internally consistent.

Exhibit A · Featured finding

Featured finding

Verify interpretation deductible_oop_identity
On the Oscar Secure plan (AZ), the overall in-network deductible and the out-of-pocket limit are identical — $9,200 individual / $18,400 family — and every cost-sharing row reads '$0, subject to the deductible.' This is consistent with a high-deductible design in which the member pays allowed charges up to the deductible, which then coincides with the annual out-of-pocket maximum. Verify interpretation. (An observation about how the plan is structured — not an error in the document, and not a statement about the plan's quality.)

This finding triggered a follow-up — hdhp_cost_share_coherence — the agent would not otherwise have run.
On the real page

[Figure: both highlighted values on the Oscar Secure SBC, the in-network deductible and the out-of-pocket limit.]

Chain of custody

Replay & determinism

Findings replay exactly — only phrasing may vary. The proposer model is pinned and cites field ids; a deterministic verification layer re-derives every value and verdict, so the model emits no numbers. The determinism guarantee comes from that layer, not a temperature setting.

Pinned proposer: claude-opus-4-8
Verification: deterministic
Run date: 2026-06-12
Sample: 3 docs (method, not measurement)
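"Findings replay exactly, only phrasing may vary" suggests comparing runs on everything except the prose. A minimal sketch of that idea, assuming a finding is a dictionary whose free-text explanation lives under a `text` key (the key names and the `fingerprint` function are hypothetical):

```python
import hashlib
import json
from typing import Any, Dict


def fingerprint(finding: Dict[str, Any]) -> str:
    """Hash the deterministic parts of a finding, ignoring its phrasing."""
    stable = {k: v for k, v in finding.items() if k != "text"}
    payload = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


run_a = {
    "check": "family_structure",
    "verdict": "two_times",
    "values": {"individual": 6300, "family": 12600},
    "pages": [1],
    "text": "The family limit is exactly twice the individual limit.",
}
# Same verified content, different wording from the model.
run_b = dict(run_a, text="Family limit equals two times the individual limit.")
# A genuinely different finding.
run_c = dict(run_a, verdict="other")
```

Two runs then replay exactly when their fingerprints match, regardless of how the explanation is worded.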

Guarantees by architecture

The boundary

These are structural, not aspirational.

01

No ranking, scoring, or “best”.

The agent has no plan-ranking capability at all; the grid is a comparison and nothing more.

02

The model emits no numbers.

Every displayed value and relation comes from a deterministic layer; the model only proposes which check to run, by id.

03

“Verify” is kept distinct from “defect”.

An unusual-but-coherent design is flagged for confirmation — never as an error, never as a judgment about plan quality.

04

Human review by design.

Findings are an auditable queue for a reviewer; no insurance, legal, or purchasing advice is offered.
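Two of these guarantees can be enforced at the boundary where the model's output enters the system: a proposal may contain only identifiers, and its claim type must be one of a closed set. The sketch below assumes a proposal is a small dictionary; the key names, the identifier pattern, and `validate_proposal` are hypothetical, and only the two claim types named in this exhibit are modeled.

```python
import re
from enum import Enum
from typing import Any, Dict, List


class ClaimType(Enum):
    VERIFY = "verify"   # unusual but coherent: ask a reviewer to confirm
    DEFECT = "defect"   # an actual inconsistency in the document


# Assumption: ids are lowercase snake_case and start with a letter, so a
# dollar amount or percentage can never pass as an id.
_IDENTIFIER = re.compile(r"^[a-z][a-z0-9_]*$")
_ALLOWED_KEYS = {"check_id", "field_ids", "claim_type"}


def _is_identifier(value: Any) -> bool:
    return isinstance(value, str) and _IDENTIFIER.match(value) is not None


def validate_proposal(proposal: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the proposal is clean."""
    errors = []
    unexpected = sorted(set(proposal) - _ALLOWED_KEYS)
    if unexpected:
        errors.append(f"unexpected keys: {unexpected}")
    if not _is_identifier(proposal.get("check_id")):
        errors.append("check_id must be an identifier")
    field_ids = proposal.get("field_ids")
    if not isinstance(field_ids, list) or not field_ids:
        errors.append("field_ids must be a non-empty list")
    else:
        for field_id in field_ids:
            if not _is_identifier(field_id):
                errors.append(f"field id {field_id!r} is not an identifier")
    try:
        ClaimType(proposal.get("claim_type"))
    except ValueError:
        errors.append("claim_type must be 'verify' or 'defect'")
    return errors


clean = {
    "check_id": "deductible_oop_identity",
    "field_ids": ["deductible_individual_in_network",
                  "oop_limit_individual_in_network"],
    "claim_type": "verify",
}
smuggled_number = dict(clean, field_ids=["$9,200", 9200])
```

Rejecting anything that is not an identifier means a value can only reach the page through the deterministic layer, which is what makes the guarantee structural.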

Why a deterministic checker makes this workflow pay off: When do AI agents actually pay off? — verification, not capability, is the binding constraint.

Build an evidence-grounded agent for your documents

We design the checks, the deterministic verification layer, and the human-review boundary before the first agent is built — so every answer cites its evidence.