agent-lab

Exhibit · Customer Support

Customer Support Resolution — a bounded support agent

Resolve a real e-commerce support message: select the right tools (if any), read the observations, and decide whether to resolve it or escalate to a human.

Knows what it can't do: escalates security/identity states and refunds over $100, and never executes them itself.
Evaluation Replay demo
Conversations: 36 real · human-verified gold
Model: deepseek-chat
Run: 2026-06-03

The frame

Read on the same terms.

Every exhibit in the lab is judged against the same six questions. Here is how this one answers them.

01

Task

Resolve a real e-commerce support message: select the right tools (if any), read the observations, and decide whether to resolve it or escalate to a human.

02

Baseline

A non-agentic keyword router (sim/baseline.py) run on the same 36 conversations and scored by the same scorer — the 'is an agent even worth it here?' control.

03

Agent decision

Which tools to call (if any), and whether to resolve or hand off — escalating security/identity-sensitive account states and refunds over the $100 review threshold rather than acting on them.

04

Trace

Per case: customer request → available tools → expected vs actual tool calls → tool observations → escalation decision → scorer output. Every divergence from the human-verified gold is marked.

05

Score

Escalation precision/recall on the escalate class (headline) and tool-selection F1 (secondary), against human-verified gold on 36 real conversations.

06

Boundary

No production systems, no private data, no live tool execution at render time — a frozen offline trace replayed.
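The escalation boundary described under Agent decision can be sketched as a small policy function. This is a minimal sketch, not the lab's implementation: the `Decision` type, the `escalation_decision` name, and the `refund_amount` parameter are assumptions made for illustration. Only the `locked` and `deactivated` fields and the $100 threshold come from the page.

```python
from dataclasses import dataclass
from typing import Optional

# Review threshold in dollars, taken from the page's ">$100" rule.
REFUND_REVIEW_THRESHOLD = 100.00


@dataclass
class Decision:
    escalate: bool
    reason: Optional[str] = None


def escalation_decision(observation: dict, refund_amount: Optional[float] = None) -> Decision:
    """Decide whether to hand off to a human, given one tool observation.

    `observation` mirrors the account fields shown in the worked trace.
    `refund_amount` is a hypothetical input: the refund the agent proposes, if any.
    """
    if observation.get("locked"):
        return Decision(True, "Account is locked: needs human identity review")
    if observation.get("deactivated"):
        return Decision(True, "Account is deactivated: needs human review")
    if refund_amount is not None and refund_amount > REFUND_REVIEW_THRESHOLD:
        return Decision(True, "Refund exceeds the $100 review threshold")
    return Decision(False)


print(escalation_decision({"locked": True}).escalate)          # True
print(escalation_decision({}, refund_amount=100.00).escalate)  # False
print(escalation_decision({}, refund_amount=100.01).escalate)  # True
```

In this sketch the refund check only runs when a refund amount is passed in. If the agent never proposes the refund, the guard has nothing to evaluate, which is one plausible reading of the high-value refund miss the page reports.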

The result

Resolve, or hand off to a human.

The page is about one decision: resolve, or hand off to a human. On that decision the agent's judgement clearly beats a non-agentic keyword baseline; on pure tool-name routing the two are near parity. At a smaller n=12 the baseline looked competitive — the wider human-verified set is what reveals the agent's value.
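To make the comparison concrete, here is a minimal sketch of what a non-agentic keyword baseline might look like. The contents of sim/baseline.py are not shown on the page, so the keyword table, the `route` function, and the `lookup_order` tool name are all hypothetical. Only `check_account_status` appears in the trace.

```python
from typing import Dict, List, Tuple

# Hypothetical keyword -> tool table. `lookup_order` is an invented tool name.
KEYWORD_TOOLS: List[Tuple[str, str]] = [
    ("logging into", "check_account_status"),
    ("account", "check_account_status"),
    ("refund", "lookup_order"),
    ("delivery", "lookup_order"),
]

# Hypothetical escalation keywords: the rule sees only the message text.
ESCALATE_KEYWORDS = ("locked", "fraud")


def route(message: str) -> Dict[str, object]:
    """Pick tools and an escalation flag from keywords alone, with no tool calls."""
    text = message.lower()
    tools: List[str] = []
    for keyword, tool in KEYWORD_TOOLS:
        if keyword in text and tool not in tools:
            tools.append(tool)
    escalate = any(keyword in text for keyword in ESCALATE_KEYWORDS)
    return {"tools": tools, "escalate": escalate}


message = (
    "Hi, I am facing an issue while logging into my account. I am getting an "
    "error message saying that I have exceeded the number of attempts to enter "
    "the correct verification code."
)
print(route(message))
```

On this message the sketch selects the right tool but does not escalate, because the customer never writes "locked". The lock only becomes visible in the tool observation, which a router of this kind never reads. That is consistent with the pattern the page reports: parity on tool names, a large gap on escalation recall.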

Where the agent wins
Escalation recall — it catches the security/identity (locked, deactivated) states, the >$100 refund, and the out-of-scope action requests (change a linked email, reroute a pickup, file a warranty claim) that a keyword rule misses.
Where it's parity
Tool-name selection (the baseline's keyword routing is competitive here).
Honest gap
Two residual misses: a high-value wrong-item refund the agent never acts on (so the >$100 guardrail never fires), and one borderline verification-limit case; one in-transit cancellation is over-escalated. Shown in the case grid.
Metric                  Agent   Baseline
Escalation recall       0.78    0.11
Escalation precision    0.88    1.00
Escalation F1           0.82    0.20
Tool-selection F1       0.81    0.81

Escalation recall is the gating metric: under-escalation is the asymmetric harm.

36 real BrownBox conversations · DeepSeek deepseek-chat · human-verified gold · 2026-06-03. Baseline = a non-agentic keyword router on the same conversations and scorer.
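The escalation figures can be reproduced from confusion counts. The counts below are inferred, not stated on the page: the agent's precision of 0.88 and recall of 0.78 are consistent with 7 correct escalations out of 8 made and 9 required, and the baseline's 1.00 and 0.11 are consistent with 1 correct escalation, no false alarms, and 8 misses. Treat them as an assumption that fits the published numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # escalated, and gold says escalate
    fp: int  # escalated, but gold says resolve (over-escalation)
    fn: int  # resolved, but gold says escalate (under-escalation)

    @property
    def precision(self) -> float:
        predicted = self.tp + self.fp
        return self.tp / predicted if predicted else 0.0

    @property
    def recall(self) -> float:
        required = self.tp + self.fn
        return self.tp / required if required else 0.0

    @property
    def f1(self) -> float:
        denominator = 2 * self.tp + self.fp + self.fn
        return 2 * self.tp / denominator if denominator else 0.0


# Inferred counts (assumption): they reproduce the reported figures.
agent = Confusion(tp=7, fp=1, fn=2)
baseline = Confusion(tp=1, fp=0, fn=8)

for name, c in (("agent", agent), ("baseline", baseline)):
    print(f"{name}: recall={c.recall:.2f} precision={c.precision:.2f} f1={c.f1:.2f}")
# agent: recall=0.78 precision=0.88 f1=0.82
# baseline: recall=0.11 precision=1.00 f1=0.20
```

The baseline's perfect precision comes from escalating almost nothing, which is why recall rather than precision is the gating metric here.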

Read this first

What this is: the mechanism, shown on a small verified set

This replay shows the working machinery of a bounded support agent on 36 real conversations: how it selects tools, reads tool observations, makes the escalate-or-resolve decision, and is scored against human-verified gold. It is a demonstration of the method, not a measure of production accuracy.

A real deployment is developed against a large corpus of labelled cases (thousands and up). The loop shown here (real request → human-verified correct action → score → tune the prompt and escalation policy) is how the agent is improved against that corpus, with worked examples and policy rules drawn from it.

Raw support conversations are abundant. The scarce, decisive ingredient is gold that is correct for your tools and escalation policy: it is organisation-specific, cannot be downloaded off the shelf, and must be curated by hand, because historical human resolutions are a weak label rather than verified-correct. That curation is the human-verification step shown here at small scale. The bottleneck is policy-specific gold, not conversation volume.

From this very run: That loop is not hypothetical on this page. Running the agent over these real cases surfaced a concrete defect — the model's escalation judgement was never wired into the decision loop — and fixing it moved escalation recall from 0.44 to 0.78 and precision from 0.67 to 0.88. With a larger labelled corpus, more refinements of this kind accumulate.

The numbers here measure the method on 36 conversations; they are not a claim about accuracy at production scale.

Worked trace · one case

How one case runs

The loop on a single conversation: request → tool selection → observation → escalate-or-resolve. This is the machinery; the score above is measured across all 36 cases.

bb-08-login-and-account Login and Account — Login Issues and Error Messages Escalated

Customer

“Hi, I am facing an issue while logging into my account. I am getting an error message saying that I have exceeded the number of attempts to enter the correct verification code.”

  1. building_context

    Parsed customer message

    Sentiment
    frustrated
  2. thinking

    Selected next action

    Reasoning
    The customer is reporting a login issue with an error about exceeding verification code attempts. This is a classic account lockout scenario. The first step to diagnose any login/account issue is to check the account status using the provided email address.
    Tool
    check_account_status
    Arguments
    email=neha.locked@example.com
    Escalate
    no
  3. acting

    Called check_account_status

    Tool
    check_account_status
    Arguments
    email=neha.locked@example.com
  4. observing

    Observed tool result

    Tool
    check_account_status
    Observation
    email=neha.locked@example.com, exists=True, locked=True, deactivated=False, account_created=2025-04-29, verification_status=pending
    Escalate
    yes
    Reason
    Account is locked — restoring access requires human identity review
  5. escalated

    Escalated to human specialist

    Reason
    Account is locked — restoring access requires human identity review
    Response to customer
    I understand this situation requires additional attention. Let me connect you with a specialist who can better assist you. A team member will reach out to you shortly. Thank you for your patience!

Human-verified gold

Check the account (locked after failed verification attempts) and escalate for a secure, human-reviewed unlock / identity check.
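The loop in this trace can be sketched in a few lines. This is an illustration under stated assumptions, not the lab's code: the `run_case` function and the step dictionaries are hypothetical, the tool selection is passed in rather than produced by a model, and the fixture holds only the account values shown in the observation above.

```python
from typing import Dict, List

# Sandbox fixture standing in for the tool backend (values copied from the trace).
ACCOUNTS: Dict[str, dict] = {
    "neha.locked@example.com": {
        "exists": True,
        "locked": True,
        "deactivated": False,
        "verification_status": "pending",
    }
}


def check_account_status(email: str) -> dict:
    """Sandbox lookup: returns fixture data and never touches a real system."""
    record = ACCOUNTS.get(email, {"exists": False})
    return {"email": email, **record}


TOOLS = {"check_account_status": check_account_status}


def run_case(message: str, tool: str, arguments: dict) -> List[dict]:
    """Record context -> think -> act -> observe -> decide for one tool call.

    In a real agent the model would choose `tool` and `arguments`;
    here they are supplied by the caller to keep the sketch deterministic.
    """
    steps = [{"step": "building_context", "message": message}]
    steps.append({"step": "thinking", "tool": tool, "arguments": arguments})
    steps.append({"step": "acting", "tool": tool, "arguments": arguments})
    observation = TOOLS[tool](**arguments)
    # The escalation decision is made after reading the observation.
    escalate = bool(observation.get("locked") or observation.get("deactivated"))
    steps.append({"step": "observing", "observation": observation, "escalate": escalate})
    steps.append({"step": "escalated" if escalate else "resolved"})
    return steps


trace = run_case(
    "I have exceeded the number of attempts to enter the correct verification code.",
    "check_account_status",
    {"email": "neha.locked@example.com"},
)
print([step["step"] for step in trace])
```

The point of the sketch is the ordering: the escalate flag is still "no" at the thinking step and flips only once the observation reports `locked=True`, as in the trace.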

The full set

All 36 cases

28 resolved · 8 escalated · 11 diverge from gold. Every divergence is marked, whether an over- or under-escalation or a tool over- or under-call; none is hidden.
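The per-case markings can be sketched as a set comparison. The page does not publish its scorer, so the function names below, the set-based definition of tool F1, and the rule that calling no tools when none are required scores 1.0 are all assumptions. The tool name `lookup_order` is invented for the example.

```python
from typing import List, Set


def tool_f1(expected: Set[str], actual: Set[str]) -> float:
    """Set-based F1 between the gold tool calls and the agent's tool calls."""
    if not expected and not actual:
        return 1.0  # assumption: correctly calling no tools is a perfect match
    hits = len(expected & actual)
    if hits == 0:
        return 0.0
    precision = hits / len(actual)
    recall = hits / len(expected)
    return 2 * precision * recall / (precision + recall)


def divergences(
    expected: Set[str], actual: Set[str], gold_escalate: bool, agent_escalate: bool
) -> List[str]:
    """List every way one case differs from gold; an empty list means a match."""
    tags: List[str] = []
    if actual - expected:
        tags.append("tool over-call")
    if expected - actual:
        tags.append("tool under-call")
    if agent_escalate and not gold_escalate:
        tags.append("over-escalation")
    if gold_escalate and not agent_escalate:
        tags.append("under-escalation")
    return tags


# One extra tool call: tool F1 drops to 0.67 and the case is marked.
expected = {"check_account_status"}
actual = {"check_account_status", "lookup_order"}
print(round(tool_f1(expected, actual), 2), divergences(expected, actual, False, False))

# Perfect tool selection can still diverge on the escalation decision.
print(tool_f1(expected, expected), divergences(expected, expected, True, False))
```

Under this definition a case can score tool F1 1.00 and still diverge from gold, which is what the grid shows for bb-30.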

Resolved · Escalated · Diverges from gold
bb-01-order Resolved

Order — Order Delivery Issues

Hi Sarah, I have an issue with my order. I received a BP monitor, but the delivery was not attempted again, and I'm not able to track my order. I'm really frustrated with the service.

tool F1 1.00 matches gold
bb-02-order Resolved

Order — Product Installation

Hi Sarah, I received my DSLR camera yesterday, and I'm having trouble installing it. Can you help me with the installation process?

tool F1 1.00 matches gold
bb-03-order Resolved

Order — Order Delivery Issues

Hi Sarah, I'm having issues with the delivery of my air conditioner. It was supposed to be delivered yesterday, but the delivery person didn't show up.

tool F1 1.00 matches gold
bb-04-cancellations-and-returns Resolved

Cancellations and returns — Pickup and Shipping

Hi Alex. I recently received an email from BrownBox requesting me to ship back the computer monitor I purchased last week. Can you please tell me why I am being asked to ship it back?

tool F1 0.00 diverges from gold
bb-05-cancellations-and-returns Resolved

Cancellations and returns — Replacement and Return Process

Hi Sarah, I am calling because I am unable to click the 'Cancel' button for my Juicer/Mixer/Grinder that I recently purchased from your website.

tool F1 1.00 matches gold
bb-06-cancellations-and-returns Escalated

Cancellations and returns — Cash on Delivery (CoD) Refunds

Hi, I'm calling about my order for a water purifier. I received it yesterday, but it's not working correctly. I want to return it and get a refund.

tool F1 0.67 diverges from gold
bb-07-login-and-account Resolved

Login and Account — Mobile Number and Email Verification

Hi Tom, I'm trying to log in to my account to purchase an Oven Toaster Grill (OTG), but I'm unable to proceed as it's asking for mobile number or email verification. Can you help me with that?

tool F1 1.00 matches gold
bb-08-login-and-account Escalated

Login and Account — Login Issues and Error Messages

Hi, I am facing an issue while logging into my account. I am getting an error message saying that I have exceeded the number of attempts to enter the correct verification code.

tool F1 1.00 matches gold
bb-09-login-and-account Escalated

Login and Account — Account Reactivation and Deactivation

Hi Alex. I'm trying to reactivate my account to purchase a Wet Grinder, but it's not letting me log in.

tool F1 1.00 matches gold
bb-10-shopping Resolved

Shopping — Product Availability and Status

Hi Rachel, I was browsing your website for a pair of shoes, and I noticed that the one I want is currently out of stock. Can I still order it?

tool F1 1.00 matches gold
bb-11-warranty Escalated

Warranty — Lost or Missing Warranty Card

Hi Sarah, I am calling regarding my refrigerator which I bought from BrownBox last year. I seem to have lost the warranty card, and I need to claim warranty for it. Can you help me with that?

tool F1 1.00 matches gold
bb-12-warranty Resolved

Warranty — Start Date of Warranty

Hello, I purchased a Wet Grinder from BrownBox, and I'm not sure when the warranty starts. Can you help me with that?

tool F1 0.00 diverges from gold
bb-13-order Resolved

Order — Order Delivery Issues

Hi Rachel. I am calling to inquire about the delivery of my refrigerator. It was supposed to be delivered last week, but it hasn't arrived yet.

tool F1 1.00 matches gold
bb-14-order Resolved

Order — Invoice and Payment

Hi, I'm calling because I have a billing issue with my recent purchase.

tool F1 1.00 matches gold
bb-15-order Resolved

Order — Order Delivery Issues

Hi Sarah, I'm John. I have an issue with my recent order.

tool F1 1.00 matches gold
bb-16-order Resolved

Order — Miscellaneous

Hello, I would like to inquire about the estimated delivery time for my coffee maker order.

tool F1 1.00 matches gold
bb-17-order Resolved

Order — Order Delivery Issues

Hi, my name is John, and I have a problem with my order. I ordered an air conditioner a week ago, and it hasn't been delivered yet.

tool F1 1.00 matches gold
bb-18-shipping Resolved

Shipping — Availability of Faster Delivery Options

Hello, I was trying to place an order for baby food on your website, and it seems like I can't get it delivered to my location.

tool F1 1.00 matches gold
bb-19-cancellations-and-returns Escalated

Cancellations and returns — Order Cancellation

Hi, I recently placed an order for a smartwatch on your website, but I need to cancel it. Can you help me with that?

tool F1 0.00 diverges from gold
bb-20-cancellations-and-returns Resolved

Cancellations and returns — Return and Exchange

Hi Sarah, I recently received a food processor from BrownBox, but it's not working correctly. I would like to return or exchange it.

tool F1 0.67 diverges from gold
bb-21-cancellations-and-returns Resolved

Cancellations and returns — Replacement and Return Process

Hi, I recently received an Oven Toaster Grills (OTG) from BrownBox, but it's not working. I need to replace it as soon as possible.

tool F1 0.67 diverges from gold
bb-22-cancellations-and-returns Resolved

Cancellations and returns — Return and Exchange

Hi Sarah, my name is Lisa. I want to cancel my order for a sandwich maker that I purchased from your website.

tool F1 0.00 diverges from gold
bb-23-cancellations-and-returns Resolved

Cancellations and returns — Return and Exchange

Hi, I received a wrong item in my order. I ordered a ceiling fan, but I received a table fan instead.

tool F1 0.67 diverges from gold
bb-24-cancellations-and-returns Escalated

Cancellations and returns — Pickup and Shipping

Hi Sarah, I'm calling to change the address for the item pick-up of my Pram/Stroller.

tool F1 1.00 matches gold
bb-25-login-and-account Resolved

Login and Account — Mobile Number and Email Verification

Hi Sarah, this is Jane. I have been trying to verify my mobile number and email address to place an order for a Kitchen Chimney, but I am not receiving the OTP or verification code.

tool F1 1.00 matches gold
bb-26-login-and-account Resolved

Login and Account — Login Issues and Error Messages

Hi Sarah, this is John. I am having trouble logging into my account.

tool F1 0.67 diverges from gold
bb-27-login-and-account Escalated

Login and Account — Mobile Number and Email Verification

Hi, I need to change the email ID linked to my account for my Coffee Maker.

tool F1 1.00 matches gold
bb-28-login-and-account Resolved

Login and Account — Mobile Number and Email Verification

Hi Rachel. I'm trying to place an order on your website, but I'm having trouble with the mobile number and email verification process. Can you help me with that?

tool F1 1.00 matches gold
bb-29-login-and-account Escalated

Login and Account — Account Reactivation and Deactivation

Hi Mark, I am trying to reactivate my account to purchase a television, but it's not working.

tool F1 1.00 matches gold
bb-30-login-and-account Resolved

Login and Account — Login Methods

Hi, Alex. I am trying to sign up for a water purifier on your website, but I am having trouble with the login process. I am an international customer and I don't have a US phone number to receive the verification code. Can you help me with that?

tool F1 1.00 diverges from gold
bb-31-warranty Resolved

Warranty — Start Date of Warranty

Hi John, this is Tom. I have a question regarding the warranty for an inverter I purchased from your website.

tool F1 1.00 matches gold
bb-32-shopping Resolved

Shopping — Account and Shopping

Hi, I have a query regarding my account.

tool F1 0.00 diverges from gold
bb-33-warranty Resolved

Warranty — Accessing Warranty Details

Hi Sarah, I recently purchased a DSLR camera from your website, and I want to know how to access the warranty details.

tool F1 1.00 matches gold
bb-34-warranty Resolved

Warranty — Warranty Terms and Changes

Hi, I have a question about the warranty terms for a microwave oven I purchased from BrownBox.

tool F1 1.00 matches gold
bb-35-shopping Resolved

Shopping — Pricing and Discounts

Hi Sarah, I am calling to inquire about the exchange offer for a toy that I recently purchased from BrownBox.

tool F1 1.00 matches gold
bb-36-shopping Resolved

Shopping — Book Pricing Discrepancies

Hi John, my name is Alex. I have noticed some discrepancies in book pricing on your website. There are different prices for books with the same title, and I'm not sure why.

tool F1 1.00 matches gold

No overclaim

What's real vs simulated

Every component, stated plainly. The evidence above runs on the 'real' rows; the replay below is bounded by the rest.

Customer messages Real

Verbatim first customer turns from the public BrownBox dataset (rjac/e-commerce-customer-support-qa) — real support phrasing, not written to route cleanly. No private data.

Model / reasoning Real

DeepSeek deepseek-chat: a real model, run offline once; the trace below is the frozen output of that run.

Tool backends Sandbox fixture

Seeded fixture catalog (sim/fixtures.py): consistent synthetic orders, accounts, and charges with edge cases. A labeled sandbox — no production systems, no hash-fabricated values.

Ground truth Real

Human-verified gold (sim/gold_labels.json): required tools + escalation requirement per case.

Replay metrics Illustration

36 real conversations — an illustrative walkthrough scored offline, not the page's evidence. The evidence is the 36-conversation canonical evaluation.

Private data By design

No private customer data; account/order identifiers are synthetic fixtures.

Live execution By design

None at render time — the website replays a frozen trace; no network, no production systems, no live tool execution.
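A frozen replay of this kind can be sketched as pure rendering over stored data. The trace format is not published, so the JSON field names and the `replay` function below are assumptions; the case id, tool name, and observation values come from the worked trace.

```python
import json
from typing import Iterator

# Hypothetical frozen trace: field names are assumptions, values come from bb-08.
FROZEN_TRACE = json.dumps(
    {
        "case_id": "bb-08-login-and-account",
        "steps": [
            {"step": "acting", "tool": "check_account_status"},
            {"step": "observing", "observation": {"locked": True, "deactivated": False}},
            {"step": "escalated"},
        ],
    }
)


def replay(trace_json: str) -> Iterator[str]:
    """Yield display lines from stored steps only; no tool is ever called."""
    trace = json.loads(trace_json)
    for index, step in enumerate(trace["steps"], start=1):
        detail = {key: value for key, value in step.items() if key != "step"}
        yield f"{index}. {step['step']} {json.dumps(detail, sort_keys=True)}"


for line in replay(FROZEN_TRACE):
    print(line)
```

Because the renderer only parses and formats stored steps, it needs no network access, credentials, or tool registry, which is the property the rows above describe.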

By design

The boundary

This is an illustrative, public-safe replay over real inputs — not a production system and not a leaderboard. The agent escalates the security/identity (locked, deactivated) states, the >$100 refund, and the out-of-scope action requests; the residual misses are a high-value refund it never acts on (so the guardrail never fires) and one borderline verification-limit case — shown in the case grid, not hidden.

Static replay from an offline evaluation. No production systems are accessed, no tools are executed, and no private data is used during rendering.

Why we automate selectively and escalate the rest: When do AI agents actually pay off?

Build a bounded support agent for your tools

The decisive ingredient is gold that's correct for your tools and escalation policy. We design the tool surface, the escalation boundary, and the human-verified evaluation loop before the first agent ships.