Exhibit · Customer Support
Customer Support Resolution — a bounded support agent
Resolve a real e-commerce support message: select the right tools (if any), read the observations, and decide whether to resolve it or escalate to a human.
- Conversations: 36 real · human-verified gold
- Model: deepseek-chat
- Run: 2026-06-03
The frame
Read on the same terms.
Every exhibit in the lab is judged against the same six questions. Here is how this one answers them.
Task
Resolve a real e-commerce support message: select the right tools (if any), read the observations, and decide whether to resolve it or escalate to a human.
Baseline
A non-agentic keyword router (sim/baseline.py) run on the same 36 conversations and scored by the same scorer — the 'is an agent even worth it here?' control.
Agent decision
Which tools to call (if any), and whether to resolve or hand off — escalating security/identity-sensitive account states and refunds over the $100 review threshold rather than acting on them.
Trace
Per case: customer request → available tools → expected vs actual tool calls → tool observations → escalation decision → scorer output. Every divergence from the human-verified gold is marked.
Score
Escalation precision/recall on the escalate class (headline) and tool-selection F1 (secondary), against human-verified gold on 36 real conversations.
Boundary
No production systems, no private data, no live tool execution at render time — a frozen offline trace replayed.
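The agent decision described in this frame can be sketched as a small policy function. This is a minimal illustration under stated assumptions, not the exhibit's implementation: the `locked` and `deactivated` fields follow the tool observation format used in this exhibit's trace, while `Observation`, `refund_amount`, `REVIEW_THRESHOLD_USD`, and `should_escalate` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed constant: the page states a $100 review threshold for refunds.
REVIEW_THRESHOLD_USD = 100.0


@dataclass
class Observation:
    """Hypothetical summary of what the tools returned for one case."""
    locked: bool = False
    deactivated: bool = False
    refund_amount: Optional[float] = None  # None when no refund is in play


def should_escalate(obs: Observation) -> Tuple[bool, str]:
    """Return (escalate, reason) for one case."""
    # Security/identity-sensitive account states always go to a human.
    if obs.locked:
        return True, "account locked: needs human identity review"
    if obs.deactivated:
        return True, "account deactivated: needs human review"
    # Refunds strictly over the threshold are reviewed rather than acted on.
    if obs.refund_amount is not None and obs.refund_amount > REVIEW_THRESHOLD_USD:
        return True, "refund over the review threshold"
    return False, "within the agent's bounds"


print(should_escalate(Observation(locked=True)))
print(should_escalate(Observation(refund_amount=42.0)))
```

Keeping the policy as a pure function of the observation is one possible design choice: it lets the same rules be replayed offline against a frozen trace.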
The result
Resolve, or hand off to a human.
The page is about one decision: resolve, or hand off to a human. On that decision the agent's judgement clearly beats a non-agentic keyword baseline (escalation recall 0.78 vs 0.11); on pure tool-name routing the two are level (tool-selection F1 0.81 for both). At a smaller n=12 the baseline looked competitive; the wider human-verified set is what reveals the agent's value.
- Where the agent wins: escalation recall. It catches the security/identity (locked, deactivated) states, the >$100 refund, and the out-of-scope action requests (change a linked email, reroute a pickup, file a warranty claim) that a keyword rule misses.
- Where it's parity: tool-name selection (the baseline's keyword routing is competitive here).
- Honest gap: two residual misses (a high-value wrong-item refund the agent never acts on, so the >$100 guardrail never fires, and one borderline verification-limit case), plus one in-transit cancellation that is over-escalated. All three are shown in the case grid.
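To make the comparison concrete, here is a minimal sketch of what a non-agentic keyword router can look like. It is an assumption for illustration only: the contents of sim/baseline.py are not shown on this page, and the keyword tables, the `route` function, and every tool name other than `check_account_status` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical keyword -> tool table; only check_account_status appears in this exhibit.
KEYWORD_TOOLS: Dict[str, str] = {
    "logging into": "check_account_status",
    "log in": "check_account_status",
    "refund": "lookup_order",       # hypothetical tool name
    "delivered": "track_shipment",  # hypothetical tool name
}

# Hypothetical escalation keywords: the router can only react to words in the message.
ESCALATE_KEYWORDS = ("fraud", "hacked")


def route(message: str) -> Tuple[List[str], bool]:
    """Pick tools and an escalation flag from message text alone."""
    text = message.lower()
    tools = sorted({tool for kw, tool in KEYWORD_TOOLS.items() if kw in text})
    escalate = any(kw in text for kw in ESCALATE_KEYWORDS)
    return tools, escalate


message = ("Hi, I am facing an issue while logging into my account. I am getting an "
           "error message saying that I have exceeded the number of attempts to "
           "enter the correct verification code.")
tools, escalate = route(message)
print(tools, escalate)  # ['check_account_status'] False
```

In this sketch the router picks a plausible tool but never reads a tool observation, so an account state such as `locked=True` cannot change its decision. That is one way a keyword rule can match on tool names while missing escalations.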
| Metric | Agent | Baseline |
|---|---|---|
| Escalation recall (gating metric: under-escalation is the asymmetric harm) | 0.78 | 0.11 |
| Escalation precision | 0.88 | 1.00 |
| Escalation F1 | 0.82 | 0.20 |
| Tool-selection F1 | 0.81 | 0.81 |
36 real BrownBox conversations · DeepSeek deepseek-chat · human-verified gold · 2026-06-03. Baseline = a non-agentic keyword router on the same conversations and scorer.
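The escalation rows of the table can be checked with a short calculation. The confusion counts below are an inference, not figures stated on the page: two misses and one over-escalation, with nine cases where gold requires escalation, give 7 true positives, 1 false positive, and 2 false negatives for the agent. For the baseline, 1 true positive, 0 false positives, and 8 false negatives are assumed because they reproduce its reported row. The function name `prf` is hypothetical.

```python
from typing import Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 on the escalate class from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Inferred counts (assumption): see the paragraph above.
agent = prf(tp=7, fp=1, fn=2)
baseline = prf(tp=1, fp=0, fn=8)

for name, (p, r, f) in (("agent", agent), ("baseline", baseline)):
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Under these assumed counts the agent's precision is 7/8, its recall is 7/9, and its F1 is 14/17, which agree with the table to two decimal places.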
Read this first
What this is: the mechanism, shown on a small verified set
This replay shows the working machinery of a bounded support agent on 36 real conversations: how it selects tools, reads tool observations, makes the escalate-or-resolve decision, and is scored against human-verified gold. It is a demonstration of the method, not a measure of production accuracy.
A real deployment is developed against a large corpus of labelled cases (thousands and up). The loop shown here (real request → human-verified correct action → score → tune the prompt and escalation policy) is how the agent is improved against that corpus, with worked examples and policy rules drawn from it.
Raw support conversations are abundant. The scarce, decisive ingredient is gold that is correct for your tools and escalation policy: it is organisation-specific, cannot be downloaded off the shelf, and has to be curated by hand, because historical human resolutions are a weak label, not verified-correct. That curation is the human-verification step shown here at small scale. The bottleneck is policy-specific gold, not conversation volume.
From this very run: That loop is not hypothetical on this page. Running the agent over these real cases surfaced a concrete defect — the model's escalation judgement was never wired into the decision loop — and fixing it moved escalation recall from 0.44 to 0.78 and precision from 0.67 to 0.88. With a larger labelled corpus, more refinements of this kind accumulate.
The numbers here measure the method on 36 conversations; they are not a claim about accuracy at production scale.
Worked trace · one case
How one case runs
The loop on a single conversation: request → tool selection → observation → escalate-or-resolve. This is the machinery; the score above is measured across all 36 cases.
Customer
“Hi, I am facing an issue while logging into my account. I am getting an error message saying that I have exceeded the number of attempts to enter the correct verification code.”
building_context · Parsed customer message
- Sentiment: frustrated
thinking · Selected next action
- Reasoning: The customer is reporting a login issue with an error about exceeding verification code attempts. This is a classic account lockout scenario. The first step to diagnose any login/account issue is to check the account status using the provided email address.
- Tool: check_account_status
- Arguments: email=neha.locked@example.com
- Escalate: no
acting · Called check_account_status
- Tool: check_account_status
- Arguments: email=neha.locked@example.com
observing · Observed tool result
- Tool: check_account_status
- Observation: email=neha.locked@example.com, exists=True, locked=True, deactivated=False, account_created=2025-04-29, verification_status=pending
- Escalate: yes
- Reason: Account is locked — restoring access requires human identity review
escalated · Escalated to human specialist
- Reason: Account is locked — restoring access requires human identity review
- Response to customer: I understand this situation requires additional attention. Let me connect you with a specialist who can better assist you. A team member will reach out to you shortly. Thank you for your patience!
Human-verified gold
Check the account (locked after failed verification attempts) and escalate for a secure, human-reviewed unlock / identity check.
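The steps in the trace above can be sketched as a tiny replay loop. This is a simplified illustration, not the exhibit's code: the tool choice is hard-coded where the real run used the model, the `ACCOUNTS` fixture only mirrors the observation shown in the trace, and `escalation_reason` and `run_case` are hypothetical names. The point it shows is the ordering: the escalation check runs after the observation, so a locked account can change the outcome.

```python
from typing import Dict, List, Optional

# Synthetic fixture mirroring the observation in the trace (illustrative only).
ACCOUNTS: Dict[str, Dict[str, object]] = {
    "neha.locked@example.com": {
        "exists": True,
        "locked": True,
        "deactivated": False,
        "verification_status": "pending",
    },
}


def check_account_status(email: str) -> Dict[str, object]:
    """Offline stand-in for the tool: looks the account up in the fixture."""
    return dict(ACCOUNTS.get(email, {"exists": False}), email=email)


def escalation_reason(observation: Dict[str, object]) -> Optional[str]:
    """Return a reason to hand off, or None if the agent may resolve."""
    if observation.get("locked"):
        return "account locked: needs human identity review"
    if observation.get("deactivated"):
        return "account deactivated: needs human review"
    return None


def run_case(email: str) -> List[Dict[str, object]]:
    """Replay thinking -> acting -> observing -> escalated/resolved for one case."""
    trace: List[Dict[str, object]] = [
        {"step": "thinking", "tool": "check_account_status", "escalate": False},
        {"step": "acting", "tool": "check_account_status"},
    ]
    observation = check_account_status(email)
    reason = escalation_reason(observation)  # decided only after observing
    trace.append({"step": "observing", "observation": observation,
                  "escalate": reason is not None})
    trace.append({"step": "escalated" if reason else "resolved", "reason": reason})
    return trace


for step in run_case("neha.locked@example.com"):
    print(step["step"])
```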
The full set
All 36 cases
27 resolved · 9 escalated · 11 diverge from gold. Every divergence is marked — over- or under-escalation, or a tool over/under-call — not hidden.
Order — Order Delivery Issues
Hi Sarah, I have an issue with my order. I received a BP monitor, but the delivery was not attempted again, and I'm not able to track my order. I'm really frustrated with the service.
Order — Product Installation
Hi Sarah, I received my DSLR camera yesterday, and I'm having trouble installing it. Can you help me with the installation process?
Order — Order Delivery Issues
Hi Sarah, I'm having issues with the delivery of my air conditioner. It was supposed to be delivered yesterday, but the delivery person didn't show up.
Cancellations and returns — Pickup and Shipping
Hi Alex. I recently received an email from BrownBox requesting me to ship back the computer monitor I purchased last week. Can you please tell me why I am being asked to ship it back?
Cancellations and returns — Replacement and Return Process
Hi Sarah, I am calling because I am unable to click the 'Cancel' button for my Juicer/Mixer/Grinder that I recently purchased from your website.
Cancellations and returns — Cash on Delivery (CoD) Refunds
Hi, I'm calling about my order for a water purifier. I received it yesterday, but it's not working correctly. I want to return it and get a refund.
Login and Account — Mobile Number and Email Verification
Hi Tom, I'm trying to log in to my account to purchase an Oven Toaster Grill (OTG), but I'm unable to proceed as it's asking for mobile number or email verification. Can you help me with that?
Login and Account — Login Issues and Error Messages
Hi, I am facing an issue while logging into my account. I am getting an error message saying that I have exceeded the number of attempts to enter the correct verification code.
Login and Account — Account Reactivation and Deactivation
Hi Alex. I'm trying to reactivate my account to purchase a Wet Grinder, but it's not letting me log in.
Shopping — Product Availability and Status
Hi Rachel, I was browsing your website for a pair of shoes, and I noticed that the one I want is currently out of stock. Can I still order it?
Warranty — Lost or Missing Warranty Card
Hi Sarah, I am calling regarding my refrigerator which I bought from BrownBox last year. I seem to have lost the warranty card, and I need to claim warranty for it. Can you help me with that?
Warranty — Start Date of Warranty
Hello, I purchased a Wet Grinder from BrownBox, and I'm not sure when the warranty starts. Can you help me with that?
Order — Order Delivery Issues
Hi Rachel. I am calling to inquire about the delivery of my refrigerator. It was supposed to be delivered last week, but it hasn't arrived yet.
Order — Invoice and Payment
Hi, I'm calling because I have a billing issue with my recent purchase.
Order — Order Delivery Issues
Hi Sarah, I'm John. I have an issue with my recent order.
Order — Miscellaneous
Hello, I would like to inquire about the estimated delivery time for my coffee maker order.
Order — Order Delivery Issues
Hi, my name is John, and I have a problem with my order. I ordered an air conditioner a week ago, and it hasn't been delivered yet.
Shipping — Availability of Faster Delivery Options
Hello, I was trying to place an order for baby food on your website, and it seems like I can't get it delivered to my location.
Cancellations and returns — Order Cancellation
Hi, I recently placed an order for a smartwatch on your website, but I need to cancel it. Can you help me with that?
Cancellations and returns — Return and Exchange
Hi Sarah, I recently received a food processor from BrownBox, but it's not working correctly. I would like to return or exchange it.
Cancellations and returns — Replacement and Return Process
Hi, I recently received an Oven Toaster Grills (OTG) from BrownBox, but it's not working. I need to replace it as soon as possible.
Cancellations and returns — Return and Exchange
Hi Sarah, my name is Lisa. I want to cancel my order for a sandwich maker that I purchased from your website.
Cancellations and returns — Return and Exchange
Hi, I received a wrong item in my order. I ordered a ceiling fan, but I received a table fan instead.
Cancellations and returns — Pickup and Shipping
Hi Sarah, I'm calling to change the address for the item pick-up of my Pram/Stroller.
Login and Account — Mobile Number and Email Verification
Hi Sarah, this is Jane. I have been trying to verify my mobile number and email address to place an order for a Kitchen Chimney, but I am not receiving the OTP or verification code.
Login and Account — Login Issues and Error Messages
Hi Sarah, this is John. I am having trouble logging into my account.
Login and Account — Mobile Number and Email Verification
Hi, I need to change the email ID linked to my account for my Coffee Maker.
Login and Account — Mobile Number and Email Verification
Hi Rachel. I'm trying to place an order on your website, but I'm having trouble with the mobile number and email verification process. Can you help me with that?
Login and Account — Account Reactivation and Deactivation
Hi Mark, I am trying to reactivate my account to purchase a television, but it's not working.
Login and Account — Login Methods
Hi, Alex. I am trying to sign up for a water purifier on your website, but I am having trouble with the login process. I am an international customer and I don't have a US phone number to receive the verification code. Can you help me with that?
Warranty — Start Date of Warranty
Hi John, this is Tom. I have a question regarding the warranty for an inverter I purchased from your website.
Shopping — Account and Shopping
Hi, I have a query regarding my account.
Warranty — Accessing Warranty Details
Hi Sarah, I recently purchased a DSLR camera from your website, and I want to know how to access the warranty details.
Warranty — Warranty Terms and Changes
Hi, I have a question about the warranty terms for a microwave oven I purchased from BrownBox.
Shopping — Pricing and Discounts
Hi Sarah, I am calling to inquire about the exchange offer for a toy that I recently purchased from BrownBox.
Shopping — Book Pricing Discrepancies
Hi John, my name is Alex. I have noticed some discrepancies in book pricing on your website. There are different prices for books with the same title, and I'm not sure why.
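The four kinds of divergence marked in the case grid (over- or under-escalation, tool over- or under-call) can be derived by comparing each case's gold with the agent's actual behaviour. The sketch below is one way to do that under assumed inputs: the function name `divergences` and its argument shapes are hypothetical, not the exhibit's scorer.

```python
from typing import List, Set


def divergences(gold_tools: Set[str], actual_tools: Set[str],
                gold_escalate: bool, actual_escalate: bool) -> List[str]:
    """List every way one case departs from its human-verified gold."""
    marks: List[str] = []
    if actual_escalate and not gold_escalate:
        marks.append("over-escalation")
    if gold_escalate and not actual_escalate:
        marks.append("under-escalation")
    if actual_tools - gold_tools:
        marks.append("tool over-call")   # called a tool gold did not require
    if gold_tools - actual_tools:
        marks.append("tool under-call")  # skipped a tool gold required
    return marks


# A case that calls the right tool but fails to hand off:
print(divergences({"check_account_status"}, {"check_account_status"}, True, False))
# ['under-escalation']
```

An empty list means the case matches gold on both tool selection and the escalation decision.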
No overclaim
What's real vs simulated
Every component, stated plainly. The evidence above runs on the 'real' rows; the replay below is bounded by the rest.
Verbatim first customer turns from the public BrownBox dataset (rjac/e-commerce-customer-support-qa) — real support phrasing, not written to route cleanly. No private data.
DeepSeek deepseek-chat: a real model, run offline once; the trace below is the frozen output of that run.
Seeded fixture catalog (sim/fixtures.py): consistent synthetic orders, accounts, and charges with edge cases. A labeled sandbox — no production systems, no hash-fabricated values.
Human-verified gold (sim/gold_labels.json): required tools + escalation requirement per case.
36 real conversations — an illustrative walkthrough scored offline, not the page's evidence. The evidence is the 36-conversation canonical evaluation.
No private customer data; account/order identifiers are synthetic fixtures.
Nothing runs live at render time: the website replays a frozen trace, with no network, no production systems, and no live tool execution.
By design
The boundary
This is an illustrative, public-safe replay over real inputs — not a production system and not a leaderboard. The agent escalates the security/identity (locked, deactivated) states, the >$100 refund, and the out-of-scope action requests; the residual misses are a high-value refund it never acts on (so the guardrail never fires) and one borderline verification-limit case — shown in the case grid, not hidden.
Static replay from an offline evaluation. No production systems are accessed, no tools are executed, and no private data is used during rendering.
Why we automate selectively and escalate the rest: When do AI agents actually pay off?
Build a bounded support agent for your tools
The decisive ingredient is gold that's correct for your tools and escalation policy. We design the tool surface, the escalation boundary, and the human-verified evaluation loop before the first agent ships.