
How to measure AI agent quality in commerce

How to measure AI agent quality without fooling yourself: why deflection is a vanity metric, the three numbers that matter, and how to read them together.

Updated June 13, 2026 · 5 min read

You can make an AI agent’s headline number go up by making the agent worse. That one sentence is why measurement is the part of running an agent most teams get wrong. If you reward “conversations closed without a human,” an agent that stonewalls customers until they give up scores beautifully — and your business quietly bleeds. This guide is about measuring what actually matters, so the number on your dashboard and the experience of your customers point the same way. It’s the measurement half of the pillar: AI agents for commerce.

Why deflection is the metric to distrust

Deflection rate — sometimes called containment — is the share of conversations the agent handled with no human involved. It’s seductive because it’s easy to measure and it always feels like progress. But it measures the wrong thing: the absence of a human, not the presence of a solution. A customer who gets a vague non-answer, gives up, and closes the chat counts as a win by this metric. That’s not resolution; it’s surrender, recorded as success.

The field has a name for this: containment masquerading as resolution. When deflection climbs while satisfaction falls and customers re-contact you through another channel, the agent isn’t solving problems; it’s deflecting them faster. A useful discipline from teams who measure this seriously: a high containment rate only counts if resolution accuracy stays very high. When it doesn’t, you’re frustrating customers more efficiently, which is the opposite of the goal. This is also why we reject deflection as the headline KPI entirely: a commerce agent’s job isn’t to keep customers away from your team. It’s to resolve, remember, and sell.
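To make the gap concrete, here is a minimal sketch. It assumes a hypothetical conversation record with three boolean fields (`handed_to_human`, `issue_solved`, `recontacted`); the names are illustrative and not taken from any real product schema. It shows how abandoned chats raise deflection while contributing nothing to resolution.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    # Hypothetical fields for illustration only.
    handed_to_human: bool
    issue_solved: bool
    recontacted: bool  # customer came back about the same issue


def deflection_rate(convs):
    """Share of conversations that ended with no human involved."""
    return sum(not c.handed_to_human for c in convs) / len(convs)


def agent_resolution_rate(convs):
    """Share the agent handled alone, solved, and that stayed solved."""
    return sum(
        not c.handed_to_human and c.issue_solved and not c.recontacted
        for c in convs
    ) / len(convs)


convs = [
    Conversation(False, True, False),   # solved by the agent
    Conversation(False, False, True),   # customer gave up, came back later
    Conversation(False, False, False),  # customer gave up for good
    Conversation(True, True, False),    # escalated, solved by a human
]

print(deflection_rate(convs))        # 0.75
print(agent_resolution_rate(convs))  # 0.25
```

Three of the four chats count as "deflected", yet the agent solved only one of them. The two give-ups are exactly the conversations a deflection-only dashboard records as wins.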

The three numbers that actually matter

Replace the single flattering number with three that, together, are hard to fool:

| Metric | What it asks | The trap if read alone |
| --- | --- | --- |
| Resolution rate | Was the issue actually solved? | Can be inflated by closing tickets the customer didn’t feel were solved |
| CSAT | Was the customer satisfied? | High on easy questions can mask failures on hard ones |
| Conversion | Did it move toward a cart, order, or lead? | Strong sales can hide poor service, and vice versa |

None of these is sufficient by itself — each has a failure mode the others catch. Treat them as an inseparable set, not a leaderboard where one wins.
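As a rough sketch of computing the set from the same data, the snippet below assumes a hypothetical `Chat` record and treats CSAT as the share of answered surveys scoring 4 or 5 on a 1-5 scale. That convention, like the field names, is an assumption made for illustration; use whatever definition your survey tool applies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chat:
    # Hypothetical record; field names are assumptions for this sketch.
    resolved: bool
    csat: Optional[int]     # 1-5 survey score, None if the survey was skipped
    outcome: Optional[str]  # "cart", "order", "lead", or None


def core_metrics(chats, satisfied_at=4):
    """Return resolution rate, CSAT, and conversion as fractions."""
    if not chats:
        raise ValueError("need at least one conversation")
    n = len(chats)
    rated = [c.csat for c in chats if c.csat is not None]
    return {
        "resolution": sum(c.resolved for c in chats) / n,
        # Share of answered surveys at or above the satisfied threshold.
        "csat": sum(s >= satisfied_at for s in rated) / len(rated) if rated else None,
        "conversion": sum(c.outcome is not None for c in chats) / n,
    }


chats = [
    Chat(True, 5, "order"),
    Chat(True, 2, None),
    Chat(False, 3, None),
    Chat(True, 4, "cart"),
]
print(core_metrics(chats))
# {'resolution': 0.75, 'csat': 0.5, 'conversion': 0.5}
```

Returning all three in one dictionary is deliberate: it makes it awkward to report one number without the other two.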

Read them together: the diagnostic patterns

The diagnosis almost never lives in a single number. It lives in how the three move relative to each other. A few patterns worth recognizing on sight:

| Pattern | What it means | Where to look |
| --- | --- | --- |
| Resolution ↑, CSAT ↓ | Deflecting, not resolving | Knowledge base + escalation rules |
| Resolution ↓, CSAT steady | Handing off too much; can’t act | Skills the agent has switched on |
| CSAT ↑, conversion flat | Helpful but not selling | Product Recommendation + Follow-up |
| Resolution ↑, CSAT ↑, re-contacts ↑ | Closing conversations too early | How you define “resolved” |

This is the real reason to measure the set rather than the star metric: the relationships between the numbers are where the actionable truth is. A lone resolution rate of 80% tells you almost nothing; an 80% resolution rate with sliding CSAT tells you exactly what to fix.
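The pattern table can be sketched as a small lookup over period-to-period changes. Everything here is an assumption made for illustration: the snapshot keys, the 0.01 tolerance that separates "flat" from a real move, and the choice to return every matching row rather than only the first.

```python
def direction(delta, tolerance=0.01):
    """Classify a change as 'up', 'down', or 'flat' within a tolerance."""
    if delta > tolerance:
        return "up"
    if delta < -tolerance:
        return "down"
    return "flat"


# (resolution, csat, conversion, recontacts) -> meaning, where to look.
# None means "any direction". Rows mirror the pattern table above.
PATTERNS = [
    (("up", "up", None, "up"), "Closing conversations too early", "How you define 'resolved'"),
    (("up", "down", None, None), "Deflecting, not resolving", "Knowledge base + escalation rules"),
    (("down", "flat", None, None), "Handing off too much; can't act", "Skills the agent has switched on"),
    ((None, "up", "flat", None), "Helpful but not selling", "Product Recommendation + Follow-up"),
]


def diagnose(previous, current):
    """Compare two metric snapshots and return every matching pattern."""
    keys = ("resolution", "csat", "conversion", "recontacts")
    moves = tuple(direction(current[k] - previous[k]) for k in keys)
    return [
        (meaning, look)
        for want, meaning, look in PATTERNS
        if all(w is None or w == m for w, m in zip(want, moves))
    ]


last_week = {"resolution": 0.74, "csat": 0.82, "conversion": 0.10, "recontacts": 0.08}
this_week = {"resolution": 0.80, "csat": 0.76, "conversion": 0.10, "recontacts": 0.08}
print(diagnose(last_week, this_week))
# [('Deflecting, not resolving', 'Knowledge base + escalation rules')]
```

The tolerance matters more than it looks: set it too tight and ordinary weekly noise reads as a pattern, so tune it to your conversation volume.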

The commerce metric most teams forget: conversion

Support teams instinctively track resolution and CSAT. The metric that gets dropped — and the one that separates a deflection tool from a commerce agent — is conversion: replies that turn into carts, orders the agent helped build, or qualified leads captured. A commerce agent isn’t only there to answer “where’s my order?”; it’s there to recommend the right product, build the cart, recover the abandoned one, and follow up. If resolution and CSAT look healthy but conversion is flat, your agent is a polite FAQ — answering well, but not doing the commercial work it exists for. That’s a signal to look at the recommendation and follow-up skills, not the knowledge base.

Building an agent scorecard

Turn the three metrics into a habit, not a quarterly surprise:

  • Weekly: the core set — resolution, CSAT, conversion — plus the escalation health numbers (post-handoff CSAT, re-escalation). Weekly cadence catches drift while it’s still cheap to fix.
  • Monthly: read actual transcripts, not just charts. Numbers tell you that something moved; conversations tell you why. Use the monthly read to tune knowledge, skills, guardrails, and escalation.
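A weekly scorecard can be as simple as grouping conversations by ISO week. The sketch below assumes each conversation has already been reduced to a row of `(day, resolved, satisfied, converted)`; that layout is hypothetical, and the escalation health numbers are left out for brevity.

```python
from collections import defaultdict
from datetime import date


def weekly_scorecard(rows):
    """Group conversation rows by ISO week and compute the core set per week.

    Each row is (day, resolved, satisfied, converted); the three flags are
    booleans. This layout is an assumption made for the sketch.
    """
    buckets = defaultdict(list)
    for day, resolved, satisfied, converted in rows:
        iso_year, iso_week, _ = day.isocalendar()
        buckets[(iso_year, iso_week)].append((resolved, satisfied, converted))

    card = {}
    for week, flags in sorted(buckets.items()):
        n = len(flags)
        card[week] = {
            "conversations": n,
            "resolution": sum(f[0] for f in flags) / n,
            "csat": sum(f[1] for f in flags) / n,
            "conversion": sum(f[2] for f in flags) / n,
        }
    return card


rows = [
    (date(2026, 6, 1), True, True, False),
    (date(2026, 6, 3), True, False, True),
    (date(2026, 6, 8), False, False, False),
    (date(2026, 6, 10), True, True, False),
]
for week, numbers in weekly_scorecard(rows).items():
    print(week, numbers)
```

Keeping the conversation count next to each rate is a small guard against over-reading a week with very little traffic.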

On bitbybit, the raw material is already there: the agent can collect a CSAT score right after it wraps a conversation, and every interaction is written to the bitCRM record — so quality is something you watch and improve, not guess at.

How to act on each signal

Metrics are only worth collecting if they change what you do. The mapping is direct:

  • CSAT slipping → tighten the knowledge base and the escalation rules; the agent is answering wrong or holding on too long.
  • Resolution low → check which skills are on and whether the agent can actually act (create an order, track one) rather than just talk.
  • Conversion flat → the recommendation and follow-up skills need work; the agent is helpful but passive.
  • Deflection up but CSAT down → stop celebrating. Re-read transcripts; the agent is ending conversations it isn’t solving.

Measured this way, an AI agent stops being a black box you hope is working and becomes a system you steer. And because every conversation enriches the record the next one draws on, a well-measured agent gets better in the direction you point it — not by magic, but because better context makes a better agent. Start from the top with the pillar, AI agents for commerce, or see what the agent does with each chat in AI Studio.

Frequently asked questions

What is the most important AI agent metric?

There isn't one — and that's the point. The three that matter are resolution rate (was the issue actually solved?), CSAT (was the customer satisfied?), and, for commerce, conversion (did the conversation lead to a cart, order, or qualified lead?). Read alone, each can mislead; read together, they tell you the truth. The metric to be most suspicious of is deflection or containment on its own, because you can inflate it just by being unhelpful enough that customers give up.

Why is deflection rate a vanity metric?

Because it measures that a human didn't get involved — not that the customer was helped. A high deflection rate paired with falling CSAT and rising re-contacts means the agent is ending conversations without solving them: containment masquerading as resolution. A useful rule of thumb from the field: a high containment rate only counts if resolution accuracy stays very high; otherwise you're just frustrating customers faster. Track deflection if you like, but never lead with it.

How do I read resolution, CSAT, and conversion together?

Look at how they move relative to each other. Resolution up while CSAT drops means the agent is closing conversations the customer didn't feel were solved: deflection. CSAT high but conversion flat means it's pleasant but not doing commerce work, so recommendation and follow-up need attention. Resolution and CSAT both up while re-contacts rise means it's closing too early. A single number rarely tells the whole story on its own; the relationships between them are where the diagnosis is.

What conversion metric should a commerce agent track?

The commercial outcome the conversation was for: replies that become carts, completed orders the agent helped create, or qualified leads captured (for example from a click-to-WhatsApp ad). This is the metric that separates a support deflection tool from a commerce agent. If resolution and CSAT look good but conversion is flat, the agent is answering questions well but not recommending, building carts, or following up — which is where a commerce agent earns its keep.

How often should I review AI agent metrics?

Weekly is a healthy rhythm for the core set, with a deeper monthly read against real transcripts. Weekly catches drift early — a knowledge-base gap, a skill that's underperforming — while there's still time to fix it before it shows up as a trend. The monthly review is where you read whole conversations, not just numbers, and adjust the agent's knowledge, skills, guardrails, and escalation rules. Metrics tell you something changed; transcripts tell you why.

Last reviewed: June 13, 2026