
How to measure AI agent quality in commerce

How to measure AI agent quality without fooling yourself: why deflection is a vanity metric, the three numbers that matter, and how to read them together.

By Dana Dewany, Product Marketing, bitbybit. Updated June 13, 2026.

In brief

Most AI agent dashboards lead with the wrong number. Deflection or containment rate — the share of conversations closed without a human — feels like progress, but on its own it's a vanity metric: you can drive it up simply by being worse, leaving customers stuck with a bot rather than helping them. The honest picture takes three numbers read together. Resolution rate asks whether the issue was actually solved. CSAT asks whether the customer was satisfied. Conversion asks whether the conversation moved someone toward a cart, an order, or a qualified lead — the part a commerce agent exists for and most teams forget to track. The diagnosis lives in how they move together: resolution up while CSAT falls means you're deflecting, not resolving; high CSAT with flat conversion means the agent is helpful but not selling. Pick outcome metrics over activity metrics, review them weekly as a set, and act on the pattern — not on any single number that flatters you.

You can make an AI agent’s headline number go up by making the agent worse. That one sentence is why measurement is the part of running an agent most teams get wrong. If you reward “conversations closed without a human,” an agent that stonewalls customers until they give up scores beautifully — and your business quietly bleeds. This guide is about measuring what actually matters, so the number on your dashboard and the experience of your customers point the same way. It’s the measurement half of the pillar: AI agents for commerce.

Why deflection is the metric to distrust

Deflection rate — sometimes called containment — is the share of conversations the agent handled with no human involved. It’s seductive because it’s easy to measure and it always feels like progress. But it measures the wrong thing: the absence of a human, not the presence of a solution. A customer who gets a vague non-answer, gives up, and closes the chat counts as a win by this metric. That’s not resolution; it’s surrender, recorded as success.

The field has a name for this: containment masquerading as resolution. When deflection climbs while satisfaction falls and customers re-contact you through another channel, the agent isn’t solving problems; it’s deflecting them faster. A useful discipline from teams who measure this seriously: a high containment rate only counts if resolution accuracy stays very high. Once accuracy slips, you’re frustrating customers more efficiently, which is the opposite of the goal. This is also why we reject deflection as the headline KPI entirely: a commerce agent’s job isn’t to keep customers away from your team. It’s to resolve, remember, and sell.
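To make the gap concrete, here is a minimal Python sketch. The `Conversation` record and its two fields are hypothetical, not a bitbybit schema, and the sketch assumes you have some way to label whether an issue was genuinely solved (a transcript review or a customer confirmation, for example).

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    # Hypothetical fields for illustration only.
    handed_to_human: bool  # a human agent joined the conversation
    issue_solved: bool     # assumed label: the issue was confirmed solved


def deflection_rate(convs):
    """Share of conversations that ended with no human involved."""
    if not convs:
        return 0.0
    return sum(not c.handed_to_human for c in convs) / len(convs)


def resolution_rate(convs):
    """Share of conversations where the issue was actually solved."""
    if not convs:
        return 0.0
    return sum(c.issue_solved for c in convs) / len(convs)


def unsolved_share_of_deflected(convs):
    """Among deflected conversations, the share that were never solved."""
    deflected = [c for c in convs if not c.handed_to_human]
    if not deflected:
        return 0.0
    return sum(not c.issue_solved for c in deflected) / len(deflected)


sample = [
    Conversation(handed_to_human=False, issue_solved=True),
    Conversation(handed_to_human=False, issue_solved=False),  # customer gave up
    Conversation(handed_to_human=False, issue_solved=False),  # customer gave up
    Conversation(handed_to_human=True, issue_solved=True),
]

print(deflection_rate(sample))   # 0.75
print(resolution_rate(sample))   # 0.5
print(unsolved_share_of_deflected(sample))
```

In this toy sample the dashboard would report 75% deflection, yet only half the issues were solved and two of the three deflected chats ended with the customer giving up.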

The three numbers that actually matter

Replace the single flattering number with three that, together, are hard to fool:

Metric	What it asks	The trap if read alone
Resolution rate	Was the issue actually solved?	Can be inflated by closing tickets the customer didn’t feel were solved
CSAT	Was the customer satisfied?	High on easy questions can mask failures on hard ones
Conversion	Did it move toward a cart, order, or lead?	Strong sales can hide poor service, and vice versa

None of these is sufficient by itself — each has a failure mode the others catch. Treat them as an inseparable set, not a leaderboard where one wins.
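As a rough sketch of computing the set side by side, the Python below assumes a hypothetical `ConversationRecord` with one field per metric. Two conventions here are assumptions rather than anything the guide specifies: CSAT is taken as the share of 4s and 5s on a 1-to-5 survey, counted only over customers who answered, and conversion is the share of conversations that ended in a cart, an order, or a lead.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed outcome labels for illustration.
COMMERCE_OUTCOMES = {"cart", "order", "lead"}


@dataclass
class ConversationRecord:
    # Hypothetical fields, not a real product schema.
    issue_solved: bool
    csat_score: Optional[int]  # 1-5 survey answer, None if the survey was skipped
    outcome: Optional[str]     # "cart", "order", "lead", or None


def core_metrics(records):
    """Return resolution, CSAT, and conversion as fractions between 0 and 1."""
    total = len(records)
    rated = [r.csat_score for r in records if r.csat_score is not None]
    return {
        "resolution": sum(r.issue_solved for r in records) / total if total else 0.0,
        # Assumed convention: a score of 4 or 5 counts as satisfied.
        "csat": sum(s >= 4 for s in rated) / len(rated) if rated else 0.0,
        "conversion": (
            sum(r.outcome in COMMERCE_OUTCOMES for r in records) / total
            if total
            else 0.0
        ),
    }


records = [
    ConversationRecord(issue_solved=True, csat_score=5, outcome="order"),
    ConversationRecord(issue_solved=True, csat_score=4, outcome=None),
    ConversationRecord(issue_solved=False, csat_score=2, outcome=None),
    ConversationRecord(issue_solved=True, csat_score=None, outcome="cart"),
]

scores = core_metrics(records)
print(scores["resolution"])  # 0.75
print(scores["conversion"])  # 0.5
print(scores["csat"])
```

Returning all three from one function is deliberate: it makes it awkward to report any one of them without the other two.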

Read them together: the diagnostic patterns

The diagnosis almost never lives in a single number. It lives in how the three move relative to each other. A few patterns worth recognizing on sight:

Pattern	What it means	Where to look
Resolution ↑, CSAT ↓	Deflecting, not resolving	Knowledge base + escalation rules
Resolution ↓, CSAT steady	Handing off too much; can’t act	Skills the agent has switched on
CSAT ↑, conversion flat	Helpful but not selling	Product Recommendation + Follow-up
Resolution ↑, CSAT ↑, re-contacts ↑	Closing conversations too early	How you define “resolved”

This is the real reason to measure the set rather than the star metric: the relationships between the numbers are where the actionable truth is. A lone resolution rate of 80% tells you almost nothing; an 80% resolution rate with sliding CSAT tells you exactly what to fix.
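The pattern table can be sketched as a small rule function. This is an illustration under stated assumptions, not a bitbybit feature: the inputs are week-over-week changes expressed as fractions, and the `tolerance` that separates "flat" from a real move is an arbitrary placeholder you would tune to your own volume.

```python
def direction(delta, tolerance=0.01):
    """Classify a week-over-week change as 'up', 'down', or 'flat'."""
    if delta > tolerance:
        return "up"
    if delta < -tolerance:
        return "down"
    return "flat"


def diagnose(resolution_delta, csat_delta, conversion_delta, recontact_delta):
    """Return every pattern from the table that matches the four changes."""
    res = direction(resolution_delta)
    csat = direction(csat_delta)
    conv = direction(conversion_delta)
    recontact = direction(recontact_delta)

    findings = []
    if res == "up" and csat == "down":
        findings.append(
            "Deflecting, not resolving: check knowledge base and escalation rules"
        )
    if res == "down" and csat == "flat":
        findings.append(
            "Handing off too much: check which skills the agent has switched on"
        )
    if csat == "up" and conv == "flat":
        findings.append(
            "Helpful but not selling: check Product Recommendation and Follow-up"
        )
    if res == "up" and csat == "up" and recontact == "up":
        findings.append(
            "Closing conversations too early: check how 'resolved' is defined"
        )
    return findings


# Resolution rose 5 points while CSAT fell 4 points.
for finding in diagnose(0.05, -0.04, 0.0, 0.0):
    print(finding)
```

The function returns a list rather than a single verdict because patterns can overlap: rising resolution and CSAT with flat conversion and rising re-contacts matches two rows at once.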

The commerce metric most teams forget: conversion

Support teams instinctively track resolution and CSAT. The metric that gets dropped — and the one that separates a deflection tool from a commerce agent — is conversion: replies that turn into carts, orders the agent helped build, or qualified leads captured. A commerce agent isn’t only there to answer “where’s my order?”; it’s there to recommend the right product, build the cart, recover the abandoned one, and follow up. If resolution and CSAT look healthy but conversion is flat, your agent is a polite FAQ — answering well, but not doing the commercial work it exists for. That’s a signal to look at the recommendation and follow-up skills, not the knowledge base.

Building an agent scorecard

Turn the three metrics into a habit, not a quarterly surprise:

Weekly: the core set — resolution, CSAT, conversion — plus the escalation health numbers (post-handoff CSAT, re-escalation). Weekly cadence catches drift while it’s still cheap to fix.
Monthly: read actual transcripts, not just charts. Numbers tell you that something moved; conversations tell you why. Use the monthly read to tune knowledge, skills, guardrails, and escalation.
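One way to get the weekly view, sketched in Python under simple assumptions: each record is a hypothetical `(day, issue_solved)` pair, weeks follow the ISO calendar, and only the resolution rate is shown. The same grouping would apply to CSAT and conversion.

```python
from collections import defaultdict
from datetime import date


def weekly_rates(records):
    """Group (day, issue_solved) pairs by ISO week; return the resolution rate per week."""
    buckets = defaultdict(list)
    for day, solved in records:
        iso = day.isocalendar()  # (ISO year, ISO week number, weekday)
        buckets[(iso[0], iso[1])].append(solved)
    # Sort by (year, week) so the result reads oldest to newest.
    return {week: sum(flags) / len(flags) for week, flags in sorted(buckets.items())}


def week_over_week(rates):
    """Change in rate from each week to the next, oldest first."""
    values = list(rates.values())
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Illustrative data only: two conversations in each of two consecutive weeks.
records = [
    (date(2026, 6, 1), True),
    (date(2026, 6, 3), False),
    (date(2026, 6, 8), True),
    (date(2026, 6, 10), True),
]

rates = weekly_rates(records)
print(list(rates.values()))   # [0.5, 1.0]
print(week_over_week(rates))  # [0.5]
```

The week-over-week deltas are the drift signal: a sustained negative run is the cue to pull transcripts before the monthly review.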

On bitbybit, the raw material is already there: the agent can collect a CSAT score right after it wraps a conversation, and every interaction is written to the bitCRM record — so quality is something you watch and improve, not guess at.

How to act on each signal

Metrics are only worth collecting if they change what you do. The mapping is direct:

CSAT slipping → tighten the knowledge base and the escalation rules; the agent is answering wrong or holding on too long.
Resolution low → check which skills are on and whether the agent can actually act (create an order, track one) rather than just talk.
Conversion flat → the recommendation and follow-up skills need work; the agent is helpful but passive.
Deflection up but CSAT down → stop celebrating. Re-read transcripts; the agent is ending conversations it isn’t solving.

Measured this way, an AI agent stops being a black box you hope is working and becomes a system you steer. And because every conversation enriches the record the next one draws on, a well-measured agent gets better in the direction you point it — not by magic, but because better context makes a better agent. Start from the top with the pillar, AI agents for commerce, or see what the agent does with each chat in AI Studio.

Frequently asked questions

What is the most important AI agent metric?

There isn't one — and that's the point. The three that matter are resolution rate (was the issue actually solved?), CSAT (was the customer satisfied?), and, for commerce, conversion (did the conversation lead to a cart, order, or qualified lead?). Read alone, each can mislead; read together, they tell you the truth. The metric to be most suspicious of is deflection or containment on its own, because you can inflate it just by being unhelpful enough that customers give up.

Why is deflection rate a vanity metric?

Because it measures that a human didn't get involved — not that the customer was helped. A high deflection rate paired with falling CSAT and rising re-contacts means the agent is ending conversations without solving them: containment masquerading as resolution. A useful rule of thumb from the field: a high containment rate only counts if resolution accuracy stays very high; otherwise you're just frustrating customers faster. Track deflection if you like, but never lead with it.

How do I read resolution, CSAT, and conversion together?

Look at how they move relative to each other. Resolution up while CSAT drops means the agent is closing conversations the customer didn't feel were solved: deflection, not resolution. CSAT high but conversion flat means it's pleasant but not doing commerce work, so recommendation and follow-up need attention. Resolution and CSAT both up while re-contacts rise means it's closing too early. A single number rarely gives you the diagnosis; the relationships between them do.

What conversion metric should a commerce agent track?

The commercial outcome the conversation was for: replies that become carts, completed orders the agent helped create, or qualified leads captured (for example from a click-to-WhatsApp ad). This is the metric that separates a support deflection tool from a commerce agent. If resolution and CSAT look good but conversion is flat, the agent is answering questions well but not recommending, building carts, or following up — which is where a commerce agent earns its keep.

How often should I review AI agent metrics?

Weekly is a healthy rhythm for the core set, with a deeper monthly read against real transcripts. Weekly catches drift early — a knowledge-base gap, a skill that's underperforming — while there's still time to fix it before it shows up as a trend. The monthly review is where you read whole conversations, not just numbers, and adjust the agent's knowledge, skills, guardrails, and escalation rules. Metrics tell you something changed; transcripts tell you why.

Last reviewed: June 13, 2026
