Your agent works in the demo because the demo is a lie

Thibaut Bardout
June 17
6 min read

Last updated: 2 days ago

On a fintech engagement last year, I watched a servicing agent tell a customer in financial hardship that it'd paused her payments for 60 days and that her 0% promotional rate was safe. She thanked it. The chat closed with a high CSAT.

What it promised was impossible. The startup was a buy-now-pay-later lender: split a purchase into a few interest-free installments, with some plans running on a promotional 0%. That 0% isn't a gift, it's conditional. The customer keeps it only while they hold to the exact payment schedule they signed up for. Pause the schedule and the condition that earns the 0% is gone. So "I've paused your payments and your 0% is untouched" isn't generous, it's a promise the company's own offer can't honour. (In regulated credit it's a compliance problem too, but the offer terms alone already make it impossible.)

The team had built that agent with real authority. It could reschedule a payment, waive a fee, issue a credit, change a plan. It'd demoed cleanly, it was live, and the support numbers were damn good. They brought me in to help scale it, and before scaling anything I read what it'd been doing in production. That promise wasn't a glitch. It was the agent working exactly as built, on the kind of input a demo never shows you.

That's the thing about demos. A demo proves your agent can do the easy thing on a good day. It tells you almost nothing about what the same agent does under real load, and it doesn't mislead at random. It lies in one direction, for reasons baked into what an agent is.

An agent is a probability engine

Strip the marketing and an AI agent is a probability engine. It doesn't look up the correct action, it predicts the most likely one given everything in front of it. On the inputs you put in a demo, the most likely answer and the correct answer are the same, which is exactly why you chose those inputs. So a demo, by design, only ever shows you the inputs where "most plausible" and "correct" point the same way.

A demo isn't one clean pass of one path. It's a handful of clean passes through the paths you picked. That's the catch: you picked them.

Production is full of the other inputs, the ones where the most plausible answer and the correct one come apart. Look at the hardship case again, because it's tempting to say the agent simply hadn't been taught this one. It had what it needed. Pausing voids the 0%, so the right move was to not promise both, and a decent team could have known that in advance. But to a stressed customer, "yes, I've paused it, and your rate is safe" is the most plausible, most reassuring thing to say, and a probability engine reaches for the most plausible continuation, not the permitted one. The correct answer existed. It just wasn't the likely one.

You can't instruct your way out

This is where most teams reach for the obvious fix. It doesn't hold, and the reason it doesn't is the reason the failure happened.

You anticipate the case, you write "never pause a promotional plan" into the prompt, and you feel covered. You're not. An instruction to a probabilistic system isn't a rule it obeys. It's one more piece of context, weighed against everything else in the conversation: the customer's distress, the brand voice that tells the agent to be generous, the hundreds of past chats where saying yes was right. "Always comply" and "never do X" are priors the model usually honours but quietly overrides in the exact case you wrote them for.

A prompt is a suggestion; the model still decides.

This is also why "just test more scenarios" misses. The failure wasn't that nobody imagined hardship-meets-promo. A decent team imagines plenty, and you should assume they did. The failure is that imagining a case, and even writing a rule against it, doesn't guarantee the agent won't do it anyway. Enumeration doesn't save you either: an agent with 8 actions across a dozen account states and a handful of live promos has more combinations than anyone writes down, so the long tail is effectively infinite. But the sharp point isn't the size of the tail. It's that anticipation isn't enforcement. The cases you did foresee aren't safe, because the thing enforcing your rule is the same thing that guesses.

Why step 10 breaks

There's a second reason a clean demo proves so little, and it's pure arithmetic. A demo is a short pass. Production is a long chain, run thousands of times a day, and reliability doesn't hold across a chain. It compounds down.

Say each step is 95% reliable, which would already be a decent agent. Chain 10 of them, a normal length for a servicing flow:

0.95^10 ≈ 0.60

Make the numbers kinder and it barely moves:

0.98^20 ≈ 0.67

0.98^30 ≈ 0.55

A near-perfect agent is close to a coin flip by step 30. Your demo showed you a step landing correctly, a few times. It said nothing about the chain.
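
The arithmetic above is easy to check for your own flow. This is a minimal sketch, assuming every step succeeds independently and with the same probability, which is the simplification the figures above already make. The name `chain_reliability` is made up for illustration.

```python
def chain_reliability(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumes each step succeeds independently with the same probability,
    so the chain's success rate is per_step raised to the number of steps.
    """
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


if __name__ == "__main__":
    # The three cases quoted in the text.
    for per_step, steps in [(0.95, 10), (0.98, 20), (0.98, 30)]:
        result = chain_reliability(per_step, steps)
        print(f"{per_step:.2f}^{steps} = {result:.2f}")
```

Swap in your own per-step estimate and chain length. Real steps are rarely independent, so treat the result as a rough guide rather than a measurement.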

In my example, that's exactly where it died. The agent misread one thing early, that the promo could survive a pause, and every step after it inherited the mistake as settled fact. It chose an action on a false premise, executed against it, and recorded the outcome as a success. One wrong link, carried clean to the end.

So here's the line worth keeping: the demo tests the step, production tests the chain.

The agent grades its own work

There's a last reason the demo lies, and it's the one that lets the damage run for months. The agent grades its own work. The same system that made the wrong call writes the record of it, so the log says resolved and moves on. Customers were satisfied, because the agent kept handing them outcomes they wanted and weren't owed, which lifts your CSAT rather than denting it. Every dashboard the team watched reported a healthy agent.

Back to the fintech case: the failure didn't surface from product or QA. It surfaced from finance, when reconciliation turned up a pile of credits and payment pauses with no authorised reason behind them. By the time someone asked how big the exposure was, no one could say, because the only account of what the agent had done was written by the agent that got it wrong.

A better model raises the odds, not the guarantee

When I laid this out, the first instinct in the room was the one I hear every time: it's a model problem, the next model will be smarter. And a smarter model is better, genuinely. Give it more capability and the odds it reads the hardship case correctly go up. I won't pretend otherwise.

The trouble is what "better" buys you. It buys a higher probability, and what you needed was a guarantee. Those aren't the same currency. A money-moving action can't run on "almost always": 99% per step is still 0.99^30 ≈ 0.74 over a 30-step chain, and the missing 26% lands in real accounts. You can climb from 95% to 99% to 99.9% and never reach the only number that matters for "never pause a promo plan," which is 100%. That's not a quality you tune up. It's a line a probability doesn't cross.
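
Running the same arithmetic backwards shows how steep that climb is. This is a minimal sketch, again assuming independent, equally reliable steps. `required_per_step` is a hypothetical helper for illustration, not something from the engagement.

```python
def required_per_step(chain_target: float, steps: int) -> float:
    """Per-step reliability needed for a chain of `steps` steps to succeed
    end to end with probability `chain_target`.

    Inverts p ** steps = chain_target, assuming independent steps that
    all share the same reliability p.
    """
    if not 0.0 < chain_target <= 1.0:
        raise ValueError("chain_target must be in (0, 1]")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return chain_target ** (1.0 / steps)


if __name__ == "__main__":
    # How reliable must each of 30 steps be to hit a given end-to-end target?
    for target in (0.99, 0.999):
        p = required_per_step(target, 30)
        print(f"{target:.1%} over 30 steps needs {p:.4%} per step")
```

Under these assumptions, the only per-step value that yields an end-to-end target of 1 is 1 itself. No amount of tuning short of certainty compounds into a guarantee.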

What actually fixes it

The fix isn't a better guesser, it's a different shape, and most of it is old engineering. You give the agent an explicit model of what's actually permitted, so it can't invent a rule that was never real. You put a deterministic check between its intention and anything that moves money, so a plausible answer can't promote itself into an executed one. You keep the agent off the system of record, and you verify the check rather than trust it. The model stays at the edges, reading the customer and drafting the reply. It stops being the thing that decides.
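
One way to picture that shape is a minimal sketch in Python. Everything in it is an assumption made for illustration: the names (`Account`, `Proposal`, `check`, `execute`), the two actions, and the single policy rule are hypothetical, not the rebuild's actual code. The model only produces a `Proposal`. Plain code decides whether it runs, and only that code writes to the ledger and the audit log.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PAUSE_PAYMENTS = "pause_payments"
    WAIVE_FEE = "waive_fee"


@dataclass(frozen=True)
class Account:
    account_id: str
    on_promo_rate: bool


@dataclass(frozen=True)
class Proposal:
    """What the model wants to do: a request, not an executed action."""
    action: Action
    promises_promo_rate_kept: bool = False


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


def check(account: Account, proposal: Proposal) -> Decision:
    """Deterministic policy check: ordinary code, no model call.

    Illustrative rule (an assumption): a pause on a promotional plan
    cannot be combined with a promise that the promotional rate survives.
    """
    if (
        proposal.action is Action.PAUSE_PAYMENTS
        and account.on_promo_rate
        and proposal.promises_promo_rate_kept
    ):
        return Decision(False, "a pause ends the promotional rate; cannot promise both")
    return Decision(True, "permitted")


def execute(account: Account, proposal: Proposal, ledger: list, audit: list) -> Decision:
    """Gate between intention and the system of record.

    The audit entry is written here, by the gate, not by the model,
    and the ledger is only touched when the check passes.
    """
    decision = check(account, proposal)
    audit.append((account.account_id, proposal.action.value, decision.allowed, decision.reason))
    if decision.allowed:
        ledger.append((account.account_id, proposal.action.value))
    return decision


if __name__ == "__main__":
    ledger, audit = [], []
    account = Account("A1", on_promo_rate=True)
    proposal = Proposal(Action.PAUSE_PAYMENTS, promises_promo_rate_kept=True)
    print(execute(account, proposal, ledger, audit))
    print("ledger:", ledger)
```

Because `check` is a pure function of the account and the proposal, it can be unit-tested case by case, which is one way to verify the check rather than trust it.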

How you build that is the subject of the next piece. The short version, from the rebuild: with customers the agent behaved exactly as before, warm and quick, but it had lost the ability to promise what the company couldn't keep, because the decision it kept getting wrong no longer lived in the part of the system that guesses.

Before you ship, look at your demo again and be honest about what it showed you. It showed the agent the questions you already knew to ask, a few clean times, with itself as the judge. None of those conditions hold the moment real customers arrive.

Every agent works in the demo. The real question is whether it can tell the difference between an answer that's plausible and an action that's permitted, and that's the one thing a demo can't test.

Your customers will test it for you.

At PathUp, the gap between plausible and permitted is most of what we're called in to close, usually right after a team's agent looked ready and wasn't. If that's where you're standing, the next piece is the one to read.