Multiply the small numbers
Take a step that works 95 percent of the time. Chain ten into an autonomous task and your success rate is 0.95 to the tenth — about 60 percent. Chain twenty and you are under a coin flip. That is not pessimism. It is multiplication, and it is the arithmetic that turned a year of confident enterprise pilots into a quiet column of cancellations once the demos met the long tail.
We want to be clear, because the builders like to cast us as the nervous committee: we build these systems. We are not against agents. We are against shipping autonomy the system has not earned. The dangerous failure is not the agent that crashes — that one you catch. It is the agent that is wrong with perfect fluency, because fluency is exactly what stops a human from checking. A confident wrong answer is more expensive than an honest error message, and our whole field spent a decade learning that the hard way.
The comparison to a fallible human misses the shape of the risk. A junior analyst’s mistakes are bounded, legible, and slow. An agent’s mistakes are fast, correlated, and replicated across every task it runs at once. One bad pattern does not stay one mistake. It becomes ten thousand before lunch.
Where we concede ground: On narrow, bounded, reversible tasks the bar is being cleared faster than we forecast. We were too gloomy there.
What would change our mind: Audited reliability above the human baseline on open-ended, multi-step tasks, sustained across two model generations.
Read the full synthesis: Can you trust an AI agent to act for you?