What happened
Intuit, the company behind QuickBooks, TurboTax, and MailChimp, has published data from one of the largest real-world AI agent deployments to date. After rolling out specialized AI agents to 3 million customers across sales, tax, payroll, accounting, and project management, the company found that 85% of users came back to use the agents again. The engagement numbers held not because users had no choice, but because the product demonstrably worked.
The agents handle real tasks: closing books, categorizing transactions, running payroll, automating invoice reminders, surfacing discrepancies. Customers report invoices paid in full at a 90% rate, payments arriving five days faster, and 30% less manual work. One customer uncovered significant internal fraud by asking the agent to explain why certain amounts did not add up.
The company's EVP and GM, Marianna Tessel, credits the adoption to one design decision above all: keeping humans accessible. Not product support staff, but actual domain experts: accountants, tax specialists, payroll professionals. AI handles the task up to a point, then a human expert is available to review, confirm, or take over. The platform is also built to actively suggest human involvement in high-stakes scenarios rather than defaulting to autonomous action.
Why it matters for businesses
The AI industry has been noisy about autonomy. The dominant narrative is that the goal is full automation: agents that run without human involvement, approve their own decisions, and operate unsupervised at scale. The Intuit data is a useful corrective. At production scale, across millions of real users dealing with real financial data, the combination of AI and human expertise outperformed AI alone. That is not a failure of the technology. It is evidence about where the technology actually works.
For most enterprise use cases, the goal is not to eliminate humans from the process. The goal is to eliminate the tedious, repetitive, error-prone parts so that the humans can focus on judgment calls, exceptions, and high-value decisions. An AI agent that handles 70% of invoice processing automatically, flags anomalies for human review, and routes edge cases to the right person is more valuable than one that attempts 100% automation and fails on 15% of cases in ways nobody catches.
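The pattern described above can be sketched as a simple routing function. This is a hypothetical illustration, not Intuit's or Laava's actual implementation; the thresholds, field names, and route labels are all assumptions chosen for clarity.

```python
# Hypothetical sketch: route each processed invoice by model confidence.
# Thresholds are illustrative assumptions, not production values.

AUTO_THRESHOLD = 0.95    # above this, process without review
REVIEW_THRESHOLD = 0.70  # between this and AUTO_THRESHOLD, ask a human

def route_invoice(confidence: float, is_anomaly: bool) -> str:
    """Return a routing decision for one processed invoice."""
    if is_anomaly:
        return "human_review"      # anomalies always get a reviewer
    if confidence >= AUTO_THRESHOLD:
        return "auto_process"      # high confidence: straight through
    if confidence >= REVIEW_THRESHOLD:
        return "human_review"      # uncertain: flag for a person
    return "escalate"              # edge case: route to a specialist
```

The point of the sketch is that "100% automation" is a single branch; the value is in the other branches, which decide who sees what.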
There is also a trust dimension that Intuit's data makes concrete. Users who know a human expert is available are more willing to engage with AI for tasks they care about. Trust is not built by demonstrating AI capability in isolation. It is built by demonstrating that the system knows its own limits and routes accordingly. That changes how AI agents should be designed: not as systems that avoid human involvement, but as systems that manage the handoff intelligently.
Laava's perspective
Intuit's findings match what we see in every production deployment we run. The agents that get used are the ones that know when to stop and ask. Shadow mode, which we use in every Laava deployment, is not a temporary measure while the AI builds confidence. It is a permanent feature of responsible agentic architecture. An agent processing purchase orders that encounters an unusual vendor for an unusually large amount should flag it, not approve it. The human approval step is not a bottleneck. It is what makes the system trustworthy enough to handle the other 95% autonomously.
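The purchase-order example above reduces to one rule: the unusual case is flagged, never auto-approved. A minimal sketch, assuming a hypothetical vendor list and amount threshold (both invented for illustration):

```python
# Hypothetical approval gate for purchase orders. Vendor set and
# threshold are illustrative assumptions, not real configuration.

KNOWN_VENDORS = {"Acme Supplies", "Northwind Traders"}
TYPICAL_MAX_AMOUNT = 10_000.00

def decide(vendor: str, amount: float) -> str:
    """Auto-approve only the routine case; flag anything unusual."""
    unusual_vendor = vendor not in KNOWN_VENDORS
    unusual_amount = amount > TYPICAL_MAX_AMOUNT
    if unusual_vendor or unusual_amount:
        return "flag_for_human"  # never auto-approve the unusual case
    return "auto_approve"
```

The gate is deliberately conservative: a false flag costs a reviewer a minute; a false approval at scale costs an incident.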
We see the autonomy-first framing as a sales pitch, not an engineering principle. Fully autonomous agents sound impressive in demos. In production, they create liability. When an agent makes an error at scale, the question is not whether it happened but whether there was a human review step that could have caught it. Enterprises that skip the human-in-loop architecture to ship faster tend to rebuild it after the first incident.
The 85% repeat usage figure matters for a different reason as well. It is a usage metric, not a capability metric. It tells you whether real users in real workflows found the system worth returning to. That is a much harder bar than benchmark performance or demo success rates. Most enterprise AI projects never get measured this way, because they are not deployed to enough users long enough to generate the data. Intuit did. The lesson is worth taking seriously.
What you can do
If you are designing or evaluating an AI agent deployment, ask two concrete questions. First: where does your agent hand off to a human, and is that handoff well-designed or an afterthought? A good handoff includes context: the agent should pass along what it processed, what it was uncertain about, and what action it is recommending, so the human reviewer does not start from scratch. Second: do you have a way to measure repeat engagement over time? If users try the agent once and do not come back, that tells you something important about whether it is solving a real problem.
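The three pieces of handoff context named above can be made concrete as a payload the agent passes to the reviewer. This is a hypothetical data structure, not a real API; the field names are assumptions.

```python
from dataclasses import dataclass

# Hypothetical handoff payload: what the agent processed, what it was
# uncertain about, and what it recommends, so the reviewer does not
# start from scratch.

@dataclass
class Handoff:
    processed: list[str]      # items the agent already handled
    uncertainties: list[str]  # points it was unsure about
    recommendation: str       # the action it proposes

    def summary(self) -> str:
        """One-line brief for the human reviewer."""
        return (f"Processed {len(self.processed)} items; "
                f"{len(self.uncertainties)} open questions; "
                f"recommended action: {self.recommendation}")
```

A reviewer who receives `Handoff(processed=["inv-101", "inv-102"], uncertainties=["vendor name mismatch"], recommendation="hold payment")` can act in seconds instead of re-reading the whole case.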
Laava builds AI agents with shadow mode and human-in-loop handoffs as standard architecture, not optional features. If you are planning a deployment and want to understand how to design the human oversight layer, we are available for a free 90-minute roadmap session.