A study of 27,000 real-world AI conversation transcripts finds that user interaction skill is a decisive determinant of AI outcome quality. Enterprise workers operating in passive mode systematically accumulate failures they never detect.

The paper "A Paradox of AI Fluency" was published April 28, 2026 by Christopher Potts and Moritz Sudhof of Bigspin and Stanford University. It draws on 27,000 transcripts from the WildChat-4.8M dataset — one of the largest publicly available corpora of real-user LLM conversations. Fluency was measured through behavioral annotation: how much users iterated, refined goals mid-session, and evaluated model outputs versus how much they issued single queries and accepted first responses.

Fluent users adopt a collaborative-iterative mode. They treat the conversation as a working session, push back on weak outputs, and steer the model toward better specificity. Novices take a passive stance: one query, one response, session closed. Fluent users also take on more complex, open-ended work where model outputs require genuine evaluation.

Fluent users accumulate more measured failures than novices; on raw failure counts alone, novices look like the stronger group. But the failure types differ structurally. Fluent users experience visible failures: the model produces something wrong or incomplete, and the user recognizes it, pushes back, and often achieves partial or full recovery. Novice failures are invisible: the conversation ends with what appears to be a successful exchange, but the output quietly misses what the user actually needed. No recovery attempt is made because no failure is perceived.
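Note the measurement consequence this taxonomy implies: because an invisible failure leaves no trace in the transcript, classifying it requires a quality judgment from outside the conversation. A minimal sketch, with the hypothetical `met_user_need` flag standing in for that external label (human review or a rubric grader) and a placeholder pushback check:

```python
PUSHBACK_MARKERS = ("that's wrong", "not what i asked", "try again", "instead")

def classify_outcome(user_turns: list[str], met_user_need: bool) -> str:
    """Hypothetical three-way outcome label. `met_user_need` must come from
    outside the transcript (human review or a rubric grader), because an
    invisible failure by definition looks like success in the log."""
    flagged = any(m in turn.lower() for turn in user_turns
                  for m in PUSHBACK_MARKERS)
    if met_user_need:
        return "success"  # includes visible failures the user recovered from
    return "visible_failure" if flagged else "invisible_failure"
```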

FIG. 02 Novice users encounter invisible failures that go undetected; fluent users see and recover from visible failures through iteration.

The invisible-failure dynamic carries sharp implications for enterprise AI deployments. An organization that measures AI tool success by session completion rates or user satisfaction surveys undercounts failure. Novice users, the majority of any large workforce rollout, do not detect model errors they lack the domain depth or interactional fluency to recognize. The result is silent productivity drag: outputs accepted, decisions made, documents shipped, all downstream of an AI response that missed the mark.

Training budgets that focus on prompt templates and model selection are optimizing the wrong variable. The paper argues that encouraging deep engagement — active iteration, goal refinement, output critique — produces better results at scale than optimizing for frictionless UX. Friction-free design optimizes for passive acceptance rather than effective use. Organizations building internal AI tools should consider whether their UX defaults reinforce novice behavior patterns.

For AI product and platform teams, the study reframes the design problem. The authors state that builders are designing not just model behavior but user behavior, and that interfaces should reward engagement loops rather than minimize them. Leaderboards, iteration prompts, visible confidence indicators, and explicit feedback affordances may pull users toward the fluent behavioral mode the study identifies as predictive of quality.

The dataset and annotation code are published at github.com/bigspinai/bigspin-fluency-outcomes, which makes the analysis reproducible at the enterprise level. Organizations with sufficient internal transcript volume can run the same fluency segmentation against their own usage logs. That is the natural next step for any AI program office trying to quantify where training investment pays off.
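As a sketch of what that segmentation could look like, assuming per-conversation iteration counts have already been extracted from the logs (the signal extractor sketched earlier is one hypothetical way to get them), users can be split at the median of their mean iteration counts:

```python
import statistics
from collections import defaultdict

def segment_users(rows: list[tuple[str, int]]) -> dict[str, str]:
    """rows: (user_id, iteration count for one conversation), e.g. produced
    by a signal extractor like the fluency_signals sketch above. Splits
    users at the median of their mean per-conversation iteration counts."""
    per_user: dict[str, list[int]] = defaultdict(list)
    for user_id, iterations in rows:
        per_user[user_id].append(iterations)
    scores = {u: statistics.mean(v) for u, v in per_user.items()}
    cutoff = statistics.median(scores.values())
    return {u: "fluent" if s >= cutoff else "novice"
            for u, s in scores.items()}
```

The median split is a placeholder; in practice the cutoff should be validated against annotated outcomes before the segments are used to target training spend.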
