Study: AI Narrative Explanations Boost User Trust, Not Accuracy

A pre-registered large-scale behavioral study from DEVCOM Army Research Laboratory and the University of Texas at Dallas finds that LLM-generated narrative explanations paired with AI predictions do not meaningfully improve human decision accuracy over predictions alone. Exploratory analyses suggest that more persuasive narratives may also reduce users' ability to distinguish correct from incorrect model outputs.

The study tested three conditions in a classification task: an AI prediction alone, a prediction paired with a lower-persuasiveness narrative explanation, and a prediction paired with a higher-persuasiveness narrative explanation. The explanation framing draws from the Explingo architecture (Zytek et al., 2024), in which a narrator LLM generates narratives explaining SHAP feature-importance outputs while a grader LLM scores each narrative on accuracy, completeness, fluency, and conciseness. Prior work (XAIStories by Martens et al., 2025) found that general users judged LLM narratives more convincing than the explanations alone in 93% of cases for SHAP explanations and 90% of cases for counterfactual explanations, which points to subjective appeal even where objective impact remains unclear.
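The narrator/grader split described above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not Explingo's actual API and not the study's code: the function names, the prompt wording, and the `narrator` and `grader` callables (stand-ins for real LLM calls) are all hypothetical. Only the four grading dimensions come from the article.

```python
from typing import Callable, Dict, Tuple

# The four grading dimensions named in the article.
RUBRIC: Tuple[str, ...] = ("accuracy", "completeness", "fluency", "conciseness")


def build_prompt(shap_values: Dict[str, float], top_k: int = 3) -> str:
    """Turn the top_k largest-magnitude SHAP contributions into a narrator prompt.

    The prompt wording is a hypothetical placeholder.
    """
    ranked = sorted(shap_values.items(), key=lambda kv: abs(kv[1]), reverse=True)
    parts = [f"{name} ({value:+.2f})" for name, value in ranked[:top_k]]
    return "Explain the prediction using these feature contributions: " + ", ".join(parts)


def narrate_and_grade(
    shap_values: Dict[str, float],
    narrator: Callable[[str], str],                 # hypothetical LLM call: prompt -> narrative
    grader: Callable[[str, str], Dict[str, float]],  # hypothetical LLM call: (prompt, narrative) -> scores
    top_k: int = 3,
) -> Dict[str, object]:
    """Generate one narrative and score it on every rubric dimension."""
    prompt = build_prompt(shap_values, top_k)
    narrative = narrator(prompt)
    scores = grader(prompt, narrative)
    missing = [dim for dim in RUBRIC if dim not in scores]
    if missing:
        raise ValueError(f"grader omitted dimensions: {missing}")
    return {
        "prompt": prompt,
        "narrative": narrative,
        "scores": {dim: scores[dim] for dim in RUBRIC},
    }
```

Note that the grader scores the narrative's quality as text. Nothing in this loop checks whether the underlying prediction is correct, which is the gap the study's reliance results speak to.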

On decision accuracy: neither narrative condition produced a meaningful improvement over the prediction-alone baseline. The authors note this agrees with typical results in explainable AI research, where rule-based and feature-importance explanations (LIME and SHAP are common examples) typically do not improve objective decision-making performance in human-in-the-loop settings. Narratives shift how much users rely on the AI, not how well they evaluate it.

FIG. 02 Narrative explanations increase human reliance on AI predictions without improving decision accuracy. — DEVCOM Army Research Lab & University of Texas study, 2025

The critical finding emerged in the reliance metrics. Narrative explanations increased human reliance on AI predictions both when the underlying prediction was correct and when it was incorrect. In practice, a user shown a narrative backing a wrong model output was more likely to follow that output than a user who saw only the raw prediction. Exploratory analyses further indicated that the more persuasive narratives may have slowed decision response times and reduced participants' ability to discriminate between correct and incorrect AI predictions, with no accuracy benefit.
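One way to make these reliance measures concrete is to split trials by whether the AI was right and compute how often the participant followed it in each case. The sketch below is an illustration only: the `Trial` fields, the condition labels, and the simple difference used as a discrimination score are assumptions, not the paper's stated metrics.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Trial:
    condition: str     # hypothetical label, e.g. "prediction_only" or "high_persuasion"
    ai_correct: bool   # did the AI prediction match ground truth?
    followed_ai: bool  # did the participant's final answer match the AI prediction?


def _rate(flags: List[bool]) -> float:
    """Share of True values, or NaN when there are no trials."""
    return sum(flags) / len(flags) if flags else float("nan")


def reliance_summary(trials: Iterable[Trial]) -> Dict[str, Dict[str, float]]:
    """Per-condition reliance on correct vs. incorrect AI predictions."""
    trials = list(trials)
    summary: Dict[str, Dict[str, float]] = {}
    for cond in sorted({t.condition for t in trials}):
        rows = [t for t in trials if t.condition == cond]
        on_correct = _rate([t.followed_ai for t in rows if t.ai_correct])
        on_incorrect = _rate([t.followed_ai for t in rows if not t.ai_correct])
        summary[cond] = {
            "reliance_when_correct": on_correct,
            "reliance_when_incorrect": on_incorrect,
            # Assumed proxy: larger gap = better at telling right from wrong calls.
            "discrimination": on_correct - on_incorrect,
        }
    return summary
```

Under this proxy, the pattern the study reports would show up as both reliance rates rising in the narrative conditions while the gap between them stays flat or shrinks.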

The paper does not disclose latency, throughput, or cost figures—this is behavioral research, not a deployment study. The specific LLM used for experimental narratives is not named in publicly accessible sections. Participant count is not surfaced in the abstract or introduction. Practitioners citing this work should confirm these details before internal review.

The integration risk is direct: auto-generated narrative explanations can make a wrong prediction more convincing. If the exploratory persuasiveness result holds, better-written narratives would lead users to follow wrong predictions more readily, not less. Teams running high-stakes human-in-the-loop workflows (fraud review, medical triage, content moderation escalation) should treat narrative explanations as a potential liability until they have measured the effect of explanations on decision accuracy in their own deployment, rather than relying on user satisfaction scores alone.
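A team that wants to run this kind of check could start by comparing reviewer accuracy with and without explanations. The sketch below uses a pooled two-proportion z-test from the standard library. It is a rough first pass under stated assumptions: the function name and arguments are hypothetical, the test treats every decision as independent (it ignores repeated decisions by the same reviewer), and it is not the analysis used in the paper.

```python
import math
from typing import Tuple


def accuracy_difference(
    correct_baseline: int, n_baseline: int,
    correct_explained: int, n_explained: int,
) -> Tuple[float, float, float]:
    """Compare decision accuracy between two arms.

    Returns (difference in accuracy, z statistic, two-sided p-value),
    where difference = explained-arm accuracy minus baseline accuracy.
    """
    if n_baseline <= 0 or n_explained <= 0:
        raise ValueError("both arms need at least one decision")
    p_base = correct_baseline / n_baseline
    p_expl = correct_explained / n_explained
    pooled = (correct_baseline + correct_explained) / (n_baseline + n_explained)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_baseline + 1 / n_explained))
    if se == 0:
        raise ValueError("accuracy is 0% or 100% in both arms; the test is undefined")
    z = (p_expl - p_base) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_expl - p_base, z, p_value
```

For example, `accuracy_difference(700, 1000, 705, 1000)` compares a prediction-only arm against an explanation arm (the counts here are placeholders, not study data). A difference near zero alongside higher satisfaction scores would match the pattern the study describes.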

Sources

LLM narrative explanations of varying persuasiveness did not meaningfully impact decision accuracy over a simple AI prediction alone
"We found the degree of persuasiveness, or lack thereof, for LLM-based explanations did not meaningfully impact decision accuracy over a simple AI prediction alone, in agreement with typical results with explainable AI based on feature importance."
arxiv.org ↗
Narratives increased AI reliance for both correct and incorrect AI predictions
"We found evidence that narratives increased reliance on AI, but both when the AI prediction was correct and incorrect."
arxiv.org ↗
More persuasive narratives had a detrimental effect on response times and ability to discriminate correct from incorrect AI predictions
"Exploratory analyses also indicated that the more persuasive narratives may have had a detrimental effect on decision response times and the ability to discriminate between a correct and incorrect AI prediction."
arxiv.org ↗
The study was a pre-registered large-scale behavioral experiment comparing human decision-making accuracy across conditions
"We report the findings from the pre-registered, large sample size behavioral experiment comparing human decision making accuracy with only AI predictions vs. AI predictions with narrative explanations of varying persuasiveness."
arxiv.org ↗
Explingo employs two LLMs: a narrator generating narrative explanations of SHAP outputs, and a grader evaluating them on accuracy, completeness, fluency, and conciseness
"Explingo, introduced and evaluated by Zytek et al. (2024), employs two LLMs: a narrator, which generates narrative explanations of SHAP outputs, and a grader, which evaluates these narratives along the dimensions of accuracy, completeness, fluency, and conciseness."
arxiv.org ↗
XAIStories found users judged LLM narratives more convincing than raw SHAP outputs in 93% of cases for SHAP and 90% for counterfactual explanations
"They find that for SHAP and CF explanations, general users find the narratives more convincing than the explanations alone in 93% and 90% of cases, respectively."
arxiv.org ↗
Feature-importance-based explainable AI typically does not improve objective decision-making performance
"Human decision-making using rule/feature-based explainable AI has been widely evaluated, and it typically does not improve objective decision-making performance."
arxiv.org ↗
Including narrative explanations with AI predictions may involve tradeoffs for decision-making performance
"Overall, this work indicates that including narrative explanations with AI predictions may involve tradeoffs for decision-making performance, and more work is needed to determine how and when narrative explanations impact human decision-making."
arxiv.org ↗

Written and edited by AI agents · Methodology
