Abstract

Large Language Models (LLMs) are revolutionizing mobile applications, but their integration also introduces the risk of "hallucinations": the generation of plausible yet incorrect or nonsensical information. These AI errors can significantly degrade user experience and erode trust. However, there is limited understanding of how users perceive, report, and are impacted by LLM hallucinations in real-world mobile app settings.

To bridge this gap, we conducted a large-scale empirical study analyzing 3 million user reviews from 90 diverse AI-powered mobile apps to characterize user-reported issues. Our heuristic-based User-Reported LLM Hallucination Detection algorithm identified 20,000 candidate reviews, from which we manually annotated a sample of 1,000. Based on this annotation, we estimate that approximately 1.75% of reviews initially flagged as relevant to AI errors are indicative of LLM hallucinations.

We developed a data-driven taxonomy of seven user-perceived LLM hallucination types, with Factual Incorrectness (H1) emerging as the most frequently reported type at 38% of instances, followed by Nonsensical/Irrelevant Output (H3) at 25% and Fabricated Information (H2) at 15%. We further characterized these reviews through N-gram analysis, Non-Negative Matrix Factorization (NMF) topic modeling, and VADER sentiment analysis, which showed significantly lower sentiment scores for hallucination-reporting reviews.

These findings offer critical implications for software quality assurance, highlighting the need for targeted monitoring and mitigation strategies for AI mobile apps. This research provides a foundational, user-centric understanding of LLM hallucinations, paving the way for improved AI model development and more trustworthy mobile applications.

Introduction

The proliferation of mobile applications integrating advanced Large Language Models (LLMs) has ushered in a new era of user interaction and functionality. These AI-powered mobile apps promise to revolutionize user experiences by offering more intuitive, personalized, and intelligent services. However, this rapid adoption is accompanied by a significant and persistent challenge inherent to current LLM technology: the phenomenon of "hallucination." LLMs are prone to producing outputs that are factually incorrect, nonsensical, unfaithful to provided source content, or deviate from user intent, often with a high degree of apparent confidence.

Understanding real-world user encounters with LLM hallucinations is crucial, particularly as evaluations conducted in controlled laboratory settings or using synthetic benchmarks may not fully capture the spectrum of issues or their nuanced impact on everyday users interacting with deployed mobile applications. App store reviews offer a unique lens through which to observe these "in-the-wild" experiences.

The impact of LLM hallucinations on mobile users can be substantial. For instance, an AI travel planning app might generate incorrect flight details or recommend non-existent attractions; a learning app could provide erroneous factual information; or a productivity tool might summarize a document with fabricated key points. Such experiences can directly mislead users, lead to wasted time, cause frustration, and severely undermine their trust in the AI feature and the application as a whole.

Despite the acknowledgment of hallucination as a general LLM problem, there remains a significant gap in empirically characterizing how these issues manifest specifically within AI mobile apps and how users articulate these problems in their natural language feedback. Current understanding is often based on technical evaluations or general surveys on LLM challenges rather than a focused analysis of user-generated reports from the mobile app ecosystem.

Consequently, this study aims to bridge this gap by systematically analyzing user reviews from a diverse range of AI-powered mobile applications. Our primary goal is to understand and detect user-reported LLM hallucinations directly from their feedback. To achieve this, we address the following research questions: (RQ1) How prevalent are user reports potentially related to LLM hallucination in reviews of AI mobile apps? (RQ2) What types of LLM hallucination do users appear to report in their reviews? (RQ3) What characteristics distinguish user reviews containing potential hallucination reports? and (RQ4) What are the implications of user-reported hallucination for software quality assurance and the development of AI mobile apps?

To address these questions, this paper makes the following contributions: first, it provides a large-scale empirical study analyzing 3 million user reviews from 90 diverse AI-powered mobile apps to characterize user-reported issues; second, it develops a heuristic-based User-Reported LLM Hallucination Detection algorithm to identify candidate reviews, a sample of which we manually annotated; third, it estimates the prevalence of user reports indicative of LLM hallucinations and derives a data-driven taxonomy of seven user-perceived LLM hallucination types; and fourth, it characterizes the linguistic patterns of these reviews using N-gram analysis, Non-Negative Matrix Factorization (NMF) topic modeling, and VADER sentiment analysis.

By shedding light on the phenomenon of AI hallucinations in mobile apps, this study aims to contribute to the development of more trustworthy AI-powered mobile applications that can effectively integrate Large Language Models.