In today's digital landscape, ratings and reviews are the lifeblood of any app's success on the App Store. Users rely on these insights to make informed decisions about which apps to download or purchase. To revolutionize this process, we've developed a cutting-edge approach to review summarization that leverages Large Language Models (LLMs) to provide users with a high-level overview of an app's user experience.

Challenges in Review Summarization

Summarizing crowd-sourced user reviews presents several challenges, including timeliness, diversity, and accuracy. To overcome these hurdles, we designed a novel approach that prioritizes safety, fairness, truthfulness, and helpfulness in generating summaries.

The Power of LLMs

Our solution leverages generative AI to extract key insights from each review, understand and aggregate commonly occurring themes, balance sentiment, and output a summary reflective of broad user opinion. We fine-tuned three separate LLMs for this task: one for insight extraction, another for dynamic topic modeling, and a third for summary generation.

Insight Extraction

The first step in our process is to extract key insights from each review using an LLM fine-tuned with LoRA adapters. This approach distills each review into a set of distinct insights, encapsulating specific aspects of the review in standardized natural language. This structured representation enables effective comparison of relevant topics across different reviews.
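The interface of this step can be sketched as follows. This is a minimal, hypothetical illustration, not the production system: the prompt wording, the `llm` callable, and the line-separated output format are all assumptions; the actual fine-tuned model and its adapter configuration are not described in enough detail here to reproduce.

```python
# Hypothetical sketch of insight extraction: one review in, a list of
# standardized insight strings out. The prompt text is an assumption.
INSIGHT_PROMPT = (
    "Distill the following app review into distinct insights, one per line, "
    "each phrased as a short standardized statement:\n\n{review}"
)

def parse_insights(model_output: str) -> list[str]:
    """Split the model's line-separated output into clean insight strings."""
    return [line.strip("- ").strip()
            for line in model_output.splitlines() if line.strip()]

def extract_insights(review: str, llm) -> list[str]:
    """Call an LLM (any callable: prompt -> text) and parse its insights."""
    return parse_insights(llm(INSIGHT_PROMPT.format(review=review)))

# Example with a stubbed model response standing in for the fine-tuned LLM:
fake_llm = lambda prompt: ("- Crashes on launch after the update\n"
                           "- Dark mode is easy on the eyes")
print(extract_insights("Love dark mode, but it crashes since the update!",
                       fake_llm))
```

Keeping extraction as its own step means each review is distilled once, and the downstream topic-modeling stage only ever sees short, comparable statements rather than raw free-form text.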

Dynamic Topic Modeling

After extracting insights, we use dynamic topic modeling to group similar themes from user reviews and identify the most prominent topics discussed. Our model learns to distill each insight into a topic name in a standardized fashion while avoiding a fixed taxonomy. We then apply deduplication logic on an app-by-app basis, combining semantically related topics and using pattern matching to account for variations in topic names.
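The per-app deduplication step might look like the sketch below. To keep the example runnable with the standard library alone, `difflib`'s string-similarity ratio stands in for the semantic similarity a real system would compute (for example, with embeddings); the threshold and greedy clustering strategy are likewise illustrative assumptions.

```python
# Greedy per-app topic deduplication: map each topic name to the canonical
# name of the first sufficiently similar cluster, or start a new cluster.
from difflib import SequenceMatcher

def dedupe_topics(topics: list[str], threshold: float = 0.75) -> dict[str, str]:
    canonical: list[str] = []   # one representative name per cluster
    mapping: dict[str, str] = {}
    for topic in topics:
        for canon in canonical:
            # Stand-in for semantic similarity between topic names.
            if SequenceMatcher(None, topic.lower(), canon.lower()).ratio() >= threshold:
                mapping[topic] = canon
                break
        else:
            canonical.append(topic)
            mapping[topic] = topic
    return mapping

print(dedupe_topics(["battery drain", "battery drains",
                     "dark mode", "Battery Drain"]))
```

Running deduplication app by app, rather than globally, lets "battery drain" and "battery drains" merge for one app without forcing every app into a shared, fixed taxonomy.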

Topic & Insight Selection

For each app, we select a set of topics for summarization, prioritizing topic popularity while incorporating additional criteria to enhance balance, relevance, helpfulness, and freshness. To ensure that the selected topics reflect the broader sentiment expressed by users, we check that the representative insights gathered are consistent with the app's overall ratings.
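One way to read the selection criteria above is as a scoring-and-filtering pass, sketched below. The field names, the popularity/freshness weighting, the -1..1 sentiment scale, and the rating-consistency rule are all hypothetical choices made for illustration; the post does not specify the actual formula.

```python
# Illustrative topic selection: filter out topics whose sentiment contradicts
# the app's overall rating, then rank by popularity with a freshness penalty.
from dataclasses import dataclass

@dataclass
class Topic:
    name: str
    mention_count: int      # how many reviews raised this topic (popularity)
    days_since_last: int    # recency of the newest mention (freshness)
    avg_sentiment: float    # -1 (negative) .. 1 (positive), an assumed scale

def select_topics(topics: list[Topic], overall_rating: float, k: int = 3) -> list[str]:
    # Assumed rule: apps rated 3+ stars are treated as overall-positive.
    positive_app = overall_rating >= 3.0
    # Keep topics whose sentiment agrees with the overall rating,
    # plus near-neutral topics, so the summary stays balanced.
    consistent = [t for t in topics
                  if (t.avg_sentiment >= 0) == positive_app
                  or abs(t.avg_sentiment) < 0.2]
    # Popularity dominates; stale topics are demoted.
    ranked = sorted(consistent,
                    key=lambda t: t.mention_count - 0.1 * t.days_since_last,
                    reverse=True)
    return [t.name for t in ranked[:k]]
```

For example, for a 4.5-star app, a heavily negative topic would be filtered out as inconsistent with overall sentiment, while a popular positive topic and a near-neutral one would survive the ranking.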

Summary Generation

The final step in our process is to generate a summary from the selected insights using an LLM fine-tuned for this task. We fine-tuned this model on a large, diverse set of reference summaries written by human experts, then continued training with preference alignment.
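The shape of this generation step can be sketched as below. As before, this is an assumed interface, not the production prompt or model: the summary LLM is represented by any callable from prompt to text.

```python
# Hypothetical final step: pack the selected insights into a prompt for the
# summary model and return its output. Prompt wording is an assumption.
def generate_summary(app_name: str, insights: list[str], llm) -> str:
    bullet_list = "\n".join(f"- {i}" for i in insights)
    prompt = (
        f"Write a brief, balanced summary of user opinion about {app_name} "
        f"based on these insights:\n{bullet_list}"
    )
    return llm(prompt).strip()
```

Because the model only sees the curated, sentiment-balanced insights from the previous step rather than raw reviews, the generated summary inherits the groundedness and balance properties established upstream.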

Evaluation

To evaluate the effectiveness of our approach, we assembled a comprehensive dataset of summary pairs, each comprising the model's initially generated output and a subsequent human-edited version, focusing on examples where the model's output could have been improved compositionally to adhere more closely to the intended style. Our evaluation criteria included safety, groundedness, composition, and helpfulness.

By leveraging LLMs and addressing the challenges of review summarization, we've developed a novel approach that provides users with a high-level overview of an app's user experience. This approach has far-reaching implications for improving app user experience and enhancing the overall quality of reviews on the App Store.