Large language models (LLMs) have shown remarkable capabilities in generating user summaries from long lists of raw user activity data. These summaries capture essential user information, such as preferences and interests, and are therefore invaluable for LLM-based personalization applications such as explainable recommender systems. However, the development of new summarization techniques is hindered by the lack of ground-truth labels, the inherent subjectivity of user summaries, and the cost and time demands of human evaluation. To address these challenges, we introduce \UserSumBench, a benchmark framework designed to facilitate iterative development of LLM-based summarization approaches. This framework offers two key components: (1) a reference-free summary quality metric, which we show is effective and aligned with human preferences across three diverse datasets (MovieLens, Yelp, and Amazon Review); and (2) a novel, robust summarization method that leverages a time-hierarchical summarizer and a self-critique verifier to produce high-quality summaries while eliminating hallucination. This method serves as a strong baseline for further innovation in summarization techniques.