Integrating Large Language Models (LLMs) into recommendation systems has introduced new capabilities for natural language understanding, explanation generation, and conversational interaction. However, existing evaluation methodologies focus predominantly on traditional accuracy metrics and fail to capture the multifaceted human-centered qualities that determine real-world user experience. We introduce \framework{} (\textbf{H}uman-centered \textbf{E}valuation for \textbf{L}LM-powered reco\textbf{M}menders), a comprehensive evaluation framework that systematically assesses LLM-powered recommender systems along five human-centered dimensions: \textit{Intent Alignment}, \textit{Explanation Quality}, \textit{Interaction Naturalness}, \textit{Trust \& Transparency}, and \textit{Fairness \& Diversity}. In experiments with three state-of-the-art LLM-based recommenders (GPT-4, LLaMA-3.1, and P5) across three domains (movies, books, and restaurants), with 12 domain experts evaluating 847 recommendation scenarios, we demonstrate that \framework{} reveals critical quality dimensions that are invisible to traditional metrics. Our results show that while GPT-4 achieves superior explanation quality (4.21/5.0) and interaction naturalness (4.35/5.0), it exhibits substantially greater popularity bias (Gini coefficient 0.73) than traditional collaborative filtering (0.58). We release \framework{} as an open-source toolkit to advance human-centered evaluation practices in the recommender systems community.