Towards Next-Generation Recommender Systems: A Benchmark for Personalized Recommendation Assistant with LLMs

Recommender systems (RecSys) are widely used across modern digital platforms and have garnered significant attention. Traditional recommender systems usually focus only on fixed and simple recommendation scenarios, making it difficult for them to generalize to new and unseen recommendation tasks in an interactive paradigm. Recently, advances in large language models (LLMs) have revolutionized the foundational architecture of RecSys, driving their evolution into more intelligent and interactive personalized recommendation assistants. However, most existing studies rely on fixed task-specific prompt templates to generate recommendations and evaluate the performance of personalized assistants, which limits comprehensive assessment of their capabilities. This is because commonly used datasets lack high-quality textual user queries that reflect real-world recommendation scenarios, making them unsuitable for evaluating LLM-based personalized recommendation assistants. To address this gap, we introduce RecBench+, a new benchmark dataset designed to assess LLMs' ability to handle intricate user recommendation needs. RecBench+ encompasses a diverse set of queries that span both hard conditions and soft preferences, with varying difficulty levels. We evaluated commonly used LLMs on RecBench+ and report the following findings: 1) LLMs demonstrate preliminary abilities to act as recommendation assistants; 2) LLMs are better at handling queries with explicitly stated conditions, while facing challenges with queries that require reasoning or contain misleading information. Our dataset has been released at https://github.com/jiani-huang/RecBench.git.
