Feature engineering has long been central to recommender systems, yet effectively leveraging textual item features remains challenging. Recent advances in large language models (LLMs) have enabled their use as semantic encoders for recommendation, but their roles and behaviors in this setting are still not well understood. Prior studies often select LLMs based on general-purpose embedding benchmarks (e.g., MTEB), overlooking the unique characteristics of recommendation tasks. To address this gap, we introduce BLaIR, a comprehensive benchmark for evaluating LLMs as semantic encoders in recommendation scenarios. We contribute (1) a new large-scale dataset, Amazon Reviews 2023, with over 570 million reviews and 48 million items, (2) a unified benchmark covering sequential recommendation, collaborative filtering, and product search, and (3) a new complex-query product search task with both semi-synthetic and real-world evaluation datasets. Experiments with 11 leading LLMs reveal that model rankings on BLaIR correlate only weakly with their rankings on MTEB, highlighting the unique challenges of semantic encoding in recommendation.
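One way to make the ranking comparison concrete is a rank correlation between per-model scores on the two benchmarks. The sketch below computes Spearman's rho in plain Python; it is only an illustration and not necessarily the statistic used in the experiments. The function names `average_ranks` and `spearman_rho`, the model names, and every score are hypothetical placeholders, not results from BLaIR or MTEB.

```python
from math import sqrt
from typing import Dict, List


def average_ranks(scores: Dict[str, float]) -> Dict[str, float]:
    """Rank models by score (1 = best); tied scores share their average rank."""
    ordered: List[str] = sorted(scores, key=lambda m: scores[m], reverse=True)
    ranks: Dict[str, float] = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of models tied with ordered[i].
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ordered[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Spearman correlation between two score tables, over the models they share."""
    models = sorted(set(a) & set(b))
    if len(models) < 2:
        raise ValueError("need at least two shared models")
    rank_a = average_ranks({m: a[m] for m in models})
    rank_b = average_ranks({m: b[m] for m in models})
    xs = [rank_a[m] for m in models]
    ys = [rank_b[m] for m in models]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all ranks are tied")
    # Pearson correlation of the rank vectors is Spearman's rho.
    return cov / sqrt(var_x * var_y)


# Hypothetical placeholder scores, invented purely for illustration.
general_benchmark = {"model_a": 0.9, "model_b": 0.8, "model_c": 0.7, "model_d": 0.6}
recommendation_benchmark = {"model_a": 0.2, "model_b": 0.5, "model_c": 0.1, "model_d": 0.4}

rho = spearman_rho(general_benchmark, recommendation_benchmark)
print(f"Spearman rho: {rho:.2f}")
```

A rho near 1 would mean the two benchmarks order the models almost identically, while a value near 0 would indicate that a model's standing on one says little about its standing on the other. With only a handful of models the estimate is noisy, so it is best read alongside the raw rankings.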