Where to Go Next for Recommender Systems? ID- vs. Modality-based Recommender Models Revisited

Recommendation models that use unique identities (IDs) to represent distinct users and items have been state of the art (SOTA) and have dominated the recommender systems (RS) literature for over a decade. Meanwhile, pre-trained modality encoders, such as BERT and ViT, have become increasingly powerful at modeling the raw modality features of an item, such as text and images. Given this, a natural question arises: can a purely modality-based recommendation model (MoRec) outperform or match a pure ID-based model (IDRec) when the itemID embedding is replaced with a SOTA modality encoder? In fact, this question was answered ten years ago, when IDRec beat MoRec by a strong margin in both recommendation accuracy and efficiency. We aim to revisit this "old" question and systematically study MoRec from several aspects. Specifically, we study several sub-questions: (i) Which recommendation paradigm, MoRec or IDRec, performs better in practical scenarios, especially in the general setting and in warm item scenarios where IDRec has a strong advantage? Does this hold for items with different modality features? (ii) Can the latest technical advances from other communities (i.e., natural language processing and computer vision) translate into accuracy improvements for MoRec? (iii) How can item modality representations be used effectively: can we use them directly, or do we have to adjust them with new data? (iv) Are there key challenges that MoRec must solve in practical applications? To answer these questions, we conduct rigorous experiments on item recommendation with two popular modalities, i.e., text and vision. We provide the first empirical evidence that MoRec, when trained with an expensive end-to-end method, is already comparable to its IDRec counterpart, even for warm item recommendation. Our results suggest that the dominance of IDRec in the RS field may be greatly challenged in the future.
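To make the IDRec/MoRec distinction concrete, the following is a minimal sketch of the one component that differs between the two paradigms, the item encoder. It is an illustration under stated assumptions, not the models studied here: the class names, the toy dimension `DIM`, and the mean-pooled user representation are all hypothetical, and a hashed bag-of-words stands in for a real pre-trained encoder such as BERT or ViT.

```python
import hashlib
import random

DIM = 8  # toy embedding size (assumption, chosen only for readability)


class IDItemEncoder:
    """IDRec-style: every item ID owns its own free vector in a lookup table."""

    def __init__(self, item_ids, seed=0):
        rng = random.Random(seed)
        self.table = {i: [rng.gauss(0.0, 0.1) for _ in range(DIM)] for i in item_ids}

    def encode(self, item):
        # Raises KeyError for an ID that was never seen (a cold item).
        return self.table[item["id"]]


class ToyTextItemEncoder:
    """MoRec-style: the item vector is computed from raw modality features.

    A hashed bag-of-words is a stand-in here for a pre-trained text encoder.
    """

    def encode(self, item):
        vec = [0.0] * DIM
        for token in item["title"].lower().split():
            digest = hashlib.md5(token.encode("utf-8")).hexdigest()
            vec[int(digest, 16) % DIM] += 1.0
        return vec


def score(history, candidate, encoder):
    """Dot product between a mean-pooled user vector and the candidate item."""
    vecs = [encoder.encode(it) for it in history]
    user = [sum(col) / len(vecs) for col in zip(*vecs)]
    return sum(u * v for u, v in zip(user, encoder.encode(candidate)))


history = [
    {"id": 1, "title": "red running shoes"},
    {"id": 2, "title": "blue running shoes"},
]
new_item = {"id": 99, "title": "green running shoes"}  # ID absent from the table

id_encoder = IDItemEncoder([it["id"] for it in history])
text_encoder = ToyTextItemEncoder()

print(score(history, new_item, text_encoder))  # works: uses only the title
```

In this sketch the rest of the pipeline is shared, so swapping `id_encoder` for `text_encoder` is the only change. Scoring `new_item` with `id_encoder` raises a `KeyError`, because an ID table has no entry for an unseen item, whereas the modality encoder needs only the item's raw features.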
