Multimodal recommender systems improve on canonical recommender systems that use no item features by exploiting diverse content types such as text, images, and videos, which alleviates the inherent sparsity of user-item interactions and increases user engagement. However, current neural network-based models often incur significant computational overhead because of the complex training required to learn and integrate information from multiple modalities. To address this challenge, we propose a training-free multimodal recommendation method grounded in graph filtering that achieves efficient and accurate recommendation. Specifically, the proposed method first constructs similarity graphs for two distinct modalities as well as for the user-item interaction data. It then optimally fuses these multimodal signals using a polynomial graph filter whose frequency response can be precisely controlled by adjusting frequency bounds. Furthermore, the filter coefficients are treated as hyperparameters, enabling flexible and data-driven adaptation. Extensive experiments on real-world benchmark datasets demonstrate that the proposed method not only improves recommendation accuracy by up to 22.25% over the best competitor but also dramatically reduces computational cost, with a runtime of less than 10 seconds.
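The pipeline described above (build similarity graphs, apply a polynomial graph filter, fuse the results) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the degree-based normalization, the top-k cosine graph construction, the weighted-sum fusion, and the default values of `coeffs`, `alpha`, `beta`, and `k` are all assumptions made for illustration, and the frequency-bound control mentioned in the abstract is omitted because its details are not given here.

```python
import numpy as np


def normalized_item_graph(R):
    """Item-item similarity graph from a user-item matrix R (users x items).

    Assumption: symmetric degree normalization of R, then R_norm^T R_norm.
    """
    R = np.asarray(R, dtype=float)
    user_deg = R.sum(axis=1, keepdims=True)
    item_deg = R.sum(axis=0, keepdims=True)
    user_deg[user_deg == 0] = 1.0  # avoid division by zero for empty rows
    item_deg[item_deg == 0] = 1.0  # avoid division by zero for empty columns
    R_norm = R / np.sqrt(user_deg) / np.sqrt(item_deg)
    return R_norm.T @ R_norm


def knn_feature_graph(X, k):
    """Symmetric, degree-normalized top-k cosine graph from item features X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    Xn = X / norms
    S = np.clip(Xn @ Xn.T, 0.0, None)  # keep non-negative similarities only
    np.fill_diagonal(S, 0.0)
    # Keep the k largest similarities in each row.
    idx = np.argpartition(-S, k - 1, axis=1)[:, :k]
    mask = np.zeros_like(S, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    A = np.where(mask, S, 0.0)
    A = np.maximum(A, A.T)  # symmetrize
    deg = A.sum(axis=1)
    deg[deg == 0] = 1.0
    d = 1.0 / np.sqrt(deg)
    return A * d[:, None] * d[None, :]


def polynomial_filter(P, coeffs):
    """Return sum_j coeffs[j] * P^(j+1); coeffs are hyperparameters."""
    out = np.zeros_like(P)
    power = np.eye(P.shape[0])
    for c in coeffs:
        power = power @ P
        out += c * power
    return out


def recommend_scores(R, X_text, X_image,
                     coeffs=(1.0, -0.5, 0.1), alpha=0.5, beta=0.5, k=2):
    """Training-free scores: filter each graph, then fuse by a weighted sum.

    alpha and beta are hypothetical fusion weights for the two modalities.
    """
    R = np.asarray(R, dtype=float)
    H = polynomial_filter(normalized_item_graph(R), coeffs)
    H = H + alpha * polynomial_filter(knn_feature_graph(X_text, k), coeffs)
    H = H + beta * polynomial_filter(knn_feature_graph(X_image, k), coeffs)
    return R @ H  # users x items score matrix


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    R_demo = (rng.random((5, 6)) < 0.4).astype(float)  # toy interactions
    scores = recommend_scores(R_demo,
                              rng.normal(size=(6, 4)),   # toy text features
                              rng.normal(size=(6, 3)))   # toy image features
    print(scores.shape)
```

Because the sketch involves only matrix products and no gradient-based learning, scoring all users is a single forward computation, which is consistent with the training-free design the abstract describes. In practice, dense matrix powers would likely be replaced by sparse operations on larger datasets.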