DReX: An Explainable Deep Learning-based Multimodal Recommendation Framework

Multimodal recommender systems leverage diverse data sources, such as user interactions, content features, and contextual information, to address challenges like cold-start and data sparsity. However, existing methods often suffer from one or more key limitations: they process different modalities in isolation, require complete multimodal data for each interaction during training, or learn user and item representations independently. These limitations increase model complexity and can leave user and item embeddings misaligned. To address these challenges, we propose DReX, a unified multimodal recommendation framework that incrementally refines user and item representations by leveraging interaction-level features from multimodal feedback. Our model employs gated recurrent units to selectively integrate these fine-grained features into global representations. This incremental update mechanism provides three key advantages: (1) simultaneous modeling of both nuanced interaction details and broader preference patterns, (2) elimination of separate user and item feature extraction processes, which improves alignment between their learned representations, and (3) inherent robustness to varying or missing modalities. We evaluate the proposed approach on three real-world datasets containing reviews and ratings as interaction modalities. By treating review text as a modality, our approach automatically generates keyword profiles for both users and items, which supplement the recommendation process with interpretable preference indicators. Experimental results demonstrate that our approach outperforms state-of-the-art methods across all evaluated datasets.
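The incremental update described in the abstract can be illustrated with a minimal sketch. This is not the DReX implementation: the abstract does not specify the architecture, so the fusion rule (averaging the projected vectors of whichever modalities are present), the use of one GRU cell per side, the dimensions, and all names (`GRUCell`, `fuse_modalities`, `projections`) are assumptions made for illustration. The sketch only shows how a gated recurrent update can fold one interaction's features into both a user state and an item state, and how a missing modality can be skipped rather than imputed.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class GRUCell:
    """Standard GRU cell with randomly initialised (untrained) weights."""

    def __init__(self, input_dim, hidden_dim, rng):
        # Index 0: update gate, 1: reset gate, 2: candidate state.
        self.W = rng.normal(0.0, 0.1, (3, hidden_dim, input_dim))
        self.U = rng.normal(0.0, 0.1, (3, hidden_dim, hidden_dim))
        self.b = np.zeros((3, hidden_dim))

    def step(self, h, x):
        z = sigmoid(self.W[0] @ x + self.U[0] @ h + self.b[0])
        r = sigmoid(self.W[1] @ x + self.U[1] @ h + self.b[1])
        h_tilde = np.tanh(self.W[2] @ x + self.U[2] @ (r * h) + self.b[2])
        # The update gate decides how much of the new evidence is kept.
        return (1.0 - z) * h + z * h_tilde


def fuse_modalities(features, projections, dim):
    """Average the projected vectors of the modalities that are present.

    Missing modalities are passed as None and skipped (an assumed fusion
    rule, chosen only to illustrate robustness to missing inputs).
    """
    parts = [projections[name] @ vec
             for name, vec in features.items() if vec is not None]
    if not parts:
        return np.zeros(dim)
    return np.mean(parts, axis=0)


rng = np.random.default_rng(0)
feat_dim, hidden_dim, review_dim = 8, 16, 32

# Hypothetical per-modality projections into a shared feature space.
projections = {
    "rating": rng.normal(0.0, 0.1, (feat_dim, 1)),
    "review": rng.normal(0.0, 0.1, (feat_dim, review_dim)),
}
user_gru = GRUCell(feat_dim, hidden_dim, rng)
item_gru = GRUCell(feat_dim, hidden_dim, rng)
user_state = np.zeros(hidden_dim)
item_state = np.zeros(hidden_dim)

# Two toy interactions; the second has no review text.
interactions = [
    {"rating": np.array([5.0]), "review": rng.normal(size=review_dim)},
    {"rating": np.array([2.0]), "review": None},
]
for feats in interactions:
    x = fuse_modalities(feats, projections, feat_dim)
    # The same interaction-level feature refines both representations.
    user_state = user_gru.step(user_state, x)
    item_state = item_gru.step(item_state, x)

# A dot product is one simple way to score a user-item pair.
score = float(user_state @ item_state)
```

Because both states are refined from the same interaction-level feature vector, they live in a shared space without a separate feature extractor per side, which is the intuition behind the alignment claim. For simplicity the sketch updates a single user and a single item; a real system would keep one state per user and per item.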
