Multimodal recommender systems integrate multimodal information (e.g., textual descriptions, images) into a collaborative filtering framework to provide more accurate recommendations. Although multimodal information could make these systems more interpretable, current multimodal models represent users and items with entangled numerical vectors, which makes them difficult to interpret. To address this, we propose a Disentangled Graph Variational Auto-Encoder (DGVAE) that aims to improve both model and recommendation interpretability. DGVAE first projects multimodal information into textual content, for example by converting images to text, using state-of-the-art multimodal pre-training technologies. It then constructs a frozen item-item graph and encodes the contents and interactions into two sets of disentangled representations using a simplified residual graph convolutional network. DGVAE further regularizes these disentangled representations through mutual information maximization, aligning the representations derived from user-item interactions with those learned from textual content. This alignment allows users' binary interactions to be interpreted via text. Our empirical analysis on three real-world datasets demonstrates that DGVAE significantly outperforms state-of-the-art baselines by a margin of 10.02%. We also present a case study on a real-world dataset to illustrate the interpretability of DGVAE. Code is available at: \url{https://github.com/enoche/DGVAE}.
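The graph construction and alignment steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation (see the repository linked above for that). The function names `build_knn_graph`, `residual_propagate`, and `info_nce` are hypothetical, and the specific choices (a cosine kNN graph over item text embeddings, a parameter-free propagation with a residual link to the input, and InfoNCE as the mutual information estimator) are assumptions made for the example. The variational and disentanglement components are omitted.

```python
import numpy as np

def build_knn_graph(text_emb, k):
    """Build a fixed, symmetrically normalized kNN item-item graph
    from item text embeddings (assumed construction, computed once)."""
    x = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = x @ x.T                       # cosine similarity between items
    np.fill_diagonal(sim, -np.inf)      # never pick an item as its own neighbor
    n = sim.shape[0]
    nbrs = np.argsort(-sim, axis=1)[:, :k]
    adj = np.zeros((n, n))
    adj[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    adj = np.maximum(adj, adj.T)        # symmetrize the neighbor relation
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def residual_propagate(adj, h0, num_layers):
    """Parameter-free graph propagation with a residual link to the input
    features (one possible reading of a 'simplified residual' GCN)."""
    h = h0
    for _ in range(num_layers):
        h = adj @ h + h0
    return h

def info_nce(z_inter, z_text, temperature=0.2):
    """InfoNCE loss between two views of the same items; minimizing it
    maximizes a lower bound on their mutual information."""
    a = z_inter / np.linalg.norm(z_inter, axis=1, keepdims=True)
    b = z_text / np.linalg.norm(z_text, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))  # matching items are the positives

# Toy usage with random stand-ins for the learned representations.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(6, 8))      # 6 items, 8-dim text embeddings
adj = build_knn_graph(text_emb, k=2)
z_text = residual_propagate(adj, text_emb, num_layers=2)
z_inter = rng.normal(size=(6, 8))       # placeholder interaction-side embeddings
loss = info_nce(z_inter, z_text)
```

In this sketch the graph is computed once from the text embeddings and never updated during training, which is what "frozen" is taken to mean here. The alignment loss is lowest when each item's interaction-side representation is closer to its own text-side representation than to those of other items.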