While recommender systems with multi-modal item representations (image, audio, and text) have been widely explored, learning recommendations from multi-modal user interactions (e.g., clicks and speech) remains an open problem. We study multi-modal user interactions in a setting where users engage with a service provider through multiple channels (website and call center). In such a setting, incomplete modalities occur naturally, since not all users interact through all available channels. To address these challenges, we publish a real-world dataset that enables progress in this under-researched area. We further present and benchmark several methods for leveraging multi-modal user interactions for item recommendation, and propose a novel approach that specifically handles missing modalities by mapping user interactions to a common feature space. Our analysis reveals important interactions between the modalities and shows that a frequently occurring modality can enhance learning from a less frequent one.