Existing explanation models for recommendation generate only text, yet still struggle to produce diverse content. In this paper, to further enrich explanations, we propose a new task named personalized showcases, in which we provide both textual and visual information to explain our recommendations. Specifically, we first select a personalized image set that is most relevant to a user's interest in a recommended item. We then generate natural language explanations conditioned on the selected images. For this new task, we collect a large-scale dataset from Google Local (i.e., maps) and construct a high-quality subset for generating multi-modal explanations. We propose a personalized multi-modal framework that uses contrastive learning to generate diverse and visually aligned explanations. Experiments show that our framework benefits from taking different modalities as input and produces more diverse and expressive explanations than previous methods across a variety of evaluation metrics.