The rapid advancement of AI-based high-quality image generation models has produced a deluge of anime illustrations. Recommending illustrations to users from such massive collections has become a challenging and popular task. However, existing anime recommendation systems focus on text features and have yet to integrate image features. In addition, most multi-modal recommendation research is constrained by tightly coupled datasets, which limits its applicability to anime illustrations. To address these gaps, we propose User-aware Multi-modal Animation Illustration Recommendation Fusion with Painting Style (UMAIR-FPS). In the feature extraction phase, for image features, we are the first to combine painting style features with semantic features, constructing a dual-output image encoder that enhances representation. For text features, we obtain text embeddings by fine-tuning Sentence-Transformers on domain knowledge, composed of a variety of domain text pairs built from three perspectives: multilingual mappings, entity relationships, and term explanations. In the multi-modal fusion phase, we propose a novel user-aware multi-modal contribution measurement mechanism that dynamically weights multi-modal features according to user features at the interaction level, and we employ the DCN-V2 module to effectively model bounded-degree multi-modal crosses. UMAIR-FPS surpasses state-of-the-art baselines on large real-world datasets, demonstrating substantial performance improvements.
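The fusion phase described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the softmax gate used to turn user features into per-modality weights, all function and variable names, the embedding sizes, and the random parameters are assumptions chosen for illustration. Only the cross-layer update, x_{l+1} = x_0 ⊙ (W_l x_l + b_l) + x_l, follows the standard DCN-V2 formulation, in which stacking L such layers models feature crosses up to degree L + 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def user_aware_weights(user, W_gate, b_gate):
    """Assumed gate: map user features (batch, d_user) to one weight per modality."""
    return softmax(user @ W_gate + b_gate)

def fuse(user, modalities, W_gate, b_gate):
    """Scale each modality embedding by its per-interaction weight, then concatenate."""
    w = user_aware_weights(user, W_gate, b_gate)            # (batch, n_modalities)
    scaled = [w[:, i:i + 1] * m for i, m in enumerate(modalities)]
    return np.concatenate(scaled, axis=1), w

def dcn_v2_cross(x0, weights, biases):
    """Stack of DCN-V2 cross layers: x_{l+1} = x0 * (x_l @ W_l + b_l) + x_l."""
    x = x0
    for W, b in zip(weights, biases):
        x = x0 * (x @ W + b) + x
    return x

# Hypothetical sizes: 4 interactions, 8 user features, three 16-dim modality embeddings
# (painting style, image semantics, text).
batch, d_user, d_mod, n_mod, n_layers = 4, 8, 16, 3, 2
user = rng.normal(size=(batch, d_user))
style, semantic, text = (rng.normal(size=(batch, d_mod)) for _ in range(n_mod))

W_gate = rng.normal(size=(d_user, n_mod))
b_gate = np.zeros(n_mod)
fused, w = fuse(user, [style, semantic, text], W_gate, b_gate)

d = fused.shape[1]
cross_W = [rng.normal(scale=0.1, size=(d, d)) for _ in range(n_layers)]
cross_b = [np.zeros(d) for _ in range(n_layers)]
crossed = dcn_v2_cross(fused, cross_W, cross_b)

print(fused.shape, crossed.shape)
```

Because the gate is computed per interaction, two users viewing the same illustration can weight painting style, image semantics, and text differently. The residual term in each cross layer keeps lower-degree interactions available alongside the higher-degree ones.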