The ability to generate sentiment-controlled feedback in response to multimodal inputs comprising text and images addresses a critical gap in human-computer interaction. This capability allows systems to provide empathetic, accurate, and engaging responses, with useful applications in education, healthcare, marketing, and customer service. To this end, we construct a large-scale Controllable Multimodal Feedback Synthesis (CMFeed) dataset and propose a controllable feedback synthesis system. The system comprises an encoder, a decoder, and a controllability block for textual and visual inputs. It extracts textual and visual features using a transformer and a Faster R-CNN network, respectively, and combines them to generate feedback. The CMFeed dataset includes images, texts, reactions to the posts, human comments with relevance scores, and reactions to these comments. These reactions train the model to produce feedback with a specified sentiment, achieving a sentiment classification accuracy of 77.23\%, which is 18.82\% higher than the accuracy without controllability. The system also incorporates a similarity module that assesses feedback relevance through rank-based metrics and an interpretability technique that analyzes the contributions of textual and visual features during feedback generation. The CMFeed dataset and the system's code are available at https://github.com/MIntelligence-Group/CMFeed.
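To make the architecture described above concrete, the following is a minimal sketch in PyTorch of one plausible arrangement: a transformer encoder over the post text, a projection of precomputed Faster R-CNN region features, a learned sentiment-control embedding prepended to the fused memory, and a transformer decoder that generates the feedback tokens. All module names, tensor shapes, and the exact control mechanism are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of sentiment-controlled multimodal feedback generation.
# Assumes Faster R-CNN region features are precomputed (e.g. 1024-d per region).
import torch
import torch.nn as nn

class ControllableFeedbackModel(nn.Module):
    def __init__(self, vocab_size=30000, d_model=512, n_heads=8, n_sentiments=2):
        super().__init__()
        # Textual branch: transformer encoder over token embeddings of the post text.
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Visual branch: project precomputed Faster R-CNN region features into d_model.
        self.vis_proj = nn.Linear(1024, d_model)
        # Controllability block (assumed): a learned embedding for the target sentiment,
        # prepended to the fused multimodal memory.
        self.sent_emb = nn.Embedding(n_sentiments, d_model)
        # Decoder: attends over the fused memory to produce the feedback tokens.
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, region_feats, sentiment_id, tgt_ids):
        txt = self.text_encoder(self.tok_emb(text_ids))       # (B, Lt, D)
        vis = self.vis_proj(region_feats)                      # (B, R,  D)
        ctrl = self.sent_emb(sentiment_id).unsqueeze(1)        # (B, 1,  D)
        memory = torch.cat([ctrl, txt, vis], dim=1)            # fused multimodal memory
        out = self.decoder(self.tok_emb(tgt_ids), memory)      # (B, Lo, D)
        return self.lm_head(out)                               # next-token logits

# Example: one post with 12 text tokens, 5 detected regions, target sentiment index 1.
model = ControllableFeedbackModel()
logits = model(torch.randint(0, 30000, (1, 12)),
               torch.randn(1, 5, 1024),
               torch.tensor([1]),
               torch.randint(0, 30000, (1, 8)))
print(logits.shape)  # torch.Size([1, 8, 30000])
```

In this sketch the sentiment control signal is injected as a single prefix vector in the decoder's memory; the paper's actual controllability block, trained from reactions to human comments, may condition the model differently.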