Generating sentiment-controlled feedback in response to multimodal inputs, comprising both text and images, addresses a critical gap in human-computer interaction by enabling systems to provide empathetic, accurate, and engaging responses. This capability has profound applications in healthcare, marketing, and education. To this end, we construct a large-scale Controllable Multimodal Feedback Synthesis (CMFeed) dataset and propose a controllable feedback synthesis system. The proposed system comprises an encoder, a decoder, and a controllability block for textual and visual inputs. It extracts textual features with a transformer and visual features with a Faster R-CNN network, then combines them to generate feedback. The CMFeed dataset contains images, text, reactions to the post, human comments with relevance scores, and reactions to the comments. The reactions to the post and to the comments are used to train the proposed model to produce feedback with a specified (positive or negative) sentiment. The system achieves a sentiment classification accuracy of 77.23%, which is 18.82% higher than the accuracy obtained without the controllability block. Moreover, the system incorporates a similarity module that assesses feedback relevance through rank-based metrics, and it applies an interpretability technique to analyze the contribution of textual and visual features during the generation of uncontrolled and controlled feedback.
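As an illustration of what assessing feedback relevance through rank-based metrics could look like, the following minimal sketch ranks human comments by cosine similarity to a generated feedback embedding and scores the ranking. The abstract does not name the specific metrics or the embedding model, so reciprocal rank and recall@k, the toy two-dimensional vectors, and all function names here are assumptions chosen for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_comments(feedback_vec, comment_vecs):
    """Return comment indices sorted from most to least similar to the feedback."""
    scores = [cosine(feedback_vec, c) for c in comment_vecs]
    return sorted(range(len(comment_vecs)), key=lambda i: scores[i], reverse=True)


def reciprocal_rank(ranking, relevant):
    """1 / position of the first relevant comment in the ranking, or 0.0 if none appears."""
    for position, idx in enumerate(ranking, start=1):
        if idx in relevant:
            return 1.0 / position
    return 0.0


def recall_at_k(ranking, relevant, k):
    """Fraction of relevant comments that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for idx in ranking[:k] if idx in relevant)
    return hits / len(relevant)


# Toy data (hypothetical): one generated-feedback embedding and three comment embeddings.
feedback = [1.0, 0.0]
comments = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
relevant = {2}  # indices of comments treated as relevant, e.g. by a relevance-score threshold

ranking = rank_comments(feedback, comments)
print("ranking:", ranking)
print("reciprocal rank:", reciprocal_rank(ranking, relevant))
print("recall@2:", recall_at_k(ranking, relevant, 2))
```

In practice the vectors would come from whatever sentence encoder the similarity module uses, and the relevant set would be derived from the relevance scores attached to the human comments in CMFeed; both of those choices are outside what the abstract specifies.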