The Controllable Multimodal Feedback Synthesis (CMFeed) dataset enables the generation of sentiment-controlled feedback from multimodal inputs. It contains images, text, human comments, the comments' metadata, and sentiment labels. Existing datasets for related tasks such as multimodal summarization, visual question answering, visual dialogue, and sentiment-aware text generation do not support training models on human-generated outputs and their metadata, a gap that CMFeed addresses. This capability is critical for developing feedback systems that understand and replicate human-like spontaneous responses. Based on the CMFeed dataset, we define a novel task, controllable feedback synthesis, which generates context-aware feedback aligned with a desired sentiment. We propose a benchmark feedback synthesis system comprising encoder, decoder, and controllability modules. It employs transformer and Faster R-CNN networks to extract features and generate sentiment-specific feedback, achieving a sentiment classification accuracy of 77.23%, which is 18.82% higher than that of models that do not leverage the dataset's unique controllability features. Additionally, we incorporate a similarity module that assesses relevance using rank-based metrics.