Social media generates massive amounts of multimedia content with paired images and text every day, creating a pressing need to automate vision-and-language understanding for various multimodal classification tasks. Compared with commonly studied vision-language data, social media posts tend to exhibit more implicit image-text relations. To better bridge the cross-modal semantics therein, we capture hinting features from user comments, which are retrieved by jointly leveraging visual and lingual similarity. The classification tasks are then addressed via self-training in a teacher-student framework, motivated by the limited amount of labeled data typically available in existing benchmarks. Extensive experiments are conducted on four multimodal social media benchmarks, covering image-text relation classification, sarcasm detection, sentiment classification, and hate speech detection. The results show that our method improves on previous state-of-the-art models, which do not employ comment modeling or self-training.
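As a rough illustration of retrieving comments by joint visual and lingual similarity, the sketch below scores each candidate post by a weighted sum of image and text cosine similarity and collects the comments of the top-ranked posts. The abstract does not say how the two similarities are combined, so the weighted sum, the `alpha` and `top_k` parameters, the dictionary fields, and the toy embeddings are all assumptions made for illustration, not the authors' method.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_comments(query, pool, alpha=0.5, top_k=2):
    """Return the comments of the top_k pool posts most similar to the query.

    Assumption: joint similarity is alpha * image similarity
    + (1 - alpha) * text similarity; alpha and top_k are hypothetical.
    """
    scored = []
    for post in pool:
        image_sim = cosine(query["image_vec"], post["image_vec"])
        text_sim = cosine(query["text_vec"], post["text_vec"])
        scored.append((alpha * image_sim + (1.0 - alpha) * text_sim, post))
    # Highest joint score first; the sort is stable, so ties keep pool order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    comments = []
    for _, post in scored[:top_k]:
        comments.extend(post["comments"])
    return comments


# Toy, hand-made embeddings (hypothetical; real ones would come from encoders).
pool = [
    {"image_vec": [1.0, 0.0], "text_vec": [1.0, 0.0], "comments": ["so true", "love this"]},
    {"image_vec": [0.0, 1.0], "text_vec": [0.0, 1.0], "comments": ["yeah right..."]},
    {"image_vec": [1.0, 0.1], "text_vec": [0.0, 1.0], "comments": ["nice photo"]},
]
query = {"image_vec": [1.0, 0.0], "text_vec": [0.9, 0.1]}
retrieved = retrieve_comments(query, pool, alpha=0.5, top_k=2)
```

Setting `alpha=0.0` or `alpha=1.0` reduces the sketch to text-only or image-only retrieval, which makes it easy to see what the joint score contributes.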
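The self-training step in a teacher-student framework can be sketched in a few lines: a teacher is fit on the small labeled set, it pseudo-labels unlabeled posts, and a student is then fit on the labeled and pseudo-labeled data together. The abstract does not give the model, the pseudo-label selection rule, or the number of rounds, so the nearest-centroid classifier, the margin-based confidence, the `threshold` parameter, and the toy data below are stand-in assumptions, not the paper's implementation.

```python
def fit_centroids(examples):
    """Fit a toy nearest-centroid classifier: one mean feature vector per label."""
    sums, counts = {}, {}
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(centroids, features):
    """Return (label, confidence); confidence lies in [0.5, 1.0] for two or more classes."""
    dists = sorted(
        (sum((f - c) ** 2 for f, c in zip(features, centroid)) ** 0.5, label)
        for label, centroid in centroids.items()
    )
    best_dist, best_label = dists[0]
    if len(dists) == 1:
        return best_label, 1.0
    runner_up_dist = dists[1][0]
    total = best_dist + runner_up_dist
    # Margin-based score: 0.5 when the two nearest centroids are equally far.
    confidence = runner_up_dist / total if total > 0.0 else 0.5
    return best_label, confidence


def self_train(labeled, unlabeled, threshold=0.8):
    """One round of self-training; threshold is a hypothetical hyperparameter."""
    teacher = fit_centroids(labeled)
    pseudo = []
    for features in unlabeled:
        label, confidence = predict(teacher, features)
        if confidence >= threshold:  # keep only confident pseudo-labels
            pseudo.append((features, label))
    student = fit_centroids(labeled + pseudo)
    return student, pseudo


# Toy data: two labeled posts and three unlabeled ones (features are made up).
labeled = [([0.0, 0.0], "neg"), ([1.0, 1.0], "pos")]
unlabeled = [[0.1, 0.0], [0.9, 1.0], [0.5, 0.5]]
student, pseudo = self_train(labeled, unlabeled)
```

In this toy run the ambiguous point `[0.5, 0.5]` is equidistant from both centroids and is therefore left out of the pseudo-labeled set, which is the role the confidence threshold plays in the sketch.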