Effectively leveraging multimodal information from social media posts is essential to downstream tasks such as sentiment analysis, sarcasm detection, and hate speech classification. However, combining text and image information is challenging because matching image-text pairs exhibit idiosyncratic cross-modal semantics, with hidden or complementary information spread across the two modalities. In this work, we model these cross-modal semantics directly by proposing two auxiliary losses that are used jointly with the main task when fine-tuning any pre-trained multimodal model. Image-Text Contrastive (ITC) brings the image and text representations of a post closer together and separates them from those of different posts, capturing underlying dependencies. Image-Text Matching (ITM) facilitates the understanding of semantic correspondence between images and text by penalizing unrelated pairs. We combine these objectives with five multimodal models and demonstrate consistent improvements across four popular social media datasets. Furthermore, through detailed analysis, we identify the specific scenarios and cases in which each auxiliary task is most effective.
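To make the two auxiliary objectives concrete, the following is a minimal sketch of how ITC and ITM could be combined with a main task loss. It is an illustration under stated assumptions, not the authors' implementation: the function names (`itc_loss`, `itm_loss`, `joint_loss`), the symmetric InfoNCE form of ITC, the binary cross-entropy form of ITM, the temperature of 0.07, and the weights `lambda_itc` and `lambda_itm` are all hypothetical choices that the abstract does not specify. Plain Python lists stand in for model embeddings and logits.

```python
import math

def _normalize(v):
    # L2-normalize a vector; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def _log_softmax(row):
    # Numerically stable log-softmax over one row of similarity scores.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss (assumed InfoNCE form).

    The i-th image and i-th text are taken to come from the same post;
    every other pairing in the batch acts as a negative.
    """
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarity matrix scaled by the temperature.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = -sum(_log_softmax(sim[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2i = -sum(_log_softmax(sim_t[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)

def itm_loss(match_logits, labels):
    """Binary cross-entropy on logits (assumed form of ITM).

    labels[k] is 1 for an image-text pair from the same post and 0 for a
    deliberately mismatched pair.
    """
    total = 0.0
    for z, y in zip(match_logits, labels):
        # Stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))].
        total += max(z, 0.0) - y * z + math.log1p(math.exp(-abs(z)))
    return total / len(labels)

def joint_loss(task_loss, itc, itm, lambda_itc=0.1, lambda_itm=0.1):
    # Main task loss plus weighted auxiliary terms; the weights are
    # hypothetical hyperparameters, not values from the paper.
    return task_loss + lambda_itc * itc + lambda_itm * itm

# Toy batch: two posts with 2-d embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned_itc = itc_loss(images, aligned_texts)
shuffled_itc = itc_loss(images, shuffled_texts)
```

In this toy batch, `aligned_itc` is smaller than `shuffled_itc`, reflecting the intent that representations of the same post are pulled together while those of different posts are pushed apart. In an actual fine-tuning setup, the embeddings and matching logits would come from the pre-trained multimodal model, and the weighting of the auxiliary terms would need to be tuned.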