Climate change (CC) has attracted increasing attention in NLP in recent years. However, stance detection on CC in multimodal data is understudied and remains challenging due to a lack of reliable datasets. To improve the understanding of public opinions and communication strategies, this paper presents MultiClimate, the first open-source, manually annotated stance detection dataset with $100$ CC-related YouTube videos and $4,209$ frame-transcript pairs. We deploy state-of-the-art vision, language, and multimodal models for stance detection on MultiClimate. Results show that text-only BERT significantly outperforms image-only ResNet50 and ViT. Combining both modalities achieves state-of-the-art performance, with $0.747$/$0.749$ in accuracy/F1. Our 100M-sized fusion models also beat CLIP and BLIP, as well as the much larger 9B-sized multimodal IDEFICS and text-only Llama3 and Gemma2, indicating that multimodal stance detection remains challenging for large language models. Our code, dataset, and supplementary materials are available at https://github.com/werywjw/MultiClimate.