Contemporary news reporting increasingly features multimedia content, motivating research on multimedia event extraction. However, the task lacks annotated multimodal training data, and artificially generated training data suffer from distribution shift relative to real-world data. In this paper, we propose Cross-modality Augmented Multimedia Event Learning (CAMEL), which makes effective use of artificially generated multimodal training data and achieves state-of-the-art (SOTA) performance. Conditioned on unimodal training data, we generate multimodal training data using off-the-shelf image generators such as Stable Diffusion and image captioners such as BLIP. To learn robust features that remain effective across domains, we devise an iterative and gradual annealing training strategy. Extensive experiments show that CAMEL surpasses SOTA baselines on the M2E2 benchmark. On multimedia events in particular, we outperform the prior SOTA by 4.2% F1 on event mention identification and by 9.8% F1 on argument identification, which demonstrates that CAMEL learns synergistic representations from the two modalities.