Incorporating auxiliary modalities such as images into event detection models has attracted increasing interest over the last few years. The complexity of natural language in describing situations has motivated researchers to leverage the related visual context to improve event detection performance. However, current approaches in this area suffer from data scarcity, since they require a large number of labelled text-image pairs for model training. Furthermore, limited access to the visual context at inference time degrades the performance of such models, which makes them practically ineffective in real-world scenarios. In this paper, we present a novel domain-adaptive visually-fused event detection approach that can be trained on a few labelled image-text pairs. Specifically, we introduce a visual imaginator method that synthesises images from text in the absence of visual context. Moreover, the imaginator can be customised to a specific domain. In doing so, our model can leverage the capabilities of pre-trained vision-language models and can be trained in a few-shot setting. This also allows for effective inference when only single-modality data (i.e. text) is available. Experimental evaluation on the benchmark M2E2 dataset shows that our model outperforms existing state-of-the-art models by up to 11 points.