Pre-trained multilingual language models (e.g., mBERT, XLM-RoBERTa) have significantly advanced the state of the art for zero-shot cross-lingual information extraction. These language models ubiquitously rely on word segmentation techniques that break a word into smaller constituent subwords. Therefore, all word labeling tasks (e.g., named entity recognition, event detection) necessitate a pooling strategy that takes the subword representations as input and outputs a representation for the entire word. Taking the task of cross-lingual event detection as a motivating example, we show that the choice of pooling strategy can have a significant impact on the target language performance. For example, the performance varies by up to 16 absolute $F_1$ points depending on the pooling strategy when training in English and testing in Arabic on the ACE task. We carry out our analysis with five different pooling strategies across nine languages in diverse multilingual datasets. Across configurations, we find that the canonical strategy of taking just the first subword to represent the entire word is usually sub-optimal. On the other hand, we show that attention pooling is robust to language and dataset variations, being either the best or close to the optimal strategy. For reproducibility, we make our code available at https://github.com/isi-boston/ed-pooling.
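To make the notion of a pooling strategy concrete, the following is a minimal NumPy sketch of three common choices: first-subword, mean, and a simple dot-product attention pooling. It is an illustration under stated assumptions, not the implementation in the linked repository: the function names, the single `query` vector, and the toy inputs are all hypothetical, and these three are not necessarily among the five strategies the paper evaluates in this exact form.

```python
import numpy as np


def first_pool(subwords):
    """Represent the word by its first subword vector."""
    return subwords[0]


def mean_pool(subwords):
    """Represent the word by the average of its subword vectors."""
    return subwords.mean(axis=0)


def attention_pool(subwords, query):
    """Weighted average of subword vectors.

    Weights are a softmax over dot products with `query`, which stands in
    for a learned parameter (an assumption made for this sketch).
    Returns the pooled vector and the attention weights.
    """
    scores = subwords @ query          # shape: (n_subwords,)
    scores = scores - scores.max()     # shift for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()  # softmax over subwords
    return weights @ subwords, weights


# Toy input: one word split into 3 subwords, hidden size 4.
subwords = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
query = np.zeros(4)  # an untrained (all-zero) query gives uniform weights

pooled, weights = attention_pool(subwords, query)
```

With an all-zero query the attention weights are uniform, so attention pooling reduces to mean pooling; a trained query can instead place more weight on whichever subwords are most informative for the task.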