Potentially idiomatic expressions (PIEs) construe meanings inherently tied to the everyday experience of a given language community. As such, they pose an interesting challenge for assessing the linguistic (and to some extent cultural) capabilities of NLP systems. In this paper, we present XMPIE, a parallel multilingual and multimodal dataset of potentially idiomatic expressions. The dataset covers 34 languages and over ten thousand items, and supports comparative analyses of idiomatic patterns across language-specific realisations and preferences, with the aim of gathering insights about shared cultural aspects. Because the dataset is parallel, it makes it possible to evaluate model performance on a given PIE in different languages and to test whether idiomatic understanding in one language transfers to another. Moreover, the dataset supports the study of PIEs across textual and visual modalities, to measure to what extent PIE understanding in one modality transfers to, or implies, understanding in the other (text vs. image). The data was created by language experts, with both textual and visual components crafted under multilingual guidelines. Each PIE is accompanied by five images representing a spectrum from idiomatic to literal meanings, including semantically related and random distractors. The result is a high-quality benchmark for evaluating multilingual and multimodal idiomatic language understanding.