Accurate processing of non-compositional language relies on generating good representations for such expressions. In this work, we study the representation of language non-compositionality by proposing a language model, PIER, that builds on BART and creates semantically meaningful and contextually appropriate representations for English potentially idiomatic expressions (PIEs). PIEs are characterized by their non-compositionality and by contextual ambiguity between their literal and idiomatic interpretations. Via intrinsic evaluation of embedding quality and extrinsic evaluation on PIE processing and NLU tasks, we show that representations generated by PIER yield a 33% higher homogeneity score for embedding clustering than BART, and gains of 3.12% in accuracy and 3.29% in sequence accuracy for PIE sense classification and span detection, respectively, over the state-of-the-art IE representation model, GIEA. These gains come without sacrificing performance on NLU tasks: PIER's accuracy stays within ±1% of BART's.
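
The abstract does not state how the homogeneity score is computed. As an illustration only, the sketch below assumes the standard entropy-based definition, h = 1 - H(C|K)/H(C), where C are the gold sense labels (literal or idiomatic) and K are the cluster assignments of the embeddings. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(C) of a label sequence, in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(classes, clusters):
    """Conditional entropy H(C | K) of gold classes given cluster assignments."""
    n = len(classes)
    joint = Counter(zip(classes, clusters))   # counts of (class, cluster) pairs
    cluster_sizes = Counter(clusters)         # counts per cluster
    return -sum(
        (count / n) * math.log(count / cluster_sizes[k])
        for (_, k), count in joint.items()
    )


def homogeneity(classes, clusters):
    """Assumed definition: 1 - H(C|K) / H(C); 1.0 when each cluster holds one class."""
    if len(classes) != len(clusters):
        raise ValueError("classes and clusters must have the same length")
    h_c = entropy(classes)
    if h_c == 0.0:
        # Only one gold class exists, so every clustering is trivially homogeneous.
        return 1.0
    return 1.0 - conditional_entropy(classes, clusters) / h_c


# Toy example (hypothetical labels): gold senses of four PIE occurrences.
gold = ["idiomatic", "idiomatic", "literal", "literal"]
pure_clusters = [0, 0, 1, 1]    # each cluster contains a single sense
mixed_clusters = [0, 0, 0, 0]   # one cluster mixes both senses

pure_score = homogeneity(gold, pure_clusters)
mixed_score = homogeneity(gold, mixed_clusters)
```

Under this definition, a higher score means that each embedding cluster contains occurrences of mostly one sense, which is the property the intrinsic evaluation rewards.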