While contextualized word embeddings have become the de facto standard, learning contextualized phrase embeddings is less explored and is hindered by the lack of a human-annotated benchmark that tests machine understanding of phrase semantics given a context sentence or paragraph (rather than phrases alone). To fill this gap, we propose PiC, a dataset of ~28K noun phrases accompanied by their contextual Wikipedia pages, together with a suite of three tasks for training and evaluating phrase embeddings. Training on PiC improves ranking models' accuracy and remarkably pushes span-selection (SS) models (i.e., models that predict the start and end indices of the target phrase) to near-human accuracy, which is 95% Exact Match (EM), on semantic search given a query phrase and a passage. Interestingly, we find evidence that this performance arises because the SS models learn to better capture the common meaning of a phrase regardless of its actual context. SotA models perform poorly at distinguishing two senses of the same phrase in two contexts (~60% EM) and at estimating the similarity between two different phrases in the same context (~70% EM).
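As a rough illustration of the Exact Match (EM) metric for span selection mentioned above, the following minimal sketch counts a prediction as correct only when both its start and end indices equal the gold span. The function name `exact_match`, the index-pair representation, and the toy data are assumptions made for illustration; the benchmark's own evaluation may instead compare normalized answer strings.

```python
from typing import Sequence, Tuple

Span = Tuple[int, int]  # (start_index, end_index) of a phrase in the passage


def exact_match(predicted: Sequence[Span], gold: Sequence[Span]) -> float:
    """Return the percentage of predicted spans identical to the gold spans.

    Hypothetical helper: a prediction scores only if both indices match.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(1 for p, g in zip(predicted, gold) if tuple(p) == tuple(g))
    return 100.0 * hits / len(gold)


# Toy data (made up): the second prediction has the wrong end index.
predicted = [(3, 5), (10, 12), (0, 1), (7, 9)]
gold = [(3, 5), (10, 13), (0, 1), (7, 9)]
score = exact_match(predicted, gold)
print(f"EM = {score:.1f}%")
```

Under this strict definition a span that overlaps the gold phrase but misses one boundary earns no credit, which is why EM is usually the harsher of the common span-selection metrics.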