While contextualized word embeddings have become a de facto standard, learning contextualized phrase embeddings is less explored and is hindered by the lack of a human-annotated benchmark that tests machine understanding of phrase semantics given a context sentence or paragraph (rather than phrases alone). To fill this gap, we propose PiC, a dataset of ~28K noun phrases accompanied by their contextual Wikipedia pages, together with a suite of three tasks for training and evaluating phrase embeddings. Training on PiC improves ranking models' accuracy and remarkably pushes span-selection (SS) models (i.e., models that predict the start and end indices of the target phrase) to near-human accuracy, which is 95% Exact Match (EM) on semantic search given a query phrase and a passage. Interestingly, we find evidence that this impressive performance arises because the SS models learn to better capture the common meaning of a phrase regardless of its actual context. SotA models perform poorly at distinguishing two senses of the same phrase in two contexts (~60% EM) and at estimating the similarity between two different phrases in the same context (~70% EM).
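
To make the span-selection setup and the EM metric concrete, the following is a minimal sketch, assuming a prediction counts as correct only when its (start, end) index pair equals the gold pair. The function names `exact_match` and `exact_match_rate` are hypothetical and are not taken from the PiC release; the actual evaluation may instead compare normalized answer strings.

```python
from typing import Sequence, Tuple

Span = Tuple[int, int]  # (start index, end index) of the target phrase


def exact_match(pred: Span, gold: Span) -> int:
    """Return 1 if the predicted span equals the gold span, else 0.

    Assumption: spans are compared by index, not by surface text.
    """
    return int(tuple(pred) == tuple(gold))


def exact_match_rate(preds: Sequence[Span], golds: Sequence[Span]) -> float:
    """Percentage of examples whose predicted span matches the gold span."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not preds:
        return 0.0
    hits = sum(exact_match(p, g) for p, g in zip(preds, golds))
    return 100.0 * hits / len(preds)


# Toy usage: the first prediction matches, the second is off by one at the end.
predictions = [(3, 5), (0, 1)]
references = [(3, 5), (0, 2)]
score = exact_match_rate(predictions, references)
```

Under this index-based assumption a partially overlapping span earns no credit, which is what makes EM a strict measure for phrase retrieval.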