How to probe contextual word representations for linguistic structure in a way that is both principled and useful has received significant attention in the recent NLP literature. In our contribution to this discussion, we argue for a probe metric that reflects the fundamental trade-off between probe complexity and performance: the Pareto hypervolume. To measure complexity, we present a number of parametric and non-parametric metrics. Our experiments using Pareto hypervolume as an evaluation metric show that probes often do not conform to our expectations; for example, why should the non-contextual fastText representations encode more morpho-syntactic information than the contextual BERT representations? These results suggest that common, simple probing tasks, such as part-of-speech labeling and dependency arc labeling, are inadequate for evaluating the linguistic structure encoded in contextual word representations. This leads us to propose full dependency parsing as a probing task. In support of our suggestion that harder probing tasks are necessary, our experiments with dependency parsing reveal a wide gap in syntactic knowledge between contextual and non-contextual representations.
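To make the complexity-performance trade-off concrete, the following is a minimal sketch of a two-dimensional Pareto hypervolume, assuming each probe is summarized by one complexity value (lower is better) and one accuracy value (higher is better). The function names, the reference point, and the toy numbers are illustrative assumptions, not the definitions or results used in the experiments.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (complexity, accuracy); hypothetical summary of one probe


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep probes that no other probe beats on both complexity and accuracy."""
    front: List[Point] = []
    best_acc = float("-inf")
    # Scan by increasing complexity; a probe stays only if it improves accuracy.
    for c, a in sorted(points, key=lambda p: (p[0], -p[1])):
        if a > best_acc:
            front.append((c, a))
            best_acc = a
    return front


def hypervolume_2d(points: List[Point], ref: Point) -> float:
    """Area dominated by the Pareto front, measured against a reference point.

    `ref` is (worst complexity, worst accuracy); its choice is an assumption
    of this sketch and changes the absolute value of the result.
    """
    c_ref, a_ref = ref
    front = [(c, a) for c, a in pareto_front(points) if c <= c_ref and a >= a_ref]
    area = 0.0
    for i, (c, a) in enumerate(front):
        # Each front point dominates a strip up to the next front point's complexity.
        next_c = front[i + 1][0] if i + 1 < len(front) else c_ref
        area += (next_c - c) * (a - a_ref)
    return area


# Toy values for illustration only (not taken from any experiment).
probes = [(1.0, 0.5), (2.0, 0.7), (3.0, 0.6), (4.0, 0.9)]
print(pareto_front(probes))
print(hypervolume_2d(probes, ref=(5.0, 0.0)))
```

Under this reading, a representation scores well either by reaching high accuracy or by reaching a given accuracy with a simpler probe, so a family of probes is compared by the area its front dominates rather than by a single best accuracy.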