Natural language understanding (NLU) studies often exaggerate or underestimate the capabilities of systems, thereby limiting the reproducibility of their findings. These erroneous evaluations can be attributed to the difficulty of defining and testing NLU adequately. In this position paper, we reconsider this challenge by identifying two types of researcher degrees of freedom. First, we revisit Turing's original interpretation of the Turing test and argue that an NLU test does not provide an operational definition; it merely provides inductive evidence that the test subject understands the language sufficiently well to meet stakeholder objectives. In other words, stakeholders are free to define NLU arbitrarily through their objectives. To use test results as inductive evidence, stakeholders must carefully assess whether the interpretation of test scores is valid. Second, designing and using NLU tests involves further degrees of freedom, such as specifying target skills and defining evaluation metrics. As a result, achieving consensus among stakeholders becomes difficult. To resolve this issue, we propose a validity argument, a framework comprising a series of validation criteria across test components. By demonstrating that current practices in NLU studies can be associated with those criteria and organizing them into a comprehensive checklist, we show that the validity argument can serve as a coherent guideline for designing credible test sets and facilitating scientific communication.