How does one measure "ability to understand language"? When it is a person's ability that is being measured, this question almost never arises in an unqualified form: whatever formal test is applied, it takes place against the background of the person's language use in daily social practice, and what is measured is a specialised variety of language understanding (e.g., of a second language, or of written, technical language). Computer programs do not have this background. What does that mean for the applicability of formal tests of language understanding? I argue that such tests need to be complemented with tests of language use embedded in a practice, to arrive at a more comprehensive evaluation of "artificial language understanding". To do such tests systematically, I propose to use "Dialogue Games" -- constructed activities that provide a situational embedding for language use. I describe a taxonomy of Dialogue Game types, linked to a model of the underlying capabilities that are tested, and thereby give an argument for the \emph{construct validity} of the test. I close by showing how the internal structure of the taxonomy suggests an ordering from more specialised to more general situational language understanding, which could provide some strategic guidance for development in this field.