Evaluation plays a significant role in modern natural language processing. Most modern NLP benchmarks consist of arbitrary sets of tasks that neither guarantee any generalization potential for the model once applied outside the test set nor try to minimize the resource consumption needed for model evaluation. This paper presents a theoretical instrument and a practical algorithm to calculate similarity between benchmark tasks; we call this similarity measure "Vygotsky distance". The core idea of this measure is that it is based on the relative performance of the "students" on a given task, rather than on the properties of the task itself. If two tasks are close to each other in terms of Vygotsky distance, models tend to have similar relative performance on them. Thus, knowing the Vygotsky distance between tasks, one can significantly reduce the number of evaluation tasks while maintaining a high validation quality. Experiments on various benchmarks, including GLUE, SuperGLUE, CLUE, and RussianSuperGLUE, demonstrate that a vast majority of NLP benchmarks could be at least 40% smaller in terms of the tasks included. Most importantly, Vygotsky distance could also be used for the validation of new tasks, thus increasing the generalization potential of future NLP models.
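To make the notion of "relative performance" concrete, the following is a minimal sketch of one possible rank-based task distance. It is an illustration only and not necessarily the definition used in the paper: we assume the distance between two tasks is the fraction of model pairs that the two tasks order differently (a normalized Kendall-tau-style disagreement). The function names `rank_disagreement` and `pairwise_distances`, the task names, and all scores are hypothetical.

```python
from itertools import combinations


def rank_disagreement(scores_a, scores_b):
    """Fraction of model pairs that two tasks order differently.

    scores_a[i] and scores_b[i] are the scores of the same model i on
    task A and task B. Pairs tied on either task are skipped.
    NOTE: this is an assumed, illustrative distance, not the paper's formula.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both tasks must be scored on the same models")
    discordant = 0
    comparable = 0
    for i, j in combinations(range(len(scores_a)), 2):
        diff_a = scores_a[i] - scores_a[j]
        diff_b = scores_b[i] - scores_b[j]
        if diff_a == 0 or diff_b == 0:
            continue  # a tie carries no ordering information
        comparable += 1
        if (diff_a > 0) != (diff_b > 0):
            discordant += 1
    return discordant / comparable if comparable else 0.0


def pairwise_distances(task_scores):
    """Distance for every unordered pair of tasks, keyed by sorted task names."""
    tasks = sorted(task_scores)
    return {
        (a, b): rank_disagreement(task_scores[a], task_scores[b])
        for a, b in combinations(tasks, 2)
    }


# Made-up scores of four hypothetical models on three hypothetical tasks.
toy_scores = {
    "task_A": [0.90, 0.80, 0.70, 0.60],
    "task_B": [0.85, 0.75, 0.72, 0.50],
    "task_C": [0.60, 0.70, 0.80, 0.90],
}

distances = pairwise_distances(toy_scores)
for pair, dist in distances.items():
    print(pair, dist)
```

In this toy example, `task_A` and `task_B` rank the four models identically, so under the assumed distance one of them adds no ranking information and could be dropped, whereas `task_C` reverses the ranking and would be kept.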