The evaluation of natural language processing (NLP) systems is crucial for advancing the field, but current benchmarking approaches often assume that every system has a score available for every task, which is not always practical. In reality, several factors, such as the cost of running baselines, private systems, computational limitations, or incomplete data, may prevent some systems from being evaluated on entire tasks. This paper formalizes an existing problem in NLP research: benchmarking when some systems' scores are missing on some tasks, and proposes a novel approach to address it. Our method uses a compatible partial ranking approach to impute missing data, and the resulting rankings are then aggregated using the Borda count method. It includes two refinements designed specifically for scenarios where either task-level or instance-level scores are available. We also introduce an extended benchmark, which contains over 131 million scores, an order of magnitude larger than existing benchmarks. We validate our methods and demonstrate their effectiveness in addressing the challenge of systems that are missing evaluations on entire tasks. This work highlights the need for more comprehensive benchmarking approaches that can handle real-world scenarios where not all systems are evaluated on every task.
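To make the idea of combining compatible partial rankings with Borda count concrete, here is a minimal brute-force sketch. It is an illustration under stated assumptions, not the paper's algorithm: the function names (`compatible_rankings`, `expected_borda`, `aggregate`) are hypothetical, scores are assumed to be task-level with higher meaning better and no ties, and every full ranking compatible with the observed partial ranking is enumerated explicitly, which is only feasible for a handful of systems.

```python
from itertools import permutations


def compatible_rankings(observed, systems):
    """Yield every full ordering of `systems` (best first) that keeps the
    relative order of the systems in `observed` (hypothetical helper)."""
    observed_set = set(observed)
    for perm in permutations(systems):
        if [s for s in perm if s in observed_set] == list(observed):
            yield perm


def expected_borda(scores, systems):
    """Average Borda points per system over all compatible full rankings.

    `scores` maps system -> score for one task; systems without a score
    are simply absent. Assumes higher is better and no tied scores.
    """
    observed = sorted(scores, key=scores.get, reverse=True)
    n = len(systems)
    totals = {s: 0.0 for s in systems}
    count = 0
    for perm in compatible_rankings(observed, systems):
        count += 1
        for rank, s in enumerate(perm):
            totals[s] += n - 1 - rank  # best system gets n-1 points
    return {s: totals[s] / count for s in systems}


def aggregate(task_scores, systems):
    """Sum expected Borda points across tasks and rank systems by the total."""
    total = {s: 0.0 for s in systems}
    for scores in task_scores.values():
        for s, pts in expected_borda(scores, systems).items():
            total[s] += pts
    ranking = sorted(systems, key=lambda s: total[s], reverse=True)
    return ranking, total


# Toy data (made up for illustration): system B has no score on task2.
systems = ["A", "B", "C"]
task_scores = {
    "task1": {"A": 0.9, "B": 0.8, "C": 0.7},
    "task2": {"A": 0.6, "C": 0.5},
}
ranking, total = aggregate(task_scores, systems)
print(ranking, total)
```

In this sketch a system with a missing score is neither dropped nor given a guessed value: it receives the average number of Borda points it would earn across all full rankings consistent with what was actually observed, while the observed order among the other systems is preserved.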