Collaborative competitions have gained popularity in scientific and technological fields. Organizing one involves defining tasks, selecting evaluation metrics, and devising methods to verify results. In the standard scenario, participants receive a training set and are expected to provide a solution for a held-out dataset kept by the organizers. A central challenge for organizers is comparing the performance of algorithms, assessing multiple participants, and ranking them. Statistical tools are often used for this purpose; however, traditional statistical methods often fail to capture decisive differences between systems' performance. This manuscript describes an evaluation methodology for statistically analyzing both competition results and the competitions themselves. The methodology is designed to be universally applicable; here it is illustrated using eight natural language processing competitions, involving classification and regression problems, as case studies. The proposed methodology offers several advantages, including off-the-shelf comparisons with correction mechanisms and the inclusion of confidence intervals. Furthermore, we introduce metrics that allow organizers to assess the difficulty of competitions. Our analysis shows the potential usefulness of the methodology for effectively evaluating competition results.
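To make the idea of comparisons with correction mechanisms and confidence intervals concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not name the resampling scheme or the correction used, so the paired bootstrap, the percentile interval, and the Bonferroni correction here are all assumptions chosen for illustration. The function names (`accuracy`, `percentile_interval`, `compare_to_best`) are hypothetical.

```python
import random


def accuracy(gold, pred):
    """Fraction of items where the prediction matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def percentile_interval(samples, alpha=0.05):
    """Percentile interval covering roughly 1 - alpha of the samples."""
    s = sorted(samples)
    lo = s[int((alpha / 2) * (len(s) - 1))]
    hi = s[int((1 - alpha / 2) * (len(s) - 1))]
    return lo, hi


def compare_to_best(gold, systems, score=accuracy, alpha=0.05,
                    n_boot=1000, seed=0):
    """Paired bootstrap comparison of every system against the top one.

    Assumption: Bonferroni is used as the correction mechanism, so each
    difference interval is built at level alpha / (number of comparisons).
    """
    rng = random.Random(seed)
    n = len(gold)
    point = {name: score(gold, pred) for name, pred in systems.items()}
    best = max(point, key=point.get)

    # Resample the held-out set once per replicate and score every system
    # on the same resample, which keeps the comparison paired.
    boot = {name: [] for name in systems}
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        for name, pred in systems.items():
            boot[name].append(score(g, [pred[i] for i in idx]))

    m = max(len(systems) - 1, 1)  # number of comparisons against the best
    report = {}
    for name in systems:
        entry = {"score": point[name],
                 "ci": percentile_interval(boot[name], alpha)}
        if name != best:
            diffs = [b - o for b, o in zip(boot[best], boot[name])]
            lo, hi = percentile_interval(diffs, alpha / m)
            entry["diff_ci"] = (lo, hi)
            # The difference counts as significant when the corrected
            # interval excludes zero.
            entry["significant"] = not (lo <= 0.0 <= hi)
        report[name] = entry
    return best, report
```

Under these assumptions, an organizer would pass the held-out gold labels and a dictionary mapping each participant to its predictions, then read off each system's interval and whether its gap to the top system survives the correction. For a regression task, a different `score` function (for example, one returning negative error so that larger is better) could be passed in place of `accuracy`.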