In recent decades, challenges have become popular in scientific research as a form of crowdsourcing. In particular, challenges are essential for developing machine learning algorithms. Setting up a challenge requires establishing the scientific question, the dataset (with adequate quality, quantity, diversity, and complexity), the performance metrics, and a way to validate the participants' results (a gold standard). This paper addresses the problem of evaluating the performance of different competitors (algorithms) under the restrictions imposed by the challenge scheme: multiple competitors are compared on a single dataset of fixed size, with a minimal number of submissions and a set of metrics chosen to assess performance. The algorithms are ranked according to the performance metric. However, performance differences among competitors are often as small as hundredths or even thousandths, which raises the question of whether those differences are statistically significant. This paper analyzes the results of the MeOffendEs@IberLEF 2021 competition and proposes performing inference through resampling techniques (bootstrap) to support challenge organizers' decision-making.
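As an illustration of the kind of resampling inference described above, the following is a minimal sketch of a paired percentile bootstrap for the difference between two competitors scored on the same fixed test set. It is not the procedure used in the paper: the function names (`accuracy`, `paired_bootstrap_diff`), the choice of accuracy as the metric, the number of resamples, and the toy labels are all assumptions made for the example.

```python
import random

def accuracy(gold, pred):
    """Fraction of instances where the prediction matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def paired_bootstrap_diff(gold, pred_a, pred_b, metric=accuracy,
                          n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for metric(A) - metric(B).

    Hypothetical helper: test instances are resampled with replacement,
    and both systems are scored on the same resampled instances (paired).
    """
    rng = random.Random(seed)
    n = len(gold)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # one bootstrap sample
        g = [gold[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(metric(g, a) - metric(g, b))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    observed = metric(gold, pred_a) - metric(gold, pred_b)
    return observed, lo, hi

# Toy data (assumed, not from the competition): 1000 binary gold labels.
gold = [i % 2 for i in range(1000)]
# System A is wrong on instances 0..99 (accuracy 0.900).
pred_a = [1 - g if i < 100 else g for i, g in enumerate(gold)]
# System B is wrong on instances 50..154 (accuracy 0.895).
pred_b = [1 - g if 50 <= i < 155 else g for i, g in enumerate(gold)]

observed, lo, hi = paired_bootstrap_diff(gold, pred_a, pred_b)
print(f"observed difference: {observed:.4f}, 95% interval: [{lo:.4f}, {hi:.4f}]")
```

If the interval contains zero, as it does for a gap of five thousandths on this toy set, the data do not support declaring one system better than the other. Resampling instances jointly for both systems keeps the comparison paired, which matters because both competitors are evaluated on the same test items.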