What is the state of the art? Accounting for multiplicity in machine learning benchmark performance - 专知论文

会员服务 ·

0

SOTA · 多样性 · 分类器 · 基准 · 机器学习 ·

2023 年 4 月 3 日

What is the state of the art? Accounting for multiplicity in machine learning benchmark performance

翻译：何谓当前最优？机器学习基准性能中的多重性问题

Kajsa Møllersen,Einar Holsbø

Machine learning methods are commonly evaluated and compared by their performance on data sets from public repositories. This allows for multiple methods, oftentimes several thousands, to be evaluated under identical conditions and across time. The highest ranked performance on a problem is referred to as state-of-the-art (SOTA) performance, and is used, among other things, as a reference point for publication of new methods. Using the highest-ranked performance as an estimate for SOTA is a biased estimator, giving overly optimistic results. The mechanisms at play are those of multiplicity, a topic that is well-studied in the context of multiple comparisons and multiple testing, but has, as far as the authors are aware of, been nearly absent from the discussion regarding SOTA estimates. The optimistic state-of-the-art estimate is used as a standard for evaluating new methods, and methods with substantial inferior results are easily overlooked. In this article, we provide a probability distribution for the case of multiple classifiers so that known analyses methods can be engaged and a better SOTA estimate can be provided. We demonstrate the impact of multiplicity through a simulated example with independent classifiers. We show how classifier dependency impacts the variance, but also that the impact is limited when the accuracy is high. Finally, we discuss a real-world example; a Kaggle competition from 2020.

翻译：机器学习方法通常通过其在公共数据集上的表现进行评估和比较。这使得多种方法（往往多达数千种）能够在相同条件下跨时间进行评估。某一问题上排名最高的性能被称为当前最优（SOTA）性能，并作为新方法发表时的参考标准之一。然而，将排名最高的性能作为SOTA的估计值存在偏差，导致结果过于乐观。这种机制的本质是多重性问题——该问题在多重比较和多重假设检验领域已有深入研究，但据作者所知，在关于SOTA估计的讨论中几乎未曾涉及。这种乐观的SOTA估计被用作新方法的评估标准，导致性能明显较差的方法容易被忽视。本文针对多个分类器的情况给出了概率分布，使得已知的分析方法得以应用，并能提供更优的SOTA估计。我们通过一个独立分类器的模拟示例展示了多重性的影响，并揭示了分类器相关性对方差的影响，同时表明当准确率较高时这种影响有限。最后，我们讨论了一个真实案例：2020年的Kaggle竞赛。

0

相关内容

SOTA

2022最新MIT成果【ICML 2022】：一种提高人工智能的公平性和准确性的技术，Selective Regression Under Fairness Criteria

2022最新MIT成果【ICML 2022】：一种提高人工智能的公平性和准确性的技术，Selective Regression Under Fairness Criteria

专知会员服务

11+阅读 · 2022年7月26日

不可错过！《机器学习100讲》课程，UBC Mark Schmidt讲授

不可错过！《机器学习100讲》课程，UBC Mark Schmidt讲授

专知会员服务

76+阅读 · 2022年6月28日

哥伦比亚大学最新博士论文《机器学习在金融市场中的应用》Essays on the Applications of Machine Learning in Financial Markets

哥伦比亚大学最新博士论文《机器学习在金融市场中的应用》Essays on the Applications of Machine Learning in Financial Markets

专知会员服务

28+阅读 · 2022年4月8日

深度学习优化算法，73页ppt，Optimization Algorithms on Deep Learning

深度学习优化算法，73页ppt，Optimization Algorithms on Deep Learning

专知会员服务

135+阅读 · 2021年6月16日

ICLR 2021杰出论文奖出炉，8篇论文上榜！

专知会员服务

26+阅读 · 2021年4月2日

【Google 大脑】使用上千个优化任务学习超参数搜索策略，Using a thousand optimization tasks to learn hyperparameter search strategies

【Google 大脑】使用上千个优化任务学习超参数搜索策略，Using a thousand optimization tasks to learn hyperparameter search strategies

专知会员服务

18+阅读 · 2020年3月14日

【哈佛大学】机器学习的层次局限性，A Hierarchy of Limitations in Machine Learning

【哈佛大学】机器学习的层次局限性，A Hierarchy of Limitations in Machine Learning

专知会员服务

47+阅读 · 2020年2月12日

【论文推荐WWW2020-UIUC】修正排序系统中的选择偏差：Correcting for Selection Bias in Learning-to-rank Systems

【论文推荐WWW2020-UIUC】修正排序系统中的选择偏差：Correcting for Selection Bias in Learning-to-rank Systems

专知会员服务

32+阅读 · 2020年2月1日

【NeurlPS2019论文总结】一致收敛可能无法解释深度学习中的泛化现象，Uniform convergence may be unable to explain generalization in deep learning

【NeurlPS2019论文总结】一致收敛可能无法解释深度学习中的泛化现象，Uniform convergence may be unable to explain generalization in deep learning

专知会员服务

15+阅读 · 2019年12月17日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

GNN 新基准！Long Range Graph Benchmark

GNN 新基准！Long Range Graph Benchmark

图与推荐

0+阅读 · 2022年10月18日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

【泡泡一分钟】优化对比度增强以提高SLAM重定位环境中视觉跟踪的稳健性

【泡泡一分钟】优化对比度增强以提高SLAM重定位环境中视觉跟踪的稳健性

泡泡机器人SLAM

10+阅读 · 2019年4月26日

无监督元学习表示学习

无监督元学习表示学习

CreateAMind

27+阅读 · 2019年1月4日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

AI界的State of the Art都在这里了

AI界的State of the Art都在这里了

机器之心

12+阅读 · 2018年12月10日

【论文推荐】最新十篇目标跟踪相关论文—多帧光流跟踪、动态图学习、MV-YOLO、姿态估计、深度核相关滤波、Benchmark

【论文推荐】最新十篇目标跟踪相关论文—多帧光流跟踪、动态图学习、MV-YOLO、姿态估计、深度核相关滤波、Benchmark

专知

13+阅读 · 2018年5月26日

笔记 | Sentiment Analysis

笔记 | Sentiment Analysis

黑龙江大学自然语言处理实验室

10+阅读 · 2018年5月6日

【论文】变分推断（Variational inference)的总结

【论文】变分推断（Variational inference)的总结

机器学习研究会

39+阅读 · 2017年11月16日

基于深度学习的高分辨率PolSAR影像暗目标判别

国家自然科学基金

3+阅读 · 2015年12月31日

复合水泥浆体组成和微结构的定量演变规律及其与力学性能间的关系模型

国家自然科学基金

0+阅读 · 2014年12月31日

传感网中面向链路干扰最小化的拓扑控制

国家自然科学基金

0+阅读 · 2013年12月31日

考虑降雨和地形的长江上游非点源污染时空演变模拟研究

国家自然科学基金

0+阅读 · 2013年12月31日

miR-146a靶向IRAK1与TRAF6调控非小细胞肺癌转移的机制研究

国家自然科学基金

0+阅读 · 2012年12月31日

高分辨DFT调制滤波器组的设计及其在通信信号处理中的应用

国家自然科学基金

0+阅读 · 2012年12月31日

多天线无线通信系统的鲁棒性设计

国家自然科学基金

2+阅读 · 2012年12月31日

乙二醛-尿素-甲醛共缩聚树脂的合成、性能及反应机理研究

国家自然科学基金

0+阅读 · 2012年12月31日

滇西老厂富银红土型锰矿次生富集机制及40Ar/39Ar年龄

国家自然科学基金

0+阅读 · 2012年12月31日

甘油选择性氢解用高分散纳米双金属催化剂的设计与性能研究

国家自然科学基金

0+阅读 · 2009年12月31日

Archetypal Analysis++: Rethinking the Initialization Strategy

Arxiv

0+阅读 · 2023年5月25日

Off-Policy Evaluation with Online Adaptation for Robot Exploration in Challenging Environments

Arxiv

0+阅读 · 2023年5月24日

No-Regret Online Prediction with Strategic Experts

Arxiv

0+阅读 · 2023年5月24日

Analysis of modular CMA-ES on strict box-constrained problems in the SBOX-COST benchmarking suite

Arxiv

0+阅读 · 2023年5月24日

On Learning to Summarize with Large Language Models as References

Arxiv

0+阅读 · 2023年5月23日

A framework to measure the robustness of programs in the unpredictable environment

Arxiv

0+阅读 · 2023年5月23日

Exploring and Exploiting Data Heterogeneity in Recommendation

Arxiv

0+阅读 · 2023年5月21日

Benchmarking the human brain against computational architectures

Arxiv

0+阅读 · 2023年5月15日

Automated Graph Machine Learning: Approaches, Libraries and Directions

Arxiv

20+阅读 · 2022年1月4日

Heterogeneous Network Representation Learning: A Unified Framework with Survey and Benchmark

Heterogeneous Network Representation Learning: A Unified Framework with Survey and Benchmark

Arxiv

19+阅读 · 2020年12月17日

VIP会员

文章信息

相关主题

最新内容

对抗环境下超视距目标打击的情报支援

对抗环境下超视距目标打击的情报支援

专知会员服务

8+阅读 · 7月22日

《面向复杂地形下无人机跟踪地面机器人（UAV–UGV）的自适应多滤波器扩展卡尔曼滤波框架》

《面向复杂地形下无人机跟踪地面机器人（UAV–UGV）的自适应多滤波器扩展卡尔曼滤波框架》

专知会员服务

3+阅读 · 7月22日

纵深侦察：大规模作战行动中远程侦察与监视之迫切需求

纵深侦察：大规模作战行动中远程侦察与监视之迫切需求

专知会员服务

7+阅读 · 7月22日

共享认知，分布式研判：复杂行动中的美国空军指挥控制（万字长文）

共享认知，分布式研判：复杂行动中的美国空军指挥控制（万字长文）

专知会员服务

7+阅读 · 7月22日

《无人机对海面作战影响评估》

《无人机对海面作战影响评估》

专知会员服务

15+阅读 · 7月21日

《可损耗无人系统规模化应用对美国军事转型的战略影响（2022-2030）》2026年270页

《可损耗无人系统规模化应用对美国军事转型的战略影响（2022-2030）》2026年270页

专知会员服务

12+阅读 · 7月21日

博士论文 | 后训练如何损害大模型生成多样性？SimpleStrat与Stylus

博士论文 | 后训练如何损害大模型生成多样性？SimpleStrat与Stylus

专知会员服务

4+阅读 · 7月21日

综述 | 面向5G/6G网络的LLM智能体AI：架构、协议与标准化

综述 | 面向5G/6G网络的LLM智能体AI：架构、协议与标准化

专知会员服务

6+阅读 · 7月21日

五角大楼新设无人机办公室（DRPM-UxS）将如何重塑美国无人系统格局（附美国防部设立备忘录）

五角大楼新设无人机办公室（DRPM-UxS）将如何重塑美国无人系统格局（附美国防部设立备忘录）

专知会员服务

9+阅读 · 7月21日

印度精确打击与指挥架构的断层

印度精确打击与指挥架构的断层

专知会员服务

7+阅读 · 7月20日

《NASA喷气推进实验室：高耐久轻质常驻空观测系统（HELIOS）》429页

《NASA喷气推进实验室：高耐久轻质常驻空观测系统（HELIOS）》429页

专知会员服务

9+阅读 · 7月20日

美空军AI完成F-16战斗机自主空战历史性试飞

美空军AI完成F-16战斗机自主空战历史性试飞

专知会员服务

8+阅读 · 7月20日

《美政府问责局——武器系统年度评估（2026年）：强制要求成熟技术或可推动转向快速交付》249页

《美政府问责局——武器系统年度评估（2026年）：强制要求成熟技术或可推动转向快速交付》249页

专知会员服务

10+阅读 · 7月20日

《美国陆军：通过弹性分布式模型库实现自适应AI优势》

《美国陆军：通过弹性分布式模型库实现自适应AI优势》

专知会员服务

9+阅读 · 7月20日

博士论文 | 理解与改进大语言模型推理：从反转诅咒到连续思维链

博士论文 | 理解与改进大语言模型推理：从反转诅咒到连续思维链

专知会员服务

10+阅读 · 7月20日

相关VIP内容

2022最新MIT成果【ICML 2022】：一种提高人工智能的公平性和准确性的技术，Selective Regression Under Fairness Criteria

2022最新MIT成果【ICML 2022】：一种提高人工智能的公平性和准确性的技术，Selective Regression Under Fairness Criteria

专知会员服务

11+阅读 · 2022年7月26日

不可错过！《机器学习100讲》课程，UBC Mark Schmidt讲授

不可错过！《机器学习100讲》课程，UBC Mark Schmidt讲授

专知会员服务

76+阅读 · 2022年6月28日

哥伦比亚大学最新博士论文《机器学习在金融市场中的应用》Essays on the Applications of Machine Learning in Financial Markets

哥伦比亚大学最新博士论文《机器学习在金融市场中的应用》Essays on the Applications of Machine Learning in Financial Markets

专知会员服务

28+阅读 · 2022年4月8日

深度学习优化算法，73页ppt，Optimization Algorithms on Deep Learning

深度学习优化算法，73页ppt，Optimization Algorithms on Deep Learning

专知会员服务

135+阅读 · 2021年6月16日

ICLR 2021杰出论文奖出炉，8篇论文上榜！

专知会员服务

26+阅读 · 2021年4月2日

【Google 大脑】使用上千个优化任务学习超参数搜索策略，Using a thousand optimization tasks to learn hyperparameter search strategies

【Google 大脑】使用上千个优化任务学习超参数搜索策略，Using a thousand optimization tasks to learn hyperparameter search strategies

专知会员服务

18+阅读 · 2020年3月14日

【哈佛大学】机器学习的层次局限性，A Hierarchy of Limitations in Machine Learning

【哈佛大学】机器学习的层次局限性，A Hierarchy of Limitations in Machine Learning

专知会员服务

47+阅读 · 2020年2月12日

【论文推荐WWW2020-UIUC】修正排序系统中的选择偏差：Correcting for Selection Bias in Learning-to-rank Systems

【论文推荐WWW2020-UIUC】修正排序系统中的选择偏差：Correcting for Selection Bias in Learning-to-rank Systems

专知会员服务

32+阅读 · 2020年2月1日

【NeurlPS2019论文总结】一致收敛可能无法解释深度学习中的泛化现象，Uniform convergence may be unable to explain generalization in deep learning

【NeurlPS2019论文总结】一致收敛可能无法解释深度学习中的泛化现象，Uniform convergence may be unable to explain generalization in deep learning

专知会员服务

15+阅读 · 2019年12月17日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

热门VIP内容

开通专知VIP会员享更多权益服务

《面向复杂地形下无人机跟踪地面机器人（UAV–UGV）的自适应多滤波器扩展卡尔曼滤波框架》

共享认知，分布式研判：复杂行动中的美国空军指挥控制（万字长文）

对抗环境下超视距目标打击的情报支援

纵深侦察：大规模作战行动中远程侦察与监视之迫切需求

相关资讯

GNN 新基准！Long Range Graph Benchmark

GNN 新基准！Long Range Graph Benchmark

图与推荐

0+阅读 · 2022年10月18日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

【泡泡一分钟】优化对比度增强以提高SLAM重定位环境中视觉跟踪的稳健性

【泡泡一分钟】优化对比度增强以提高SLAM重定位环境中视觉跟踪的稳健性

泡泡机器人SLAM

10+阅读 · 2019年4月26日

无监督元学习表示学习

无监督元学习表示学习

CreateAMind

27+阅读 · 2019年1月4日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

AI界的State of the Art都在这里了

AI界的State of the Art都在这里了

机器之心

12+阅读 · 2018年12月10日

【论文推荐】最新十篇目标跟踪相关论文—多帧光流跟踪、动态图学习、MV-YOLO、姿态估计、深度核相关滤波、Benchmark

【论文推荐】最新十篇目标跟踪相关论文—多帧光流跟踪、动态图学习、MV-YOLO、姿态估计、深度核相关滤波、Benchmark

专知

13+阅读 · 2018年5月26日

笔记 | Sentiment Analysis

笔记 | Sentiment Analysis

黑龙江大学自然语言处理实验室

10+阅读 · 2018年5月6日

【论文】变分推断（Variational inference)的总结

【论文】变分推断（Variational inference)的总结

机器学习研究会

39+阅读 · 2017年11月16日

相关论文

Archetypal Analysis++: Rethinking the Initialization Strategy

Arxiv

0+阅读 · 2023年5月25日

Off-Policy Evaluation with Online Adaptation for Robot Exploration in Challenging Environments

Arxiv

0+阅读 · 2023年5月24日

No-Regret Online Prediction with Strategic Experts

Arxiv

0+阅读 · 2023年5月24日

Analysis of modular CMA-ES on strict box-constrained problems in the SBOX-COST benchmarking suite

Arxiv

0+阅读 · 2023年5月24日

On Learning to Summarize with Large Language Models as References

Arxiv

0+阅读 · 2023年5月23日

A framework to measure the robustness of programs in the unpredictable environment

Arxiv

0+阅读 · 2023年5月23日

Exploring and Exploiting Data Heterogeneity in Recommendation

Arxiv

0+阅读 · 2023年5月21日

Benchmarking the human brain against computational architectures

Arxiv

0+阅读 · 2023年5月15日

Automated Graph Machine Learning: Approaches, Libraries and Directions

Arxiv

20+阅读 · 2022年1月4日

Heterogeneous Network Representation Learning: A Unified Framework with Survey and Benchmark

Heterogeneous Network Representation Learning: A Unified Framework with Survey and Benchmark

Arxiv

19+阅读 · 2020年12月17日

相关基金

基于深度学习的高分辨率PolSAR影像暗目标判别

国家自然科学基金

3+阅读 · 2015年12月31日

复合水泥浆体组成和微结构的定量演变规律及其与力学性能间的关系模型

国家自然科学基金

0+阅读 · 2014年12月31日

传感网中面向链路干扰最小化的拓扑控制

国家自然科学基金

0+阅读 · 2013年12月31日

考虑降雨和地形的长江上游非点源污染时空演变模拟研究

国家自然科学基金

0+阅读 · 2013年12月31日

miR-146a靶向IRAK1与TRAF6调控非小细胞肺癌转移的机制研究

国家自然科学基金

0+阅读 · 2012年12月31日

高分辨DFT调制滤波器组的设计及其在通信信号处理中的应用

国家自然科学基金

0+阅读 · 2012年12月31日

多天线无线通信系统的鲁棒性设计

国家自然科学基金

2+阅读 · 2012年12月31日

乙二醛-尿素-甲醛共缩聚树脂的合成、性能及反应机理研究

国家自然科学基金

0+阅读 · 2012年12月31日

滇西老厂富银红土型锰矿次生富集机制及40Ar/39Ar年龄

国家自然科学基金

0+阅读 · 2012年12月31日

甘油选择性氢解用高分散纳米双金属催化剂的设计与性能研究

国家自然科学基金

0+阅读 · 2009年12月31日

微信扫码咨询专知VIP会员