Machine learning (ML) and artificial intelligence (AI) approaches are often criticized for their inherent bias and for their lack of control, accountability, and transparency. Consequently, regulatory bodies struggle to contain this technology's potential negative side effects. High-level requirements such as fairness and robustness need to be formalized into concrete specification metrics: imperfect proxies that capture isolated aspects of the underlying requirements. Given possible trade-offs between different metrics and their vulnerability to over-optimization, integrating specification metrics into system development processes is not trivial. This paper defines specification overfitting, a scenario where systems focus excessively on specified metrics to the detriment of high-level requirements and task performance. We present an extensive literature survey to categorize how researchers propose, measure, and optimize specification metrics in several AI fields (e.g., natural language processing, computer vision, reinforcement learning). Using a keyword-based search over papers from major AI conferences and journals published between 2018 and mid-2023, we identify and analyze 74 papers that propose or optimize specification metrics. We find that although most papers implicitly address specification overfitting (e.g., by reporting more than one specification metric), they rarely discuss what role specification metrics should play in system development or explicitly define the scope and assumptions behind their metric formulations.