Machine learning (ML) and artificial intelligence (AI) approaches are often criticized for their inherent bias and for their lack of control, accountability, and transparency. Consequently, regulatory bodies struggle to contain this technology's potential negative side effects. High-level requirements such as fairness and robustness need to be formalized into concrete specification metrics: imperfect proxies that capture isolated aspects of the underlying requirements. Given possible trade-offs between different metrics and their vulnerability to over-optimization, integrating specification metrics into system development processes is not trivial. This paper defines specification overfitting, a scenario where systems focus excessively on specified metrics to the detriment of high-level requirements and task performance. We present an extensive literature survey to categorize how researchers propose, measure, and optimize specification metrics in several AI fields (e.g., natural language processing, computer vision, reinforcement learning). Using a keyword-based search over papers from major AI conferences and journals published between 2018 and mid-2023, we identify and analyze 74 papers that propose or optimize specification metrics. We find that although most papers implicitly address specification overfitting (e.g., by reporting more than one specification metric), they rarely discuss what role specification metrics should play in system development or explicitly define the scope and assumptions behind metric formulations.