Functional trustworthiness of AI systems by statistically valid testing

The authors are concerned about the safety, health, and rights of European citizens because the current draft of the EU Artificial Intelligence (AI) Act requires inadequate measures and procedures for the conformity assessment of AI systems. We observe that both the current draft of the EU AI Act and the accompanying standardization efforts in CEN/CENELEC have retreated to the position that real functional guarantees for AI systems would be unrealistic and too complex anyway. Yet enacting a conformity assessment procedure that creates an illusion of trust in insufficiently assessed AI systems is at best naive and at worst grossly negligent. The EU AI Act thus misses the point of ensuring quality through functional trustworthiness and of correctly attributing responsibilities. The trustworthiness of an AI decision system rests first and foremost on correct statistical testing on randomly selected samples and on a precise definition of the application domain, which is what makes drawing samples possible in the first place. We subsequently call this testable quality functional trustworthiness. It includes a design, development, and deployment that enable correct statistical testing of all relevant functions. We are firmly convinced, and advocate, that a reliable assessment of the statistical functional properties of an AI system must be the indispensable, mandatory nucleus of the conformity assessment. In this paper, we describe the three elements necessary to establish reliable functional trustworthiness: (1) the definition of the technical distribution of the application, (2) the risk-based minimum performance requirements, and (3) statistically valid testing based on independent random samples.
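To make elements (2) and (3) concrete, the following minimal sketch checks a minimum accuracy requirement against a test result obtained on independent random samples. It is an illustration under stated assumptions, not the procedure the authors prescribe: the one-sided Hoeffding bound is chosen here only because it is simple and distribution-free, and the function names, the 0.90 requirement, and the 5% error level are all hypothetical.

```python
import math


def hoeffding_lower_bound(correct: int, n: int, delta: float) -> float:
    """One-sided lower confidence bound on the true accuracy.

    Assumes the n test samples were drawn independently at random from
    the defined application distribution. Under that assumption, the
    true accuracy is at least the returned value with probability
    >= 1 - delta (Hoeffding's inequality; conservative, not exact).
    """
    if n <= 0 or not 0 <= correct <= n:
        raise ValueError("need n > 0 and 0 <= correct <= n")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    observed_accuracy = correct / n
    margin = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return max(0.0, observed_accuracy - margin)


def passes_requirement(correct: int, n: int, min_accuracy: float,
                       delta: float = 0.05) -> bool:
    """Hypothetical acceptance rule: pass only if the lower confidence
    bound, not merely the observed accuracy, meets the requirement."""
    return hoeffding_lower_bound(correct, n, delta) >= min_accuracy


if __name__ == "__main__":
    # Illustrative numbers only: 950 correct decisions on 1000 samples.
    bound = hoeffding_lower_bound(950, 1000, delta=0.05)
    print(f"lower bound on accuracy: {bound:.4f}")
    print("meets 0.90 requirement:", passes_requirement(950, 1000, 0.90))
    print("meets 0.93 requirement:", passes_requirement(950, 1000, 0.93))
```

In this sketch an observed accuracy of 0.95 on 1000 samples supports a requirement of 0.90 but not one of 0.93, which shows why the sample size and the risk-based threshold have to be fixed together. The guarantee holds only if the samples really are independent draws from the defined application distribution.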
