Despite increased adoption of and advances in machine learning (ML), studies show that many ML prototypes do not reach the production stage and that testing is still largely limited to model properties, such as model performance, without considering requirements derived from the system the model will be part of, such as throughput, resource consumption, or robustness. This limited view of testing leads to failures in model integration, deployment, and operations. In traditional software development, quality models such as ISO 25010 provide a widely used, structured framework to assess software quality, define quality requirements, and provide a common language for communication with stakeholders. A newer standard, ISO 25059, defines a more specific quality model for AI systems. However, this standard combines system attributes with ML component attributes, which is not helpful for a model developer, because many system attributes cannot be assessed at the component level. In this paper, we present a quality model for ML components that serves as a guide for requirements elicitation and negotiation. It provides a common vocabulary that ML component developers and system stakeholders can use to define and agree on system-derived requirements and to focus their testing efforts accordingly. The quality model was validated through a survey in which the participants agreed with its relevance and value. The quality model has also been successfully integrated into an open-source tool for ML component testing and evaluation, demonstrating its practical application.