The quality assessment of Artificial Intelligence (AI) systems is a fundamental challenge due to their inherently probabilistic nature. Standards such as ISO/IEC 25059 provide a quality model, but they lack practical and statistically robust methods for assessing functional correctness. This paper proposes and evaluates the Statistical Confidence in Functional Correctness (SCFC) approach, which seeks to fill this gap by connecting business requirements to a measure of statistical confidence that accounts for both the model's average performance and its variability. The approach consists of four steps: defining quantitative specification limits, performing stratified and probabilistic sampling, applying bootstrapping to estimate a confidence interval for the performance metric, and calculating a capability index as a final indicator. The approach was evaluated through a case study on two real-world industrial AI systems, complemented by interviews with AI experts. The experts provided valuable insights into the utility and ease of use of the methodology, as well as their intention to adopt it in practical scenarios. We conclude that the proposed approach is a feasible and valuable way to operationalize the assessment of functional correctness, moving the evaluation from a point estimate to a statement of statistical confidence.
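To make the four steps concrete, the following is a minimal sketch of the statistical core of the approach: bootstrapping a confidence interval for a performance metric and deriving a capability index against a lower specification limit. All function names, parameter values, and the Cpk-style formula used for the capability index are illustrative assumptions for this sketch, not the exact definitions used by SCFC in the paper.

```python
import numpy as np

def bootstrap_metric_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05, seed=0):
    """Bootstrap a confidence interval for a performance metric by
    resampling the evaluation set with replacement (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)          # resample indices with replacement
        scores[b] = metric(y_true[idx], y_pred[idx])
    lower = np.quantile(scores, alpha / 2)
    upper = np.quantile(scores, 1 - alpha / 2)
    return scores.mean(), scores.std(ddof=1), (lower, upper)

def capability_index(mean_score, std_score, lsl):
    """One-sided capability index relative to a lower specification limit,
    computed analogously to Cpk (assumed formula, for illustration only)."""
    return (mean_score - lsl) / (3 * std_score)

if __name__ == "__main__":
    # Synthetic evaluation data: a binary classifier with roughly 90% accuracy.
    accuracy = lambda yt, yp: float(np.mean(yt == yp))
    rng = np.random.default_rng(1)
    y_true = rng.integers(0, 2, 500)
    y_pred = np.where(rng.random(500) < 0.9, y_true, 1 - y_true)

    mean_acc, std_acc, ci = bootstrap_metric_ci(y_true, y_pred, accuracy)
    print(f"accuracy: {mean_acc:.3f}, 95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
    print(f"capability index vs. LSL=0.85: {capability_index(mean_acc, std_acc, 0.85):.2f}")
```

In this sketch, the lower specification limit (here 0.85) plays the role of the quantitative requirement defined in the first step, and the capability index summarizes how comfortably the bootstrapped metric distribution clears that limit.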