Machine learning (ML)-based software systems are rapidly gaining adoption across various domains, making it increasingly essential to ensure they perform as intended. This report presents best practices for the Test and Evaluation (T&E) of ML-enabled software systems across their lifecycle. We divide the lifecycle of ML-enabled software systems into three stages: component, integration and deployment, and post-deployment. At the component stage, the primary objective is to test and evaluate the ML model as a standalone component. In the integration and deployment stage, the goal is to evaluate an integrated ML-enabled system consisting of both ML and non-ML components. Finally, once the ML-enabled software system is deployed and operationalized, the T&E objective is to ensure the system performs as intended. Maintenance activities span the entire lifecycle and involve maintaining the various assets of ML-enabled software systems. Given the unique characteristics of ML-enabled software systems, their T&E is challenging. While significant research has been reported on T&E at the component stage, limited work has been reported on T&E in the remaining two stages. Furthermore, in many cases, systematic T&E strategies are lacking throughout the ML-enabled system's lifecycle. This leads practitioners to resort to ad hoc T&E practices, which can undermine user confidence in the reliability of ML-enabled software systems. New systematic testing approaches, adequacy measurements, and metrics are required to address the T&E challenges across all stages of the ML-enabled system lifecycle.