Effort-Optimized, Accuracy-Driven Labelling and Validation of Test Inputs for DL Systems: A Mixed-Integer Linear Programming Approach

Software systems increasingly include AI components based on deep learning (DL). Reliable testing of such systems requires near-perfect test-input validity and label accuracy, achieved with minimal human effort. Yet the DL community has largely overlooked the need to build highly accurate datasets at low cost, since DL training is generally tolerant of labelling errors. The challenge instead reflects a concern more familiar to software engineering, where a central goal is to construct test inputs with accuracy as close to 100% as possible while keeping the associated costs in check. In this article, we introduce OPAL, a human-assisted labelling method that can be configured to target a desired accuracy level while minimizing the manual effort required for labelling. The main contribution of OPAL is a mixed-integer linear programming (MILP) formulation that minimizes labelling effort subject to a specified accuracy target. To evaluate OPAL, we instantiate it for two tasks in the context of testing vision systems: automated labelling of test inputs and automated validation of test inputs. Our evaluation comprises more than 2500 experiments on nine datasets and compares OPAL with eight baseline methods. It shows that OPAL, relying on its MILP formulation, achieves an average accuracy of 98.8% while cutting manual labelling by more than half. When all methods are given the same manual-labelling budget, OPAL significantly outperforms automated labelling baselines in labelling accuracy across all nine datasets. For automated test-input validation, OPAL on average reduces manual effort by 28.8% while achieving 4.5% higher accuracy than the state-of-the-art (SOTA) test-input validation baselines. Finally, we show that augmenting OPAL with an active-learning loop leads to an additional 4.5% reduction in required manual labelling, without compromising accuracy.
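
To make the idea of "minimizing labelling effort subject to an accuracy target" concrete, the following is a minimal illustrative sketch and not OPAL's actual formulation, which the abstract does not spell out. It assumes, hypothetically, that test inputs are partitioned into groups, that each group has an estimated automatic-labelling accuracy, and that manual labels are always correct. One binary decision per group selects manual or automatic labelling. For such a tiny instance, the sketch enumerates all assignments instead of calling an MILP solver. The function name `plan_manual_labelling` and the example numbers are invented for illustration.

```python
from itertools import product


def plan_manual_labelling(groups, target_accuracy):
    """Pick which groups to label manually (hypothetical toy model).

    groups: list of (size, estimated_auto_accuracy) pairs.
    Returns (manual_label_count, choice), where choice[i] == 1 means
    group i is labelled manually, or None if no assignment is feasible.
    """
    total = sum(size for size, _ in groups)
    best = None
    # Enumerate every 0/1 assignment; a real MILP solver would replace this loop.
    for choice in product((0, 1), repeat=len(groups)):
        # Objective: number of inputs that need a human label.
        manual = sum(size for (size, _), x in zip(groups, choice) if x)
        # Expected correct labels: manual labels are assumed perfect,
        # automatic labels are correct at the group's estimated rate.
        correct = sum(
            size if x else size * acc
            for (size, acc), x in zip(groups, choice)
        )
        # Linear accuracy constraint: correct / total >= target_accuracy.
        feasible = correct >= target_accuracy * total - 1e-9
        if feasible and (best is None or manual < best[0]):
            best = (manual, choice)
    return best


if __name__ == "__main__":
    # Three hypothetical groups: (size, estimated auto-labelling accuracy).
    example_groups = [(500, 0.995), (300, 0.97), (200, 0.80)]
    print(plan_manual_labelling(example_groups, 0.98))
```

In this toy instance, manually labelling only the least reliable group (200 of 1000 inputs) is enough to reach the 98% target, while raising the target to 99.9% forces manual labelling of every group. Both the objective and the constraint are linear in the binary decisions, which is what would allow an MILP solver to handle larger instances.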
