As deep learning models are widely used in software systems, test generation plays a crucial role in assessing the quality of such models before deployment. To date, the most advanced test generators rely on generative AI to synthesize inputs; however, these approaches remain limited in providing semantic insight into the causes of misbehaviors and in offering fine-grained semantic control over the generated inputs. In this paper, we introduce Detect, a feature-aware test generation framework for vision-based deep learning (DL) models that systematically generates inputs by perturbing disentangled semantic attributes within the latent space. Detect perturbs individual latent features in a controlled way and observes how these changes affect the model's output. Through this process, it identifies which features lead to behavior shifts and uses a vision-language model for semantic attribution. By distinguishing between task-relevant and task-irrelevant features, Detect applies feature-aware perturbations targeted at both generalization and robustness. Empirical results across image classification and detection tasks show that Detect generates high-quality test cases with fine-grained control, reveals distinct shortcut behaviors across model architectures (convolutional and transformer-based), and exposes bugs that are not captured by accuracy metrics. Specifically, Detect outperforms a state-of-the-art test generator in decision-boundary discovery and a leading spurious-feature localization method in identifying robustness failures. Our findings show that fully fine-tuned convolutional models are prone to overfitting on localized cues, such as co-occurring visual traits, whereas weakly supervised transformers tend to rely on global features, such as environmental variations. These findings highlight the value of interpretable, feature-aware testing in improving DL model reliability.
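The core loop the abstract describes — perturbing one latent feature at a time and observing whether the model's output shifts — can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the decoder, classifier, latent dimensionality, and perturbation range here are all hypothetical stand-ins.

```python
import numpy as np

# Toy stand-ins (hypothetical): a linear "decoder" from a 4-d latent code to an
# 8-d input, and a 2-class linear "classifier" playing the model under test.
rng = np.random.default_rng(0)
W_dec = rng.normal(size=(8, 4))
W_clf = rng.normal(size=(2, 8))

def decode(z):
    """Map a latent code to a synthetic input."""
    return W_dec @ z

def predict(x):
    """Predicted class of the model under test."""
    return int(np.argmax(W_clf @ x))

def sweep_feature(z0, dim, deltas):
    """Perturb a single latent dimension and record which shifts flip the prediction."""
    base = predict(decode(z0))
    flips = [float(d) for d in deltas
             if predict(decode(z0 + d * np.eye(len(z0))[dim])) != base]
    return base, flips

# Sweep each latent dimension over a fixed perturbation range and report
# how many perturbations change the model's behavior.
z0 = rng.normal(size=4)
for dim in range(4):
    base, flips = sweep_feature(z0, dim, np.linspace(-3.0, 3.0, 25))
    print(f"latent dim {dim}: base class {base}, {len(flips)} behavior shifts")
```

Dimensions whose sweeps produce many prediction flips would, in Detect's workflow, be candidates for semantic attribution via a vision-language model; dimensions with no flips leave the model's behavior unchanged.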