Mutation analysis is a powerful technique for assessing test-suite adequacy, yet conventional approaches tend to generate redundant, equivalent, or non-executable mutants. These challenges are amplified in Simulink-Stateflow models, whose hierarchical structure integrates continuous dynamics with discrete-event behaviors and which are widely deployed in safety-critical Cyber-Physical Systems (CPSs). Prior work has explored machine learning and manually engineered mutation operators, but these approaches remain constrained by limited training data and scalability issues. Motivated by recent advances in Large Language Models (LLMs), we investigate their potential to generate high-quality, domain-specific mutants for Simulink-Stateflow models. We develop an automated pipeline that converts Simulink-Stateflow models to structured JSON representations, and we use it to systematically evaluate different mutation and prompting strategies across eight state-of-the-art LLMs. In a comprehensive empirical study involving 38,400 LLM-generated mutants across four Simulink-Stateflow models, we show that LLMs generate mutants up to 13x faster than a manually engineered mutation-based baseline, while producing significantly fewer equivalent and duplicate mutants and consistently achieving superior mutant quality. Our analysis further shows that few-shot prompting combined with low-to-medium temperature values yields the best results. We provide an open-source prototype tool and release our complete dataset to facilitate reproducibility and advance future research in this domain.