Automated Test Generation for Medical Rules Web Services: A Case Study at the Cancer Registry of Norway

arXiv preprint; 12 pages, 2 figures, 5 tables. Accepted to the industry track of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE '23).

The Cancer Registry of Norway (CRN) collects, curates, and manages data related to cancer patients in Norway, supported by an interactive, human-in-the-loop, socio-technical decision support software system. Automated testing of this system is essential, yet it currently plays only a limited role in CRN's practice. To address this, we present an industrial case study that evaluates EvoMaster, an AI-based system-level testing tool, in terms of its effectiveness in testing CRN's software system. In particular, we focus on GURI, CRN's medical rule engine and a key component at the CRN. We test GURI with EvoMaster's black-box and white-box tools and study their test effectiveness regarding code coverage, errors found, and domain-specific rule coverage. The results show that all EvoMaster tools achieve similar code coverage (around 19% line, 13% branch, and 20% method) and find a similar number of errors (1 in GURI's code). Concerning domain-specific coverage, EvoMaster's black-box tool is the most effective in generating tests that lead to applied rules (100% of the aggregation rules and between 12.86% and 25.81% of the validation rules) and to diverse rule execution results (86.84% to 89.95% of the aggregation rules and 0.93% to 1.72% of the validation rules pass, and 1.70% to 3.12% of the aggregation rules and 1.58% to 3.74% of the validation rules fail). We further observe that the results are consistent across 10 versions of the rules. Based on these results, we recommend using EvoMaster's black-box tool to test GURI, since it provides good results and advances the current state of practice at the CRN. Nonetheless, EvoMaster needs to be extended with domain-specific optimization objectives to improve test effectiveness further. Finally, we conclude with lessons learned and potential research directions, which we believe are generally applicable.
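
To make the notion of domain-specific rule coverage more concrete, the following minimal sketch shows one way such percentages could be computed from a log of rule executions. It is an illustration only: the abstract does not give the exact metric definitions, so this sketch assumes that a rule counts as "applied" if it passed or failed at least once. The function name `rule_coverage`, the result labels (`"pass"`, `"fail"`, `"not_applied"`), and the rule identifiers are all hypothetical and are not taken from GURI or EvoMaster.

```python
from collections import defaultdict


def rule_coverage(rule_ids, executions):
    """Percentage of rules observed at least once with each result.

    rule_ids:   iterable of all known rule identifiers (hypothetical).
    executions: iterable of (rule_id, result) pairs, where result is one of
                "pass", "fail", or "not_applied" (assumed labels).
    """
    rules = set(rule_ids)
    if not rules:
        raise ValueError("rule_ids must not be empty")

    seen = defaultdict(set)  # result label -> set of rule ids seen with it
    for rule_id, result in executions:
        if rule_id in rules:  # ignore executions of unknown rules
            seen[result].add(rule_id)

    # Assumption: a rule is "applied" if it either passed or failed.
    applied = seen["pass"] | seen["fail"]
    return {
        "applied": 100.0 * len(applied) / len(rules),
        "pass": 100.0 * len(seen["pass"]) / len(rules),
        "fail": 100.0 * len(seen["fail"]) / len(rules),
    }


if __name__ == "__main__":
    # Made-up rule ids and results, for illustration only.
    rules = ["V01", "V02", "V03", "V04"]
    log = [
        ("V01", "pass"),
        ("V01", "fail"),
        ("V02", "not_applied"),
        ("V03", "pass"),
    ]
    print(rule_coverage(rules, log))
```

Under these assumptions, a single rule can contribute to both the pass and the fail percentage if different tests drive it to different results, which is one way to read "diverse rule execution results".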
