CITADEL: Context Similarity Based Deep Learning Framework Bug Finding

As deep learning (DL) technology becomes an integral part of new intelligent software, tools for DL framework testing and bug finding are in high demand. Existing DL framework testing tools cover only a limited range of bug types. For example, they cannot find performance bugs, which are critical for DL model training and inference in terms of performance, economic cost, and environmental impact. Finding such bugs is challenging because test oracles for performance bugs are hard to obtain. Moreover, existing tools are inefficient: they generate hundreds of test cases, few of which trigger bugs. In this paper, we propose CITADEL, a method that improves both the efficiency and the effectiveness of bug finding. We observe that many DL framework bugs are similar because operators and algorithms belonging to the same family (e.g., Conv2D and Conv3D) are similar. Orthogonal to existing bug-finding tools, CITADEL aims to find new bugs that are similar to reported ones that have known test oracles. It first collects existing bug reports and identifies the problematic APIs. It then defines context similarity to measure the similarity of DL framework API pairs, and automatically generates test cases with oracles for APIs that are similar to the problematic APIs in existing bug reports. CITADEL covers 1,436 PyTorch and 5,380 TensorFlow APIs and detects 79 and 80 API bugs, respectively, of which 58 and 68 are new and 36 and 58 have been confirmed. Many of these, e.g., the 11 performance bugs, cannot be detected by existing tools. Moreover, 35.40% of the test cases generated by CITADEL trigger bugs, far exceeding the ratios of 0.74%, 1.23%, and 3.90% achieved by the state-of-the-art methods DocTer, DeepREL, and TitanFuzz, respectively.
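To make the idea of ranking APIs by similarity to a known problematic API concrete, the following is a minimal sketch. It is not CITADEL's context similarity, which the abstract does not define. As a stand-in, it assumes that an API can be summarized by its set of parameter names and uses Jaccard similarity over those sets. The table `API_PARAMS`, the parameter lists in it, the function names, and the 0.5 threshold are all hypothetical and chosen only for illustration.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical API descriptions (assumption): API name -> set of parameter names.
# These are illustrative and are not taken from any framework's documentation.
API_PARAMS: Dict[str, FrozenSet[str]] = {
    "Conv2D": frozenset(
        {"input", "weight", "bias", "stride", "padding", "dilation", "groups"}
    ),
    "Conv3D": frozenset(
        {"input", "weight", "bias", "stride", "padding", "dilation", "groups"}
    ),
    "MaxPool2D": frozenset(
        {"input", "kernel_size", "stride", "padding", "dilation", "ceil_mode"}
    ),
    "Linear": frozenset({"input", "weight", "bias"}),
}


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity of two sets; a stand-in for a real similarity metric."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similar_apis(
    problematic: str,
    apis: Dict[str, FrozenSet[str]],
    threshold: float = 0.5,  # assumed cutoff, not a value from the paper
) -> List[Tuple[str, float]]:
    """Return APIs whose similarity to `problematic` is at least `threshold`,
    sorted by descending similarity and then by name."""
    target = apis[problematic]
    scored = [
        (name, jaccard(target, params))
        for name, params in apis.items()
        if name != problematic
    ]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Suppose an existing bug report names Conv2D as the problematic API.
    for name, score in similar_apis("Conv2D", API_PARAMS):
        print(f"{name}: {score:.2f}")
```

Under these assumptions, a bug reported against `Conv2D` would select `Conv3D` as a candidate for a transferred test case, while `MaxPool2D` and `Linear` fall below the threshold. A real pipeline would still need to adapt the reported test input and its oracle to the candidate API, which this sketch does not attempt.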
