Deep learning (DL) has attracted wide attention and been widely deployed in recent years. As a result, increasing research effort has been dedicated to testing DL libraries and frameworks. However, existing work has largely overlooked one crucial component of any DL system: automatic differentiation (AD), which underpins the recent development of DL. To this end, we propose $\nabla$Fuzz, the first general and practical approach specifically targeting the critical AD component in DL libraries. Our key insight is that each DL library API can be abstracted into a function processing tensors/vectors, which can be differentially tested under various execution scenarios (computing outputs/gradients with different implementations). We have implemented $\nabla$Fuzz as a fully automated API-level fuzzer targeting AD in DL libraries. It applies differential testing across different execution scenarios to test both first-order and higher-order gradients, and includes automated filtering strategies to remove false positives caused by numerical instability. We have performed an extensive study on four of the most popular and actively maintained DL libraries: PyTorch, TensorFlow, JAX, and OneFlow. The results show that $\nabla$Fuzz substantially outperforms state-of-the-art fuzzers in terms of both code coverage and bug detection. To date, $\nabla$Fuzz has detected 173 bugs in the studied DL libraries, 144 of which have been confirmed by developers (117 of these are previously unknown bugs and 107 are related to AD). Remarkably, $\nabla$Fuzz contributed 58.3% (7/12) of all high-priority AD bugs for PyTorch and JAX during a two-month period. None of the confirmed AD bugs were detected by existing fuzzers.
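To make the key insight concrete, the following is a minimal sketch of differential testing across execution scenarios. It is not the $\nabla$Fuzz implementation. It assumes three scenarios (direct invocation, forward-mode AD, and numerical differentiation) and uses a toy dual-number AD written in plain Python in place of a real DL library. The names `Dual`, `smooth_api`, `buggy_square`, and `differential_test` are hypothetical, and the fixed tolerance is only a crude stand-in for the filtering of numerical-instability false positives.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A value paired with its derivative: a toy forward-mode AD number."""
    val: float
    dot: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

    __rmul__ = __mul__


def _lift(x):
    """Promote a plain number to a Dual with zero derivative."""
    return x if isinstance(x, Dual) else Dual(float(x))


def sin(x):
    """Sine that works on both plain floats and Dual numbers."""
    if isinstance(x, Dual):
        return Dual(math.sin(x.val), math.cos(x.val) * x.dot)
    return math.sin(x)


def smooth_api(x):
    """Hypothetical 'API under test' with correct derivative rules."""
    return x * sin(x) + 3 * x


def buggy_square(x):
    """Hypothetical API with a deliberately wrong derivative rule (x, not 2x)."""
    if isinstance(x, Dual):
        return Dual(x.val * x.val, x.val * x.dot)
    return x * x


def differential_test(api, x, h=1e-5, tol=1e-6):
    """Run one API under three scenarios and report any inconsistencies."""
    direct = api(x)                                  # scenario 1: plain call
    forward = api(Dual(x, 1.0))                      # scenario 2: forward-mode AD
    numeric = (api(x + h) - api(x - h)) / (2 * h)    # scenario 3: central difference

    issues = []
    if not math.isclose(direct, forward.val, rel_tol=tol, abs_tol=tol):
        issues.append("output mismatch")
    if not math.isclose(forward.dot, numeric, rel_tol=tol, abs_tol=tol):
        issues.append("gradient mismatch")
    return issues


if __name__ == "__main__":
    print(differential_test(smooth_api, 0.7))
    print(differential_test(buggy_square, 1.5))
```

In this sketch the outputs of `buggy_square` agree across scenarios, so only the gradient comparison exposes the faulty derivative rule. This illustrates why a fuzzer that checks outputs alone could miss AD bugs.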