Testing is a major approach to ensuring the quality of deep learning (DL) libraries. Existing testing techniques commonly adopt differential testing to alleviate the need for test oracle construction. However, these techniques are limited in finding implementations that offer the same functionality and in generating diverse test inputs for differential testing. This paper introduces DLLens, a novel differential testing technique for DL library testing. Our insight is that APIs in different DL libraries are commonly designed to accomplish various computations for the same set of published DL algorithms. Although the mapping between these APIs is often not one-to-one, we observe that their computations can mutually simulate each other after proper composition and adaptation. The use of these simulation counterparts facilitates differential testing for the detection of functional DL library bugs. Leveraging this insight, we propose DLLens as a novel mechanism that utilizes a large language model (LLM) to synthesize valid counterparts of DL library APIs. To generate diverse test inputs, DLLens incorporates an LLM-aided static analysis method that extracts path constraints from all execution paths in the implementations of each API and its counterpart. These path constraints are then used to guide the generation of diverse test inputs. We evaluate DLLens on two popular DL libraries, TensorFlow and PyTorch. Our evaluation shows that DLLens synthesizes counterparts for more than twice as many APIs as state-of-the-art techniques do on these libraries. Moreover, DLLens extracts 26.7% more constraints and detects 2.5 times as many bugs as state-of-the-art techniques. DLLens has found 56 bugs in recent versions of TensorFlow and PyTorch. Among them, 41 were previously unknown; 39 of these have been confirmed by developers after reporting, and 19 of the confirmed bugs have already been fixed.
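To make the notion of a simulation counterpart concrete, the sketch below pairs a PyTorch API with a hand-written TensorFlow composition and compares their outputs under a numerical tolerance. The counterpart pairing (torch.nn.functional.softmin simulated by tf.nn.softmax over negated inputs) and the helper name check_counterpart are illustrative assumptions of ours; DLLens synthesizes such counterparts with an LLM rather than hard-coding them.

    # A minimal sketch of the differential-testing idea, assuming an
    # illustrative counterpart pair chosen by hand (not by DLLens).
    import numpy as np
    import tensorflow as tf
    import torch

    def check_counterpart(x: np.ndarray, atol: float = 1e-6) -> None:
        # PyTorch API under test: softmin along the last dimension.
        pt_out = torch.nn.functional.softmin(torch.from_numpy(x), dim=-1).numpy()
        # TensorFlow simulation counterpart: softmin(x) == softmax(-x), i.e.,
        # a composition (negate, then softmax) rather than a one-to-one API match.
        tf_out = tf.nn.softmax(-tf.constant(x), axis=-1).numpy()
        # Oracle: the two implementations should agree within tolerance; a
        # divergence flags a potential functional bug in one of the libraries.
        if not np.allclose(pt_out, tf_out, atol=atol):
            raise AssertionError("torch softmin diverges from its TF counterpart")

    check_counterpart(np.random.randn(4, 8).astype(np.float32))

Agreement within the tolerance serves as the test oracle; a divergence on some generated input indicates a potential functional bug in one of the two libraries, which is the signal differential testing relies on.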