Testing Deep Learning Libraries via Neurosymbolic Constraint Learning

Deep Learning (DL) libraries (e.g., PyTorch) are popular in AI development. These libraries are complex and contain bugs. Researchers have proposed various bug-finding techniques for such libraries. Yet, there is much room for improvement. A key challenge in testing DL libraries is the lack of API specifications. Prior testing approaches often inaccurately model the input specifications of DL APIs, resulting in missed valid inputs that could reveal bugs or false alarms due to invalid inputs. To address this challenge, we develop Centaur -- the first neurosymbolic technique to test DL library APIs using dynamically learned input constraints. Centaur leverages the key idea that formal API constraints can be learned from a small number of automatically generated seed inputs, and that the learned constraints can be solved using SMT solvers to generate valid and diverse test inputs. We develop a novel grammar that represents first-order logic formulae over API parameters and expresses tensor-related properties (e.g., shape, data types) as well as relational properties between parameters. We use the grammar to guide a Large Language Model (LLM) to enumerate syntactically correct candidate rules, validated using seed inputs. Further, we develop a custom refinement strategy to prune the set of learned rules to eliminate spurious or redundant rules. We use the learned constraints to systematically generate valid and diverse inputs by integrating SMT-based solving with randomized sampling. We evaluate Centaur for testing PyTorch and TensorFlow. Our results show that Centaur's constraints have a recall of 94.0% and a precision of 94.0% on average. In terms of coverage, Centaur covers 203, 150, and 9,608 more branches than TitanFuzz, ACETest and Pathfinder, respectively. Using Centaur, we also detect 26 new bugs in PyTorch and TensorFlow, 18 of which are confirmed.

翻译：深度学习（DL）库（如 PyTorch）在人工智能开发中广泛应用。这些库结构复杂且存在缺陷。研究人员已针对此类库提出了多种错误检测技术，但仍有很大改进空间。测试深度学习库的一个关键挑战在于缺乏 API 规范。现有测试方法往往无法准确建模 DL API 的输入规范，导致可能揭示错误的有效输入被遗漏，或因无效输入而产生误报。为应对这一挑战，我们开发了 Centaur——首个利用动态学习输入约束来测试深度学习库 API 的神经符号技术。Centaur 的核心思想是：可以从少量自动生成的种子输入中学习形式化的 API 约束，并利用 SMT 求解器对学习到的约束进行求解，从而生成有效且多样化的测试输入。我们设计了一种新颖的语法，用于表示 API 参数上的一阶逻辑公式，既能表达张量相关属性（如形状、数据类型），也能表达参数间的关联属性。我们利用该语法引导大型语言模型（LLM）枚举语法正确的候选规则，并通过种子输入进行验证。此外，我们开发了定制化的精炼策略，对学习到的规则集进行剪枝，以消除虚假或冗余规则。通过将基于 SMT 的求解与随机采样相结合，我们利用学习到的约束系统性地生成有效且多样化的输入。我们在 PyTorch 和 TensorFlow 上对 Centaur 进行了评估。实验结果表明，Centaur 学习到的约束平均召回率达到 94.0%，精确率达到 94.0%。在代码覆盖率方面，Centaur 分别比 TitanFuzz、ACETest 和 Pathfinder 多覆盖 203、150 和 9,608 个分支。利用 Centaur，我们在 PyTorch 和 TensorFlow 中检测到 26 个新错误，其中 18 个已得到官方确认。