Tensor Algebraic Property Skeletons: Amplifying Property-Based Testing for AI Compilers

Deep learning (DL) compilers such as TVM and ONNX-MLIR lower tensor computation graphs into optimized executables for target backends. Fuzzing research has made substantial progress in generating well-formed inputs for these AI compilers; however, input generation alone does not catch semantic deviations from the algebraic invariants that graph transformations and optimizations are expected to preserve. Although tensor algebra has been studied for decades, it has not been turned into executable property-based tests (PBTs) for DL compilers, because doing so requires jointly constructing operators, inputs, and test oracles. The central challenge is therefore no longer generating well-formed inputs for fuzzing DL compilers, but bootstrapping executable PBTs whose inputs and oracles are grounded in tensor algebra. We realize this vision in Propilot, an LLM-driven agentic property-based testing framework for DL compilers that uses GPT 5.5. First, Propilot represents tensor algebra knowledge as reusable property skeletons, each coupled with operator constraints, shape and value rules, and oracle templates. Second, given a target compiler, Propilot instantiates these skeletons into executable PBTs by generating paired tensor computation graphs, concrete tensor inputs, and the expected semantic relations that serve as oracles. Third, to prevent generated tests from degenerating into invalid or uninformative PBTs, Propilot validates each candidate PBT for applicability and safety before execution; validation feedback, execution results, and coverage signals then guide subsequent generation. We evaluate Propilot on TVM with 212 operators and 20 property skeletons, generating 4,579 PBTs. Compared with direct LLM-based PBT generation, Propilot reduces redundancy by 49% and eliminates invalid tests through explicit property skeletons. This effectiveness translates into the discovery of semantic errors and numerical discrepancies.
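To make the notion of a property skeleton concrete, the following is a minimal sketch and not Propilot's implementation. It assumes a skeleton can be modeled as a pair of computation graphs (`lhs`, `rhs`), a shape rule, a value range, and a tolerance-based oracle. All names here (`PropertySkeleton`, `is_valid`, `run_property`, `DISTRIBUTIVITY`) are hypothetical, and the pure-Python `matmul` and `add` stand in for operators that a real harness would compile with the DL compiler under test.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Matrix = List[List[float]]
Shape = Tuple[int, int]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Reference matrix product; stands in for a compiled operator."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Reference elementwise addition; stands in for a compiled operator."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


@dataclass(frozen=True)
class PropertySkeleton:
    """Hypothetical skeleton: paired graphs, shape/value rules, oracle tolerance."""
    name: str
    lhs: Callable[..., Matrix]                               # first graph of the pair
    rhs: Callable[..., Matrix]                               # graph expected to be equivalent
    shape_rule: Callable[[int, int, int], Sequence[Shape]]   # input shapes from (m, k, n)
    value_range: Tuple[float, float]                         # value rule for generated inputs
    abs_tol: float                                           # oracle: elementwise tolerance


# Algebraic invariant: A @ (B + C) == A @ B + A @ C (up to rounding).
DISTRIBUTIVITY = PropertySkeleton(
    name="matmul_distributes_over_add",
    lhs=lambda a, b, c: matmul(a, add(b, c)),
    rhs=lambda a, b, c: add(matmul(a, b), matmul(a, c)),
    shape_rule=lambda m, k, n: [(m, k), (k, n), (k, n)],
    value_range=(-1.0, 1.0),
    abs_tol=1e-9,
)


def is_valid(shapes: Sequence[Shape], value_range: Tuple[float, float]) -> bool:
    """Pre-execution check: non-empty shapes and a finite, non-empty value range."""
    lo, hi = value_range
    return (all(r > 0 and c > 0 for r, c in shapes)
            and math.isfinite(lo) and math.isfinite(hi) and lo < hi)


def random_tensor(shape: Shape, value_range: Tuple[float, float],
                  rng: random.Random) -> Matrix:
    rows, cols = shape
    return [[rng.uniform(*value_range) for _ in range(cols)] for _ in range(rows)]


def run_property(skel: PropertySkeleton, trials: int = 20, seed: int = 0) -> bool:
    """Instantiate the skeleton with random inputs; True if the oracle always holds."""
    rng = random.Random(seed)
    for _ in range(trials):
        m, k, n = (rng.randint(1, 4) for _ in range(3))
        shapes = skel.shape_rule(m, k, n)
        if not is_valid(shapes, skel.value_range):
            continue  # skip candidates that fail validation
        inputs = [random_tensor(s, skel.value_range, rng) for s in shapes]
        left, right = skel.lhs(*inputs), skel.rhs(*inputs)
        for row_l, row_r in zip(left, right):
            for x, y in zip(row_l, row_r):
                if not math.isclose(x, y, rel_tol=0.0, abs_tol=skel.abs_tol):
                    return False
    return True
```

Calling `run_property(DISTRIBUTIVITY)` checks the invariant over 20 randomly shaped input triples. In a real setting, `lhs` and `rhs` would presumably be built as graphs and executed through the compiler, so that a violated oracle points to a transformation that failed to preserve the algebraic relation.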
