Deep Learning (DL) compilers are widely adopted to optimize advanced DL models for efficient deployment on diverse hardware. Their quality has a profound effect on the quality of the compiled DL models. A recent bug study shows that the optimization of high-level intermediate representation (IR) is the most error-prone compilation stage: bugs in this stage account for 44.92% of all collected bugs. However, existing testing techniques do not consider features related to high-level optimization (e.g., high-level IR) and are therefore weak at exposing bugs in this stage. To bridge this gap, we propose HirGen, an automated testing technique that aims to effectively expose coding mistakes in the optimization of high-level IR. The design of HirGen includes 1) three coverage criteria to generate diverse and valid computational graphs; 2) full use of the language features of high-level IR to generate diverse IRs; 3) three test oracles inspired by both differential testing and metamorphic testing. HirGen has successfully detected 21 bugs in TVM, of which 17 have been confirmed and 12 fixed. Further, we construct four baselines using the state-of-the-art DL compiler fuzzers that can cover the high-level optimization stage. Our experimental results show that HirGen can detect 10 crashes and inconsistencies that the baselines cannot detect within 48 hours. We further validate the usefulness of our proposed coverage criteria and test oracles in the evaluation.
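To make the idea of differential and metamorphic test oracles concrete, the following is a minimal sketch in Python. It is not HirGen's implementation and does not use TVM: the tuple-based toy IR and the functions `evaluate`, `optimize`, `buggy_optimize`, `random_expr`, `differential_oracle`, and `metamorphic_oracle` are all hypothetical names introduced here for illustration. The differential oracle compares an unoptimized run against an optimized run, and the metamorphic oracle compares a program against a semantically equivalent mutant.

```python
import math
import random

# Toy high-level IR as nested tuples, e.g. ("add", ("var", "x"), ("const", 1.0)).
# Hypothetical stand-in for illustration only; this is not TVM's IR.

def evaluate(expr, env):
    """Reference interpreter for the toy IR."""
    op = expr[0]
    if op == "const":
        return expr[1]
    if op == "var":
        return env[expr[1]]
    lhs, rhs = evaluate(expr[1], env), evaluate(expr[2], env)
    if op == "add":
        return lhs + rhs
    if op == "mul":
        return lhs * rhs
    raise ValueError(f"unknown op: {op}")

def optimize(expr):
    """Assumed optimization pass: constant folding, x + 0 -> x, x * 1 -> x."""
    op = expr[0]
    if op in ("const", "var"):
        return expr
    lhs, rhs = optimize(expr[1]), optimize(expr[2])
    if lhs[0] == "const" and rhs[0] == "const":
        return ("const", evaluate((op, lhs, rhs), {}))
    if op == "add" and rhs == ("const", 0.0):
        return lhs
    if op == "mul" and rhs == ("const", 1.0):
        return lhs
    return (op, lhs, rhs)

def buggy_optimize(expr):
    """Deliberately wrong pass: drops any constant addend, not just zero."""
    if expr[0] == "add" and expr[2][0] == "const":
        return expr[1]
    return expr

def random_expr(rng, depth):
    """Generate a random, always-valid expression of bounded depth."""
    if depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.5:
            return ("var", rng.choice(["x", "y"]))
        return ("const", float(rng.randint(-3, 3)))
    op = rng.choice(["add", "mul"])
    return (op, random_expr(rng, depth - 1), random_expr(rng, depth - 1))

def _compare(expected, run_candidate):
    """Shared check: report a crash or an inconsistency, else None."""
    try:
        actual = run_candidate()
    except Exception as exc:  # the pipeline under test crashed
        return f"crash: {exc!r}"
    if not math.isclose(expected, actual, rel_tol=1e-9, abs_tol=1e-12):
        return f"inconsistency: {expected} != {actual}"
    return None

def differential_oracle(expr, env, opt):
    """Compare the unoptimized result with the result after running `opt`."""
    expected = evaluate(expr, env)
    return _compare(expected, lambda: evaluate(opt(expr), env))

def metamorphic_oracle(expr, env, opt):
    """Compare `expr` with an equivalent mutant, both run through `opt`."""
    mutant = ("add", ("mul", expr, ("const", 1.0)), ("const", 0.0))
    expected = evaluate(opt(expr), env)
    return _compare(expected, lambda: evaluate(opt(mutant), env))

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(100):
        expr = random_expr(rng, 4)
        env = {"x": rng.uniform(-2, 2), "y": rng.uniform(-2, 2)}
        for oracle in (differential_oracle, metamorphic_oracle):
            report = oracle(expr, env, optimize)
            if report is not None:
                print(report, expr)
```

Passing `buggy_optimize` instead of `optimize` to `differential_oracle` on an input such as `("add", ("var", "x"), ("const", 2.0))` shows how an incorrect rewrite surfaces as an inconsistency. A real fuzzer would replace the toy interpreter and pass with the compiler under test and would compare tensors rather than scalars.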