Deep Learning (DL) compilers are widely adopted to optimize advanced DL models for efficient deployment on diverse hardware. Their quality has a profound effect on the quality of the compiled DL models. A recent bug study shows that the optimization of high-level intermediate representation (IR) is the most error-prone compilation stage: bugs in this stage account for 44.92% of all bugs collected in that study. However, existing testing techniques do not consider features related to high-level optimization (e.g., high-level IR) and are therefore weak at exposing bugs in this stage. To bridge this gap, we propose HirGen, an automated testing technique that aims to effectively expose coding mistakes in the optimization of high-level IR. The design of HirGen includes 1) three coverage criteria to generate diverse and valid computational graphs; 2) full use of the language features of high-level IRs to generate diverse IRs; and 3) three test oracles inspired by both differential testing and metamorphic testing. HirGen has successfully detected 21 bugs in TVM, of which 17 have been confirmed and 12 fixed. Further, we construct four baselines using the state-of-the-art DL compiler fuzzers that can cover the high-level optimization stage. Our experimental results show that, within 48 hours, HirGen detects 10 crashes and inconsistencies that the baselines cannot detect. We further validate the usefulness of our proposed coverage criteria and test oracles in the evaluation.
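To make the two oracle families named in the abstract concrete, the following is a minimal sketch of a differential oracle and a metamorphic oracle applied to a toy optimization pass. Everything in it is an assumption made for illustration: the tuple-based IR, the `simplify` pass, and the function names `evaluate`, `differential_oracle`, and `metamorphic_oracle` are hypothetical and are not TVM's Relay, TVM's passes, or HirGen's actual oracles.

```python
import math

# Hypothetical toy "high-level IR": nested tuples such as
# ("add", ("var", "x"), ("const", 0.0)). Illustrative stand-in only;
# this is not TVM's Relay and not HirGen's generator.


def evaluate(expr, env):
    """Reference interpreter for the toy IR."""
    op = expr[0]
    if op == "const":
        return expr[1]
    if op == "var":
        return env[expr[1]]
    lhs, rhs = evaluate(expr[1], env), evaluate(expr[2], env)
    if op == "add":
        return lhs + rhs
    if op == "mul":
        return lhs * rhs
    raise ValueError(f"unknown op: {op}")


def simplify(expr):
    """Toy optimization pass under test: rewrites e + 0 and e * 1 to e."""
    if expr[0] in ("const", "var"):
        return expr
    op, lhs, rhs = expr[0], simplify(expr[1]), simplify(expr[2])
    if op == "add" and rhs == ("const", 0.0):
        return lhs
    if op == "mul" and rhs == ("const", 1.0):
        return lhs
    return (op, lhs, rhs)


def differential_oracle(expr, env, tol=1e-9):
    """The unoptimized and optimized programs must agree on the same input."""
    expected = evaluate(expr, env)
    actual = evaluate(simplify(expr), env)
    return math.isclose(expected, actual, rel_tol=tol, abs_tol=tol)


def metamorphic_oracle(expr, env, tol=1e-9):
    """A semantics-preserving mutation (e -> e * 1) must not change the result
    computed after optimization."""
    mutated = ("mul", expr, ("const", 1.0))
    original_out = evaluate(simplify(expr), env)
    mutated_out = evaluate(simplify(mutated), env)
    return math.isclose(original_out, mutated_out, rel_tol=tol, abs_tol=tol)


if __name__ == "__main__":
    program = ("add", ("mul", ("var", "x"), ("const", 1.0)), ("const", 0.0))
    inputs = {"x": 3.5}
    print("differential oracle holds:", differential_oracle(program, inputs))
    print("metamorphic oracle holds:", metamorphic_oracle(program, inputs))
```

In this sketch a failing oracle would point to a miscompilation in `simplify`. A real setup would replace the toy interpreter and pass with an actual compiler pipeline and would compare tensors rather than scalars.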