Checker bugs in Deep Learning (DL) libraries are critical yet under-explored. These bugs are often concealed in the input-validation and error-checking code of DL libraries and can lead to silent failures, incorrect results, or unexpected program behavior in DL applications. Despite their potential to significantly impact the reliability and performance of DL-enabled systems built with these libraries, checker bugs have received limited attention. We present the first comprehensive study of DL checker bugs in two widely used DL libraries, TensorFlow and PyTorch. We first automatically collected a dataset of 2,418 commits from the TensorFlow and PyTorch repositories on GitHub, spanning Sept. 2016 to Dec. 2023, using keywords related to checker bugs. Through manual inspection, we identified 527 DL checker bugs. We then analyzed these bugs from three perspectives: root causes, symptoms, and fixing patterns. Using the knowledge gained from the root cause analysis, we further propose TensorGuard, a proof-of-concept LLM-based tool that uses retrieval-augmented generation (RAG) and a series of engineered ChatGPT prompts to detect and fix checker bugs in DL libraries. We evaluated TensorGuard on a test dataset of 92 buggy and 135 clean checker-related changes in TensorFlow and PyTorch from January 2024 to July 2024. Our results demonstrate that TensorGuard achieves high average recall (94.51\%) with Chain-of-Thought prompting and a balanced trade-off between precision and recall with Zero-Shot and Few-Shot prompting. For patch generation, TensorGuard achieves an accuracy of 11.1\%, outperforming the state-of-the-art bug-repair baseline by 2\%. We also applied TensorGuard to the checker-related changes of Google's JAX library from the latest six months (493 changes), detecting 64 new checker bugs.