Silent Bugs in Deep Learning Frameworks: An Empirical Study of Keras and TensorFlow

Deep Learning (DL) frameworks are now widely used, simplifying the creation of complex models and their integration into various applications, even for non-DL experts. However, like any other program, they are prone to bugs. This paper focuses on a subcategory known as silent bugs: bugs that lead to wrong behavior but cause no system crash or hang and show no error message to the user. Such bugs are even more dangerous in DL applications and frameworks because of the "black-box" and stochastic nature of these systems (the end user cannot understand how the model makes decisions). This paper presents the first empirical study of silent bugs in Keras and TensorFlow and of their impact on users' programs. We extracted closed issues related to Keras from the TensorFlow GitHub repository. Of the 1,168 issues we gathered, 77 were reproducible silent bugs affecting users' programs. Using information from the issue reports, we categorized the bugs by their effects on users' programs and by the components in which they occurred. We then derived a threat level for each issue, based on its impact on users' programs. To assess the relevance of the identified categories and the impact scale, we conducted an online survey with 103 DL developers. The participants generally agreed that silent bugs in DL libraries have a significant impact and acknowledged our findings (i.e., the categories of silent bugs and the proposed impact scale). Finally, leveraging our analysis, we provide a set of guidelines to help safeguard against such bugs in DL frameworks.
