Studying the patterns that models have learned has long been a focus of pattern recognition research. Explaining what patterns are discovered from training data, and how those patterns generalize to unseen data, is instrumental to understanding and advancing pattern recognition methods. Unfortunately, the vast majority of application domains deal with continuous data (i.e., data that is statistical in nature), from which the extracted patterns cannot be formally defined. For example, in image classification, there is no principled definition of the label "cat" or "dog". Even in natural language, the meaning of a word can vary with its surrounding context. Unlike these data formats, programs are a unique data structure with well-defined syntax and semantics, which creates a golden opportunity to formalize what models have learned from source code. This paper presents the first formal definition of patterns discovered by code summarization models (i.e., models that predict the name of a method given its body), and gives a sound algorithm to infer a context-free grammar (CFG) that formally describes the learned patterns. We realize our approach in PATIC, which produces CFGs that summarize the patterns discovered by code summarization models. In particular, we pick two prominent instances, code2vec and code2seq, to evaluate PATIC. PATIC shows that the patterns extracted by each model are heavily restricted to local, syntactic code structures with little to no semantic implication. Based on these findings, we present two example uses of the formal definition of patterns: a new method for evaluating the robustness of code summarization models and a new technique for improving their accuracy. Our work opens up an exciting new direction: studying what models have learned from source code.