Encoding Version History Context for Better Code Representation

From arXiv: 5 pages (plus 1 for references), 1 figure, 3 tables. The paper was accepted to the 21st International Conference on Mining Software Repositories (MSR 2024).

With the exponential growth of AI tools that generate source code, understanding software has become crucial. When developers comprehend a program, they may refer to additional contexts to look for information, e.g., program documentation or historical code versions. Therefore, we argue that encoding this additional contextual information could also benefit code representation for deep learning. Recent papers incorporate contextual data (e.g., call hierarchy) into vector representations to address program comprehension problems. This motivates further studies to explore additional contexts, such as version history, to enhance models' understanding of programs. That is, insights from version history enable recognition of patterns in code evolution over time, recurring issues, and the effectiveness of past solutions. Our paper presents preliminary evidence of the potential benefit of encoding contextual information from the version history to predict code clones and perform code classification. We experiment with two representative deep learning models, ASTNN and CodeBERT, to investigate whether combining additional contexts with different aggregations may benefit downstream activities. The experimental results affirm the positive impact of combining version history into source code representation in all scenarios; however, to ensure the technique performs consistently, we need to conduct a holistic investigation on a larger code base using different combinations of contexts, aggregations, and models. Therefore, we propose a research agenda aimed at exploring various aspects of encoding additional context to improve code representation and its optimal utilisation in specific situations.
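To make the idea of "combining additional contexts with different aggregations" concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not name the aggregation functions, so concatenation, element-wise max, and element-wise mean are assumed here, and `mean_pool` and `combine` are hypothetical helper names. The vectors stand in for embeddings that a model such as ASTNN or CodeBERT might produce for the current code and for its earlier versions.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def combine(code_vec: Vector,
            history_vecs: Sequence[Vector],
            aggregation: str = "concat") -> Vector:
    """Combine the current code embedding with its version-history context.

    The historical embeddings are first pooled into one vector; that vector
    is then merged with the current embedding. All three aggregation
    choices are assumptions made for illustration.
    """
    if history_vecs:
        history = mean_pool(history_vecs)
    else:
        # No recorded history: use a zero vector so output sizes stay stable.
        history = [0.0] * len(code_vec)

    if aggregation == "concat":
        # Keeps both signals side by side; doubles the dimensionality.
        return list(code_vec) + history
    if aggregation == "max":
        # Element-wise maximum; dimensionality is unchanged.
        return [max(c, h) for c, h in zip(code_vec, history)]
    if aggregation == "mean":
        # Element-wise average of current and historical signals.
        return [(c + h) / 2 for c, h in zip(code_vec, history)]
    raise ValueError(f"unknown aggregation: {aggregation}")


if __name__ == "__main__":
    current = [1.0, 2.0]
    past_versions = [[3.0, 0.0], [1.0, 2.0]]
    for mode in ("concat", "max", "mean"):
        print(mode, combine(current, past_versions, mode))
```

The combined vector would then feed a downstream head for clone detection or code classification. Concatenation preserves both signals but changes the input size of that head, while max and mean keep the original dimensionality at the cost of blending the two sources.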
