Deep learning is widely used to uncover hidden patterns in large code corpora. Doing so requires a representation that captures the relevant characteristics and features of source code. Graph-based representations have gained attention for their ability to model structural and semantic information. However, existing tools lack flexibility in constructing graphs across different programming languages, which limits their use. Additionally, the output of these tools often lacks interoperability and results in excessively large graphs, which makes training graph-based neural networks slower and less scalable. We introduce CONCORD, a domain-specific language for building customizable graph representations. It implements reduction heuristics to reduce the size complexity of the graphs. We demonstrate its effectiveness with code smell detection as an illustrative use case and show that, first, CONCORD can produce code representations automatically according to the specified configuration and, second, our heuristics can achieve comparable performance with significantly smaller graphs. CONCORD will help researchers a) create and experiment with customizable graph-based code representations for different software engineering tasks involving deep learning (DL), b) reduce the engineering work needed to generate graph representations, c) address the issue of scalability in graph neural network (GNN) models, and d) enhance the reproducibility of experiments in research through a standardized approach to code representation and analysis.
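To make the idea of a size-reducing heuristic concrete, the following is a minimal sketch in Python. It is not CONCORD's DSL or one of its actual heuristics; the function names `build_ast_graph` and `drop_node_types`, and the choice of node types to drop, are assumptions made purely for illustration. The sketch builds a plain AST graph with Python's standard `ast` module and then removes nodes of selected types, reconnecting each removed node's parent to its nearest surviving descendants.

```python
import ast


def build_ast_graph(source):
    """Build a tree-shaped graph from Python source (illustrative helper).

    Returns (nodes, edges): nodes maps an integer id to the AST node type
    name, and edges is a set of (parent_id, child_id) pairs.
    """
    nodes = {}
    edges = set()

    def visit(node):
        node_id = len(nodes)  # fresh id for every visit
        nodes[node_id] = type(node).__name__
        for child in ast.iter_child_nodes(node):
            edges.add((node_id, visit(child)))
        return node_id

    visit(ast.parse(source))
    return nodes, edges


def drop_node_types(nodes, edges, dropped_types):
    """Hypothetical reduction heuristic: remove nodes whose type is in
    dropped_types and link each surviving node to its nearest surviving
    descendants, so no path between kept nodes is lost."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)

    def kept_descendants(node_id):
        result = []
        for child in children.get(node_id, []):
            if nodes[child] in dropped_types:
                # Skip over the dropped node and look further down.
                result.extend(kept_descendants(child))
            else:
                result.append(child)
        return result

    kept_nodes = {i: t for i, t in nodes.items() if t not in dropped_types}
    kept_edges = {(p, c) for p in kept_nodes for c in kept_descendants(p)}
    return kept_nodes, kept_edges


if __name__ == "__main__":
    nodes, edges = build_ast_graph("x = a + b\nprint(x)\n")
    # Assumed low-information node types, chosen only for this example.
    small_nodes, small_edges = drop_node_types(
        nodes, edges, {"Load", "Store", "Expr"}
    )
    print(len(nodes), "nodes ->", len(small_nodes), "nodes")
    print(len(edges), "edges ->", len(small_edges), "edges")
```

Whether dropping a given node type preserves task performance is an empirical question; the sketch only shows the mechanics of shrinking a graph while keeping the remaining nodes connected.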