The rising popularity of Large Language Models (LLMs) has motivated exploring their use in code-related tasks. Code LLMs, with millions of parameters or more, are trained on massive amounts of code in different Programming Languages (PLs). Such models are used to automate various Software Engineering (SE) tasks through prompt engineering. However, because files in industry-scale projects can be very large, a major limitation of these LLMs is their limited context window size, which raises the question: "Can these LLMs process very large files, and can we effectively perform prompt engineering on them?" Code translation aims to convert source code from one PL to another. In this work, we assess the effect of method-level program decomposition on the context window of LLMs and investigate how this approach can enable the translation of very large files that previously could not be translated because they exceeded the context window. Our observations from 20 well-known Java projects and approximately 60K methods suggest that method-level program decomposition significantly mitigates the limited context window problem of LLMs, by 99.5%. Furthermore, our empirical analysis indicates that, with method-level decomposition, each input fragment consumes on average only 5% of the context window, leaving more context space for prompt engineering and the output. Finally, we investigate the effectiveness of a Call Graph (CG) approach for translating very large files under method-level program decomposition.
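To make the two ideas in the abstract concrete, the following is a minimal Python sketch, not the procedure used in this work. It assumes the methods have already been extracted from a Java file, approximates token counts by whitespace splitting rather than using a real tokenizer, and uses a hypothetical context window of 4096 tokens. The names `estimate_tokens`, `window_usage`, and `translation_order` are illustrative, and translating callees before their callers is only one plausible way a call graph could guide the order.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical context window size in tokens; real models differ.
CONTEXT_WINDOW = 4096


def estimate_tokens(code: str) -> int:
    """Crude token estimate: count whitespace-separated pieces.

    This is an assumption for illustration, not a real LLM tokenizer.
    """
    return len(code.split())


def window_usage(code: str, window: int = CONTEXT_WINDOW) -> float:
    """Fraction of the context window that one input fragment would occupy."""
    return estimate_tokens(code) / window


def translation_order(call_graph: dict[str, set[str]]) -> list[str]:
    """Order methods so that each callee appears before its callers.

    call_graph maps a method name to the set of methods it calls.
    Recursive or mutually recursive methods form a cycle, in which case
    this sketch simply falls back to the original key order.
    """
    try:
        return list(TopologicalSorter(call_graph).static_order())
    except CycleError:
        return list(call_graph)


if __name__ == "__main__":
    # Hypothetical, already-extracted method fragments from one Java file.
    methods = {
        "helper": "int helper(int x) { return x + 1; }",
        "main": "void main() { System.out.println(helper(1)); }",
    }
    call_graph = {"main": {"helper"}, "helper": set()}

    whole_file = "\n".join(methods.values())
    print(f"whole file: {window_usage(whole_file):.2%} of the window")
    for name in translation_order(call_graph):
        print(f"{name}: {window_usage(methods[name]):.2%} of the window")
```

Each fragment occupies a smaller share of the window than the whole file does, which is the effect the abstract quantifies. The space left over is what remains available for the prompt and the translated output.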