Pre-trained large language models (LLMs) have achieved remarkable success in several domains. However, code-oriented LLMs are computationally heavy: their complexity grows quadratically with input length. To simplify the input program of an LLM, the state-of-the-art approach filters input code tokens according to the attention scores given by the LLM. However, the decision to simplify the input should not rely on the attention patterns of an LLM, as these patterns are influenced by both the model architecture and the pre-training dataset. Since the model and dataset belong to the solution domain, not the problem domain to which the input belongs, the outcome may differ when the model is pre-trained on a different dataset. We propose SlimCode, a model-agnostic code simplification solution for LLMs that depends on the nature of the input code tokens. In an empirical study of LLMs including CodeBERT, CodeT5, and GPT-4 on two main tasks, code search and code summarization, we report that 1) the removal ratio of code has a near-linear relation with the saving ratio in training time, 2) the impact of different token categories on code simplification can vary significantly, 3) this impact is task-specific but model-agnostic, and 4) the above findings also hold for the paradigms of prompt engineering and interactive in-context learning. The empirical results show that SlimCode improves on the state-of-the-art technique by 9.46% in MRR on code search and by 5.15% in BLEU score on code summarization. Moreover, SlimCode is 133 times faster than the state-of-the-art approach. Additionally, SlimCode can reduce the cost of invoking GPT-4 by up to 24% per API query while still producing results comparable to those obtained with the original code.