Pre-trained large language models (LLMs) have achieved remarkable success in several domains. However, code-oriented LLMs are computationally heavy: their complexity grows quadratically with input length. To simplify the input program of an LLM, the state-of-the-art approach filters input code tokens according to the attention scores given by the LLM. However, the decision to simplify the input should not rely on the attention patterns of an LLM, as these patterns are influenced by both the model architecture and the pre-training dataset. Since the model and dataset belong to the solution domain, not the problem domain to which the input belongs, the outcome may differ when the model is pre-trained on a different dataset. We propose SlimCode, a model-agnostic code simplification solution for LLMs that depends on the nature of the input code tokens. In an empirical study of LLMs including CodeBERT, CodeT5, and GPT-4 on two main tasks, code search and code summarization, we report that 1) the removal ratio of code has a near-linear relation with the saving ratio in training time, 2) the impact of different token categories on code simplification can vary significantly, 3) this impact is task-specific but model-agnostic, and 4) the above findings also hold for the paradigms of prompt engineering and interactive in-context learning. The empirical results show that SlimCode improves on the state-of-the-art technique by 9.46% in MRR on code search and by 5.15% in BLEU score on code summarization. Moreover, SlimCode is 133 times faster than the state-of-the-art approach. Additionally, SlimCode can reduce the cost of invoking GPT-4 by up to 24% per API query while still producing results comparable to those obtained with the original code.