Large Language Models for code often incur substantial computational cost, which grows rapidly with the length of the input code sequence. We propose LeanCode, a code-simplification method that reduces training and prediction time by leveraging code context and using attention scores to represent token importance. We advocate selectively removing tokens based on average context-aware attention scores rather than averages computed across all inputs. For classification tasks such as code search, LeanCode uses the attention scores of the `CLS' token within the encoder; for sequence-to-sequence tasks such as code summarization, it employs encoder-decoder attention scores to determine token significance. Our evaluation shows LeanCode's superiority over the state-of-the-art approaches DietCode and Slimcode, with improvements of 60% and 16% for code search, and 29% and 27% for code summarization, respectively.
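The selection step described above, ranking code tokens by the attention the `CLS' token pays to them and keeping only the highest-scoring fraction, can be sketched as follows. This is a minimal illustration under assumed inputs (per-token attention scores already averaged over heads and layers); the function name, score values, and keep ratio are hypothetical, not LeanCode's actual API.

```python
def prune_by_cls_attention(tokens, cls_scores, keep_ratio=0.5):
    """Keep the fraction of tokens that receive the most [CLS] attention,
    preserving their original order. A sketch of attention-based pruning,
    not LeanCode's exact algorithm."""
    n_keep = max(1, int(len(tokens) * keep_ratio))
    # rank token indices by descending attention score
    ranked = sorted(range(len(tokens)), key=lambda i: cls_scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore source order
    return [tokens[i] for i in kept]

# toy example: attention paid by [CLS] to each token (hypothetical values)
tokens = ["[CLS]", "public", "int", "add", "(", "a", ",", "b", ")"]
cls_scores = [0.30, 0.15, 0.02, 0.25, 0.05, 0.10, 0.01, 0.09, 0.03]
pruned = prune_by_cls_attention(tokens, cls_scores, keep_ratio=0.5)
# keeps the 4 highest-scoring tokens in source order
```

For sequence-to-sequence tasks, the same ranking would instead use encoder-decoder attention scores aggregated over decoding steps.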