Retrieval-Augmented Generation (RAG) enhances coding tasks by incorporating retrieved code examples into prompts. However, lengthy prompts, often exceeding tens of thousands of tokens, introduce challenges related to the limited context windows of language models (LMs) and high computational costs. Existing prompt compression techniques focus on natural language and lack solutions tailored to code. To address this gap, we propose CodePromptZip, a framework that compresses code examples before integrating them into RAG workflows. Our framework employs a type-aware, priority-driven strategy to construct training samples for a code compression model. Using program analysis, we identify token types (e.g., Identifier) and perform an ablation analysis to rank their removal priorities by their impact on task performance. We then train a small LM as the compressor on these samples, enabling flexible compression conditioned on specified ratios while minimizing performance degradation. Specifically, the compressor is augmented with a copy mechanism, allowing tokens to be copied directly from the original code snippets. Evaluation results show that CodePromptZip surpasses SOTA entropy-based and distillation-based baselines, improving over the best baseline by 23.4%, 28.7%, and 8.7% for Assertion Generation, Bugs2Fix, and Code Suggestion, respectively.
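The type-aware, priority-driven removal strategy described above can be sketched minimally as follows. This is an illustrative toy, not the paper's implementation: the token classifier is a crude regex lexer, the keyword set is a small placeholder, and the removal-priority ordering is a hypothetical example of what an ablation analysis might produce for one task.

```python
import re

# Hypothetical removal priorities: types with the least impact on task
# performance (per a per-task ablation) are dropped first. Illustrative only.
REMOVAL_PRIORITY = ["comment", "whitespace", "identifier", "symbol", "keyword"]

# Placeholder keyword set for the toy lexer.
KEYWORDS = {"public", "void", "return", "if", "else", "int"}

def classify(token: str) -> str:
    """Crude token-type classifier standing in for real program analysis."""
    if token.startswith("//"):
        return "comment"
    if token.isspace():
        return "whitespace"
    if token in KEYWORDS:
        return "keyword"
    if re.fullmatch(r"[A-Za-z_]\w*", token):
        return "identifier"
    return "symbol"

def compress(code: str, ratio: float) -> str:
    """Greedily remove whole token types in priority order until the
    kept-token count fits within ratio * original count."""
    tokens = re.findall(r"//[^\n]*|\s+|\w+|[^\w\s]", code)
    budget = int(len(tokens) * ratio)
    kept = list(tokens)
    for ttype in REMOVAL_PRIORITY:
        if len(kept) <= budget:
            break
        kept = [t for t in kept if classify(t) != ttype]
    return "".join(kept)
```

A trained compressor would make these decisions token-by-token (with a copy mechanism over the source snippet) rather than dropping entire types wholesale; the sketch only conveys how a ratio budget interacts with type-level priorities.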