Retrieval-augmented generation (RAG) has emerged as one of the most effective approaches for code completion, particularly when context from the surrounding repository is essential. However, incorporating this context significantly extends the sequence length, leading to slower inference, a critical limitation for interactive settings such as IDEs. In this work, we introduce LlavaCode, a framework that compresses code into compact, semantically rich representations interpretable by the code LLM, enhancing generation quality while reducing the retrieved context to only a few compressed single-token vectors. Using a small projector module, we can significantly increase the Exact Match (EM) and Edit Similarity (ES) metrics of the coding model with negligible latency overhead. Our experiments demonstrate that compressed context enables a 20-38% reduction in Time-to-First-Token (TTFT) on line completion tasks compared to full-RAG pipelines.
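To make the idea concrete, the following is a minimal NumPy sketch of the kind of projector the abstract describes: retrieved-context encoder states are pooled into a handful of vectors and mapped by a small two-layer network into the code LLM's embedding space, so the LLM sees a few compressed "tokens" instead of hundreds of context tokens. The segment pooling, layer sizes, and function names here are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def projector(encoder_states, W1, W2, num_tokens=4):
    """Compress retrieved-context encoder states of shape (T, d_enc)
    into num_tokens vectors in the LLM embedding space (num_tokens, d_llm).

    Pooling strategy and MLP shape are assumptions for illustration only.
    """
    # Mean-pool the T encoder states into num_tokens contiguous segments.
    segments = np.array_split(encoder_states, num_tokens)
    pooled = np.stack([seg.mean(axis=0) for seg in segments])  # (num_tokens, d_enc)
    # Small two-layer MLP mapping into the LLM embedding dimension.
    hidden = np.maximum(pooled @ W1, 0.0)  # ReLU
    return hidden @ W2                     # (num_tokens, d_llm)

# Hypothetical dimensions: encoder width, projector hidden width, LLM embed width.
d_enc, d_hid, d_llm = 256, 512, 1024
W1 = rng.standard_normal((d_enc, d_hid)) * 0.02
W2 = rng.standard_normal((d_hid, d_llm)) * 0.02

# 300 retrieved-context token states collapse into 4 compressed vectors,
# which would be prepended to the completion prompt in place of the raw context.
enc_states = rng.standard_normal((300, d_enc))
compressed = projector(enc_states, W1, W2)
print(compressed.shape)
```

Because the prefill cost of a decoder-only LLM grows with prompt length, replacing hundreds of raw context tokens with a few projected vectors is what drives the TTFT reduction reported above.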