IRCoCo: Immediate Rewards-Guided Deep Reinforcement Learning for Code Completion

Code completion aims to enhance programming productivity by predicting potential code based on the current programming context. Recently, pretrained language models (LMs) have become prominent in this field, and various approaches have been proposed to fine-tune LMs for code completion using supervised fine-tuning (SFT). However, the inherent exposure bias of these models can cause errors made early in a completion to accumulate, leading to even more errors in subsequent completions. Deep reinforcement learning (DRL) is an alternative technique for fine-tuning LMs for code completion that can address this problem and improve generalization and overall performance. Nevertheless, integrating DRL-based strategies into code completion faces two major challenges: 1) The dynamic nature of the code context requires the completion model to adapt quickly to changes, which is difficult for conventional DRL strategies that rely on a delayed reward computed from the final code state. 2) It is difficult to evaluate the correctness of partial code, so reward redistribution-based strategies cannot be adapted to code completion. To tackle these challenges, we propose IRCoCo, a DRL-based fine-tuning framework specific to code completion. The framework provides immediate rewards as feedback for detecting dynamic context changes arising from continuous edits during code completion. With this immediate feedback, the fine-tuned LM gains a more precise understanding of the current context, which enables effective adjustment of the LM and more fine-grained optimization of code completion. Experimental results demonstrate that fine-tuning pretrained LMs with IRCoCo leads to significant improvements on the code completion task, outperforming both SFT-based and other DRL-based baselines.
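The contrast between a delayed reward and immediate rewards can be illustrated with a toy return computation. The sketch below is not the IRCoCo implementation: the function names, the per-step scores, and the assumption that some evaluator can assign a quality score to each partial completion are all hypothetical and chosen only for illustration. It shows how a single terminal reward gives every generated token the same learning signal (with discount factor 1), whereas per-step rewards give each token a different return.

```python
from typing import List


def returns_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Discounted return G_t = r_t + gamma * G_{t+1}, computed right to left."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def delayed_rewards(final_score: float, length: int) -> List[float]:
    """Delayed scheme: zero reward at every step except the last one."""
    return [0.0] * (length - 1) + [final_score]


def immediate_rewards(step_scores: List[float]) -> List[float]:
    """Immediate scheme: each step is rewarded with its own score.

    step_scores are hypothetical outputs of an evaluator that rates the
    partial completion after each generated token (an assumption of this
    sketch, not a detail taken from the abstract).
    """
    return list(step_scores)


# Hypothetical scores for a 4-token completion whose quality drops at token 3.
step_scores = [1.0, 0.75, 0.25, 0.25]

delayed = returns_to_go(delayed_rewards(step_scores[-1], len(step_scores)))
immediate = returns_to_go(immediate_rewards(step_scores))

print(delayed)    # [0.25, 0.25, 0.25, 0.25]
print(immediate)  # [2.25, 1.25, 0.5, 0.25]
```

With the delayed scheme, every token receives the same return, so the update cannot tell which token caused the drop in quality. With per-step rewards, the return falls sharply at the third token, which is the kind of fine-grained signal the abstract attributes to immediate feedback.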
