Code completion aims to enhance programming productivity by predicting potential code based on the current programming context. Recently, pretrained language models (LMs) have become prominent in this field. Various approaches have been proposed to fine-tune LMs for code completion using supervised fine-tuning (SFT) techniques. However, the inherent exposure bias of these models can cause errors to accumulate early in sequence completion, leading to even more errors in subsequent completions. To address this problem, deep reinforcement learning (DRL) offers an alternative technique for fine-tuning LMs for code completion, one that can improve their generalization capabilities and overall performance. Nevertheless, integrating DRL-based strategies into code completion faces two major challenges: 1) The dynamic nature of the code context requires the completion model to adapt quickly to changes, which poses difficulties for conventional DRL strategies that rely on delayed rewards computed only on the final code state. 2) It is difficult to evaluate the correctness of partial code, so reward redistribution-based strategies cannot be directly adapted to code completion. To tackle these challenges, we propose IRCoCo, a code completion-specific DRL-based fine-tuning framework. This framework is designed to provide immediate rewards as feedback for detecting the dynamic context changes arising from continuous edits during code completion. With the aid of immediate feedback, the fine-tuned LM gains a more precise understanding of the current context, enabling it to adjust its predictions effectively and optimize code completion in a more refined manner. Experimental results demonstrate that fine-tuning pretrained LMs with IRCoCo leads to significant improvements on the code completion task, outperforming both SFT-based and other DRL-based baselines.
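To make the contrast between delayed and immediate rewards concrete, below is a minimal sketch, not the IRCoCo implementation. It fine-tunes a toy autoregressive LM with REINFORCE, where a hypothetical per-token reward (token-level agreement with a reference completion, standing in for a learned reward model) credits each step as soon as it occurs; the `r_delayed` variant concentrates the same total signal at the final step, illustrating the delayed-reward setting the abstract argues against. All names (`ToyLM`, `immediate_rewards`, discount `GAMMA`) are illustrative assumptions.

```python
# Hedged sketch (assumptions, not the paper's method): REINFORCE fine-tuning
# of a toy autoregressive LM with immediate per-token rewards, versus a
# delayed reward applied only at the final completion step.
import torch
import torch.nn as nn

VOCAB, HIDDEN, GAMMA = 16, 32, 0.99

class ToyLM(nn.Module):
    """A tiny autoregressive policy: embedding -> GRU -> vocabulary logits."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB)

    def forward(self, tokens, h=None):
        x, h = self.rnn(self.emb(tokens), h)
        return self.out(x), h

def sample_completion(model, prompt, steps):
    """Roll out the policy token by token, keeping per-step log-probs."""
    log_probs, h, inp, tokens = [], None, prompt, prompt
    for _ in range(steps):
        logits, h = model(inp, h)
        dist = torch.distributions.Categorical(logits=logits[:, -1])
        tok = dist.sample()
        log_probs.append(dist.log_prob(tok))
        inp = tok.unsqueeze(1)
        tokens = torch.cat([tokens, inp], dim=1)
    return tokens, torch.stack(log_probs, dim=1)  # (batch, steps)

def immediate_rewards(completion, reference):
    """Hypothetical per-token reward: +1 when the sampled token matches the
    reference at that position (a stand-in for a learned reward model)."""
    return (completion == reference).float()

def reinforce_loss(log_probs, rewards):
    """Policy-gradient loss with discounted returns-to-go, so every token is
    credited as soon as its reward arrives (immediate feedback)."""
    returns = torch.zeros_like(rewards)
    running = torch.zeros(rewards.size(0))
    for t in reversed(range(rewards.size(1))):
        running = rewards[:, t] + GAMMA * running
        returns[:, t] = running
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    return -(log_probs * returns).sum(dim=1).mean()

model = ToyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
prompt = torch.randint(VOCAB, (4, 3))       # toy "code context"
reference = torch.randint(VOCAB, (4, 8))    # toy ground-truth completion

completion, log_probs = sample_completion(model, prompt, steps=8)
r_immediate = immediate_rewards(completion[:, 3:], reference)

# Delayed-reward baseline: the same total signal, but paid out only at the
# last step, so early tokens receive discounted, indiscriminate credit.
r_delayed = torch.zeros_like(r_immediate)
r_delayed[:, -1] = r_immediate.sum(dim=1)

loss = reinforce_loss(log_probs, r_immediate)  # swap in r_delayed to compare
opt.zero_grad()
loss.backward()
opt.step()
```

Under `r_delayed`, the return-to-go at step t collapses to a single discounted terminal value, so every token in the completion is reinforced or penalized uniformly; the per-token variant instead lets credit assignment track the evolving context, which is the behavior the abstract attributes to immediate rewards.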