FT2Ra: A Fine-Tuning-Inspired Approach to Retrieval-Augmented Code Completion

The rise of pre-trained code models has significantly improved coding tasks such as code completion, as well as tools like GitHub Copilot. However, the substantial size of these models, especially the largest ones, makes fine-tuning them for specific downstream tasks a significant challenge. Retrieval-based methods have emerged as a promising alternative, augmenting model predictions without the need for fine-tuning. Despite their potential, the designs of these methods often rely on heuristics, leaving open critical questions about what information should be stored or retrieved and how that information should be interpolated to augment predictions. To tackle this challenge, we first perform a theoretical analysis of the fine-tuning process, highlighting the importance of delta logits as a catalyst for improving model predictions. Building on this insight, we develop a novel retrieval-based method, FT2Ra, which aims to mimic genuine fine-tuning. Although FT2Ra uses a retrieval-based mechanism, it adopts a paradigm with a learning rate and multi-epoch retrievals, similar to fine-tuning. In token-level completion, a relatively easy task, FT2Ra achieves a 4.29% improvement in accuracy over the best baseline method on UniXcoder. In the more challenging line-level completion task, Exact Match (EM) performance more than doubles, indicating the significant advantages of our theoretical analysis. Notably, even without actual fine-tuning, FT2Ra is competitive with models that have undergone real fine-tuning.
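The abstract names three ingredients: delta logits, a learning rate, and multi-epoch retrieval. The sketch below is only an illustration of how they could fit together, not the paper's algorithm. It assumes that a delta logit can be approximated by the negative cross-entropy gradient with respect to the logits, `one_hot(target) - softmax(logits)`. It also assumes that each epoch applies the update to local copies of the retrieved neighbours, so that later deltas shrink as they would during training. The datastore layout and the names `delta_logits` and `augment_logits` are hypothetical.

```python
import math


def softmax(logits):
    # Numerically stable softmax over a plain list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def delta_logits(logits, target):
    # Assumed proxy for a fine-tuning update: the negative gradient of
    # cross-entropy w.r.t. the logits, i.e. one_hot(target) - softmax(logits).
    delta = [-p for p in softmax(logits)]
    delta[target] += 1.0
    return delta


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def augment_logits(query_key, query_logits, datastore, k=2, lr=1.0, epochs=3):
    # datastore (hypothetical layout): list of (key_vector, base_logits, target_index).
    nearest = sorted(datastore, key=lambda e: squared_distance(query_key, e[0]))[:k]
    logits = list(query_logits)
    if not nearest:
        return logits
    neighbour_logits = [list(base) for _, base, _ in nearest]
    targets = [target for _, _, target in nearest]
    for _ in range(epochs):
        # One "epoch": recompute each neighbour's delta, average the deltas,
        # and take a step of size lr on the query's logits.
        deltas = [delta_logits(l, t) for l, t in zip(neighbour_logits, targets)]
        mean_delta = [sum(col) / len(deltas) for col in zip(*deltas)]
        logits = [x + lr * d for x, d in zip(logits, mean_delta)]
        # Apply the same step to the local neighbour copies (an assumption),
        # so the next epoch's deltas are smaller.
        neighbour_logits = [
            [x + lr * d for x, d in zip(l, dl)]
            for l, dl in zip(neighbour_logits, deltas)
        ]
    return logits


# Toy usage: vocabulary of 3 tokens; the two closest neighbours both expect token 1.
datastore = [
    ([0.0, 0.0], [2.0, 1.0, 0.0], 1),
    ([0.1, 0.0], [2.0, 1.0, 0.0], 1),
    ([5.0, 5.0], [0.0, 0.0, 3.0], 0),
]
base = [2.0, 1.0, 0.0]
augmented = augment_logits([0.05, 0.0], base, datastore)
```

In this toy setup the base model prefers token 0, while the retrieved neighbours shift the prediction towards token 1. Because each delta sums to zero, the update redistributes logit mass without changing its total.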

