On-the-fly Improving Performance of Deep Code Models via Input Denoising

Deep learning has been widely adopted to tackle various code-based tasks by building deep code models from large numbers of code snippets. Although these deep code models have achieved great success, even state-of-the-art models suffer from noise in their inputs, which leads to erroneous predictions. Models can be enhanced through retraining or fine-tuning, but this is not a once-and-for-all approach and incurs significant overhead. In particular, these techniques cannot improve the performance of (deployed) models on the fly. Input denoising techniques exist in other domains (such as image processing), but because code input is discrete and must strictly abide by complex syntactic and semantic constraints, those techniques are largely inapplicable. In this work, we propose the first input denoising technique for deep code models, CodeDenoise. Its key idea is to localize noisy identifiers in (likely) mispredicted inputs and to denoise such inputs by cleansing the located identifiers. It does not need to retrain or reconstruct the model; it only needs to cleanse inputs on the fly to improve performance. Our experiments on 18 deep code models (three pre-trained models combined with six code-based datasets) demonstrate the effectiveness and efficiency of CodeDenoise. For example, on average across all subjects, CodeDenoise successfully denoises 21.91% of mispredicted inputs and improves the accuracy of the original models by 2.04%, spending an average of 0.48 seconds on each input, and substantially outperforms the widely used fine-tuning strategy.
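The key idea stated in the abstract (flag likely mispredicted inputs, localize noisy identifiers, cleanse them without touching the model) can be illustrated with a toy Python sketch. This is not the CodeDenoise implementation, and the abstract does not say how each step is realized. Everything here is an assumption made for illustration: low top-class confidence stands in for "likely mispredicted", a brute-force search over consistent renamings stands in for both localization and cleansing, and `toy_predict`, `denoise`, the candidate names, and the 0.75 threshold are hypothetical.

```python
import keyword
import re
from typing import AbstractSet, Callable, Dict, List

# Regex-based identifier matching is a simplification: it would also match
# words inside strings and comments, which a real tool must avoid.
IDENT = re.compile(r"\b[A-Za-z_]\w*\b")


def identifiers(code: str, reserved: AbstractSet[str]) -> List[str]:
    """Distinct renameable identifiers, in order of first appearance."""
    names: List[str] = []
    for name in IDENT.findall(code):
        if keyword.iskeyword(name) or name in reserved or name in names:
            continue
        names.append(name)
    return names


def rename(code: str, old: str, new: str) -> str:
    """Rename every whole-word occurrence so the program stays consistent."""
    return re.sub(rf"\b{re.escape(old)}\b", new, code)


def denoise(
    code: str,
    predict: Callable[[str], Dict[str, float]],
    candidates: List[str],
    reserved: AbstractSet[str] = frozenset(),
    threshold: float = 0.75,  # hypothetical cut-off for a "suspicious" input
) -> str:
    """Return a renamed variant of `code` that the model is more confident on."""
    base_conf = max(predict(code).values())
    if base_conf >= threshold:
        return code  # assumed proxy: confident predictions are left alone

    used = set(IDENT.findall(code))  # every word already present in the code
    best_code, best_conf = code, base_conf
    for old in identifiers(code, reserved):
        for new in candidates:
            if new in used or keyword.iskeyword(new):
                continue  # a colliding name would change program behaviour
            variant = rename(code, old, new)
            conf = max(predict(variant).values())
            if conf > best_conf:  # this identifier was hurting the model
                best_code, best_conf = variant, conf
    return best_code


def toy_predict(code: str) -> Dict[str, float]:
    """Hypothetical stand-in for a deployed model: it just counts cue words."""
    sort_score = 1 + code.count("swap") + code.count("sort")
    search_score = 1 + code.count("find") + code.count("search")
    total = sort_score + search_score
    return {"sort": sort_score / total, "search": search_score / total}


# The parameter name `find` misleads the toy model on a swap-based routine.
snippet = "def f(find, n):\n    swap(find, n)\n    swap(find, 0)\n    return find"
cleaned = denoise(snippet, toy_predict, ["arr", "items"], reserved={"swap"})
print(cleaned)
```

In this toy run the misleading parameter name `find` is renamed to `arr`, and the stand-in model's top label changes from `search` to `sort`. The model itself is never modified, which is the property the abstract emphasizes. The `reserved` set protects names such as externally defined functions, whose renaming would alter the program.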

