Active Code Learning: Benchmarking Sample-Efficient Training of Code Models

The costly human effort required to prepare training data for machine learning (ML) models hinders their practical development and use in software engineering (ML4Code), especially for teams with limited budgets. Efficiently training code models with less human effort has therefore become an emerging problem. Active learning addresses this issue by allowing developers to train a model on less labeled data while still reaching the desired performance, and it has been well studied in computer vision and natural language processing. Unfortunately, no existing work explores the effectiveness of active learning for code models. In this paper, we bridge this gap by building the first benchmark to study this critical problem: active code learning. Specifically, we collect 11 acquisition functions~(which are used for data selection in active learning) from existing works and adapt them for code-related tasks. We then conduct an empirical study to check whether these acquisition functions maintain their performance on code data. The results demonstrate that feature selection highly affects active learning and that using output vectors to select data is the best choice. For the code summarization task, active code learning is ineffective: it produces models with a gap of over 29.64\% compared to the expected performance. Furthermore, we explore future directions of active code learning with an exploratory study. We propose replacing distance calculation methods with evaluation metrics and find a correlation between these evaluation-based distance methods and the performance of code models.
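To make the notion of an acquisition function concrete, the following minimal sketch ranks unlabeled samples by the entropy of the model's output vector and picks the most uncertain ones for labeling. Entropy-based selection is used here only as a common illustrative choice; the abstract does not name the 11 functions in the benchmark, so the function names, the example probabilities, and the budget below are all assumptions.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_most_uncertain(outputs: Sequence[Sequence[float]], budget: int) -> List[int]:
    """Return indices of the `budget` unlabeled samples with the highest output entropy."""
    ranked = sorted(range(len(outputs)), key=lambda i: entropy(outputs[i]), reverse=True)
    return ranked[:budget]


# Hypothetical softmax outputs of a code classifier on four unlabeled snippets.
unlabeled_outputs = [
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.34, 0.33, 0.33],  # nearly uniform, highest entropy
    [0.70, 0.20, 0.10],
    [0.50, 0.49, 0.01],
]

# Indices of the samples that would be sent to human annotators in this round.
to_label = select_most_uncertain(unlabeled_outputs, budget=2)
print(to_label)
```

In a full active learning loop, the selected samples would be labeled, added to the training set, and the model retrained before the next selection round. This sketch covers only the selection step and operates on output vectors, the feature type the abstract reports as working best.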
