Learning-based techniques, especially advanced Large Language Models (LLMs) for code, have gained considerable popularity in various software engineering (SE) tasks. However, most existing work focuses on designing better learning-based models and pays less attention to the properties of datasets. Learning-based models, including popular LLMs for code, rely heavily on data, and the data's properties (e.g., its distribution) can significantly affect their behavior. We conduct an exploratory study of the distribution of SE data and find that such data usually follows a skewed, long-tailed distribution, in which a small number of classes have many samples while a large number of classes have very few. We investigate three distinct SE tasks and analyze the impact of the long-tailed distribution on the performance of LLMs for code. Our experimental results reveal that the long-tailed distribution substantially affects the effectiveness of LLMs for code. Specifically, LLMs for code perform between 30.0% and 254.0% worse on data samples associated with infrequent labels than on data samples associated with frequent labels. Our study provides a better understanding of the effects of long-tailed distributions on popular LLMs for code, along with insights for the future development of SE automation.