Large language models (LLMs) are finding wide use in software engineering practice. These models are extremely data-hungry and are largely trained on open-source software (OSS) code distributed under permissive licenses. In actual use, however, a great deal of software development still occurs in the for-profit, proprietary sphere, where the code under development is not, and never has been, in the public domain; thus, many developers do their work, and use LLMs, in settings where the models may be less familiar with the code under development. In such settings, do LLMs work as well as they do for OSS code? If not, what are the differences? When performance differs, what are the possible causes, and are there workarounds? In this paper, we examine this issue using proprietary, closed-source software data from Microsoft, where most proprietary code is in C# and C++. We find that performance for C# changes little from OSS to proprietary code but drops significantly for C++, and that this difference is attributable to differences in identifiers. We also find that, in some cases, some of the performance degradation can be ameliorated efficiently by in-context learning.