Recently, pre-trained programming language models such as CodeBERT have demonstrated substantial gains in code search. Despite their strong performance, they rely on large amounts of parallel data to fine-tune the semantic mappings between queries and code. This restricts their practicality for domain-specific languages, where such data is scarce and expensive to obtain. In this paper, we propose CroCS, a novel approach for domain-specific code search. CroCS employs a transfer learning framework in which an initial program representation model is pre-trained on a large corpus of common programming languages (such as Java and Python) and is then adapted to domain-specific languages such as SQL and Solidity. Unlike cross-language CodeBERT, which is directly fine-tuned on the target language, CroCS adapts a few-shot meta-learning algorithm called MAML to learn a good initialization of the model parameters, one that can be reused effectively in a domain-specific language. We evaluate the proposed approach on two domain-specific languages, namely SQL and Solidity, with models transferred from two widely used languages (Python and Java). Experimental results show that CroCS significantly outperforms conventional pre-trained code models that are directly fine-tuned on domain-specific languages, and that it is particularly effective when training data is scarce.
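The abstract gives no implementation details, so the following is only a toy sketch of the MAML idea it refers to: learning an initialization that performs well after one gradient step on a small support set from a new task. It uses a hypothetical family of one-dimensional linear regression tasks in place of code search, and every function name, hyperparameter, and the task family itself is an illustrative assumption, not the CroCS implementation.

```python
import random


def loss(theta, data):
    """Mean squared error of the linear model y = w*x + b on (x, y) pairs."""
    w, b = theta
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def grad(theta, data):
    """Gradient of the MSE loss with respect to (w, b)."""
    w, b = theta
    n = len(data)
    gw = sum(2 * (w * x + b - y) * x for x, y in data) / n
    gb = sum(2 * (w * x + b - y) for x, y in data) / n
    return (gw, gb)


def adapt(theta, support, inner_lr):
    """Inner loop: one gradient step on the task's small support set."""
    gw, gb = grad(theta, support)
    return (theta[0] - inner_lr * gw, theta[1] - inner_lr * gb)


def meta_grad(theta, support, query, inner_lr):
    """Gradient of the post-adaptation query loss w.r.t. the initialization.

    By the chain rule this is (I - inner_lr * H) applied to the query
    gradient at the adapted parameters, where H is the Hessian of the
    support loss (constant for a linear model with MSE loss).
    """
    n = len(support)
    hxx = sum(2 * x * x for x, _ in support) / n
    hx1 = sum(2 * x for x, _ in support) / n
    h11 = 2.0
    qw, qb = grad(adapt(theta, support, inner_lr), query)
    return (
        (1 - inner_lr * hxx) * qw - inner_lr * hx1 * qb,
        -inner_lr * hx1 * qw + (1 - inner_lr * h11) * qb,
    )


def sample_task(rng, n_support=5, n_query=10):
    """Hypothetical task family: lines with slope near 2 and intercept near -1."""
    a = 2.0 + rng.uniform(-0.5, 0.5)
    c = -1.0 + rng.uniform(-0.5, 0.5)

    def draw(n):
        xs = [rng.uniform(-1, 1) for _ in range(n)]
        return [(x, a * x + c) for x in xs]

    return draw(n_support), draw(n_query)


def meta_train(steps=2000, tasks_per_step=4, inner_lr=0.1, outer_lr=0.05, seed=0):
    """Outer loop: update the shared initialization across sampled tasks."""
    rng = random.Random(seed)
    theta = (0.0, 0.0)
    for _ in range(steps):
        gw = gb = 0.0
        for _ in range(tasks_per_step):
            support, query = sample_task(rng)
            dw, db = meta_grad(theta, support, query, inner_lr)
            gw += dw / tasks_per_step
            gb += db / tasks_per_step
        theta = (theta[0] - outer_lr * gw, theta[1] - outer_lr * gb)
    return theta


if __name__ == "__main__":
    init = meta_train()
    support, query = sample_task(random.Random(123))
    print("query loss before adaptation:", loss(init, query))
    print("query loss after adaptation: ", loss(adapt(init, support, 0.1), query))
```

In this sketch the inner step plays the role of fine-tuning on a handful of target-domain examples, while the outer loop shapes the initialization so that this small amount of fine-tuning is effective. Plain fine-tuning, by contrast, would optimize the loss on the training tasks directly, without accounting for the later adaptation step.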