Language models can serve as a valuable tool for software developers to increase productivity. Large generative models can be used for code generation and code completion, while smaller encoder-only models are capable of performing code search tasks using natural language queries. These capabilities are heavily influenced by the quality and diversity of the available training data. Source code datasets used for training usually focus on the most popular languages, and testing is mostly conducted on the same distributions, often overlooking low-resource programming languages. Motivated by the NLP generalization taxonomy proposed by Hupkes et al., we propose a new benchmark dataset called GenCodeSearchNet (GeCS), which builds upon existing natural language code search datasets to systematically evaluate how well the programming language understanding of language models generalizes. As part of the full dataset, we introduce a new, manually curated subset, StatCodeSearch, that focuses on R, a popular but so far underrepresented programming language that is often used by researchers outside the field of computer science. For evaluation and comparison, we collect several baseline results using fine-tuned BERT-style models and GPT-style large language models in a zero-shot setting.