Large language models (LLMs) have become increasingly powerful and can now solve a wide range of tasks when given suitable instructions in natural language. However, the vast majority of test suites assume that the instructions are written in English, the de facto prompting language. Code intelligence and problem solving remain difficult, even for the most advanced LLMs. Currently, there are no datasets that measure how well code-generation models generalize to a language other than English. In this work, we present RoCode, a competitive programming dataset consisting of 2,642 problems written in Romanian, 11k solutions in C, C++ and Python, and a comprehensive test suite for each problem. RoCode is intended both as a benchmark for evaluating the code intelligence of language models trained on Romanian or multilingual text and as a fine-tuning set for pretrained Romanian models. Based on our results and a review of related work, we argue for the need to develop code models for languages other than English.