Large Language Models (LLMs) have significantly advanced the automation of software engineering tasks. One prominent example is code generation, where an LLM produces code in a specified programming language from a natural language description. Most research in this area has focused on high-resource languages, such as Python or Java, which benefit from abundant training data. A smaller body of work has explored low-resource languages, which are underrepresented in training corpora. In contrast, no-resource languages, for which LLMs have seen virtually no training data, remain largely unstudied. These languages often emerge in industry, where organizations develop proprietary or domain-specific languages that commercial tools like GitHub Copilot do not support. Companies must therefore deploy their own in-house code recommenders. To investigate possible solutions in this context, we build and release three code generation benchmarks for no-resource languages, based on two recently proposed programming languages for which very little training data is available. Using these benchmarks, we experiment with several approaches to teaching LLMs about no-resource languages, including prompt-based techniques as well as pre-training and fine-tuning that exploit the little data available. While further pre-training gives the largest performance gains for no-resource languages, applying it directly to instruction-tuned models harms their ability to follow instructions. To address this, we start from a base model, further pre-train it on the target language, and then inject instruction-following capabilities via weight diff transfer from an instruction-tuned model. This approach significantly improves code generation capabilities in no-resource settings, allowing companies to cheaply deploy a specialized instruct model without incurring the computational cost of instruction fine-tuning.
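The weight diff transfer step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the version below is an assumption: the instruction-following "diff" is taken as the element-wise difference between an instruction-tuned checkpoint and its base checkpoint, and that diff is added to the base model after further pre-training on the target language. The function name `transfer_instruction_diff`, the `Weights` alias, and the toy checkpoints are hypothetical, and real checkpoints would hold tensors rather than Python lists.

```python
from typing import Dict, List

# Hypothetical stand-in for a checkpoint: parameter name -> flattened values.
Weights = Dict[str, List[float]]


def transfer_instruction_diff(
    base: Weights, instruct: Weights, pretrained: Weights
) -> Weights:
    """Return pretrained + (instruct - base), parameter by parameter.

    Assumes all three checkpoints share one architecture, so parameter
    names and sizes line up exactly.
    """
    if not (base.keys() == instruct.keys() == pretrained.keys()):
        raise ValueError("checkpoints must share the same parameter names")

    merged: Weights = {}
    for name in base:
        b, i, p = base[name], instruct[name], pretrained[name]
        if not (len(b) == len(i) == len(p)):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # (iv - bv) is the instruction-tuning diff; add it to the
        # further pre-trained weights.
        merged[name] = [pv + (iv - bv) for bv, iv, pv in zip(b, i, p)]
    return merged


# Toy checkpoints with a single two-element parameter.
base = {"layer.weight": [1.0, 2.0]}
instruct = {"layer.weight": [1.5, 1.0]}    # base after instruction tuning
pretrained = {"layer.weight": [3.0, 2.0]}  # base after further pre-training

merged = transfer_instruction_diff(base, instruct, pretrained)
print(merged)  # {'layer.weight': [3.5, 1.0]}
```

Under this assumption, the merge is a single pass over the parameters and requires no gradient computation, which is consistent with the abstract's claim that the approach avoids the cost of instruction fine-tuning.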