Agnostics: Learning to Code in Any Programming Language via Reinforcement with a Universal Learning Environment

Large language models (LLMs) already excel at writing code in high-resource languages such as Python and JavaScript, yet they stumble on low-resource languages that remain essential to science and engineering. Besides the obvious shortage of pre-training data, post-training itself is a bottleneck: every new language seems to require new datasets, test harnesses, and reinforcement-learning (RL) infrastructure. We introduce Agnostics, a language-agnostic post-training pipeline that eliminates this per-language engineering. The key idea is to judge code solely by its externally observable behavior, so that a single verifier can test solutions written in any language. Concretely, we (i) use an LLM to rewrite existing unit-test datasets into an I/O format, (ii) supply a short configuration that tells the verifier how to compile and run programs in a target language, and (iii) apply reinforcement learning with verifiable rewards (RLVR) in a robust code execution environment. Applied to five low-resource languages (Lua, Julia, R, OCaml, and Fortran), Agnostics (1) improves Qwen-3 4B to performance that rivals 16B-70B open-weight models; (2) scales cleanly to larger models and other model families (Qwen-3 8B, DeepSeek Coder 6.7B Instruct, Phi 4 Mini); and (3) for models with ≤16B parameters, sets new state-of-the-art pass@1 results on MultiPL-E and on a new multi-language version of LiveCodeBench that we introduce. We release the language-agnostic training datasets (Ag-MBPP-X, Ag-Codeforces-X, Ag-LiveCodeBench-X), training code, and ready-to-use configurations, making RL post-training in any programming language as simple as editing a short YAML file.
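To make the idea of judging code by its externally observable behavior concrete, the following is a minimal Python sketch of a verifier driven by a per-language configuration. It is an illustration under stated assumptions, not the released Agnostics implementation: the configuration fields (`file_name`, `compile`, `run`), the function `run_io_tests`, and the pass-fraction return value are all hypothetical, and the sketch omits the sandboxing that a robust execution environment would need.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical per-language configuration. The field names are illustrative
# and are NOT the schema of the released YAML files.
PYTHON_CONFIG = {
    "file_name": "main.py",
    "compile": None,                       # interpreted: no build step
    "run": [sys.executable, "main.py"],
}
# A compiled language would differ only in its configuration, for example
# (assuming gfortran is installed on a POSIX system):
#   {"file_name": "main.f90",
#    "compile": ["gfortran", "main.f90", "-o", "main"],
#    "run": ["./main"]}


def run_io_tests(source, config, tests, timeout=5.0):
    """Return the fraction of (stdin, expected_stdout) pairs the program passes.

    The program is treated as a black box: only its exit code and standard
    output are inspected, so the same function works for any language that
    the configuration knows how to build and run.
    """
    if not tests:
        return 0.0
    with tempfile.TemporaryDirectory() as workdir:
        with open(os.path.join(workdir, config["file_name"]), "w") as f:
            f.write(source)

        # Optional build step; a failed build fails every test.
        if config["compile"]:
            try:
                built = subprocess.run(
                    config["compile"], cwd=workdir,
                    capture_output=True, timeout=timeout,
                )
            except (OSError, subprocess.TimeoutExpired):
                return 0.0
            if built.returncode != 0:
                return 0.0

        passed = 0
        for stdin_text, expected in tests:
            try:
                result = subprocess.run(
                    config["run"], cwd=workdir, input=stdin_text,
                    capture_output=True, text=True, timeout=timeout,
                )
            except (OSError, subprocess.TimeoutExpired):
                continue  # crash-to-launch or timeout counts as a failure
            # Compare outputs modulo surrounding whitespace (an assumption).
            if result.returncode == 0 and result.stdout.strip() == expected.strip():
                passed += 1
        return passed / len(tests)
```

In this sketch, supporting another language means supplying a different configuration dictionary; the test cases and the checking logic stay unchanged. Whether an RL reward should be the pass fraction or a binary all-tests-pass signal is a design choice that the sketch leaves open.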
