Generating accurate, executable code with Large Language Models (LLMs) remains a significant challenge for underrepresented programming languages such as Prolog and Lisp, because far less public training data exists for them than for high-resource languages like Python. This paper introduces a generalizable Reinforcement Learning (RL) approach that combines small-scale versions of the Qwen2.5-Coder model with Group Relative Policy Optimization (GRPO) to enable effective code generation through reasoning. To compensate for the scarcity of training data, we integrate execution-driven feedback directly into the RL loop, using a reward system that scores both logical correctness and structural formatting. Experimental results on the GSM8K dataset demonstrate significant improvements in reasoning quality and code accuracy across underrepresented languages. These findings underscore the potential of our approach, which leverages symbolic reasoning and interpreter-based feedback, to benefit a wide range of programming languages that lack extensive training resources.
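To make the reward design concrete, the following is a minimal Python sketch of how an execution-driven reward and a group-relative advantage could fit together. It is an illustration under stated assumptions, not the implementation used in this work: the `<reasoning>`/`<code>` tag format, the weights `w_format` and `w_correct`, and the `run_program` callable (a stand-in for invoking a Prolog or Lisp interpreter) are all hypothetical. The advantage follows the commonly described GRPO normalization, in which each reward is standardized against the other completions sampled for the same prompt.

```python
import re
import statistics
from typing import Callable, List, Optional

# Assumed (hypothetical) output format: reasoning in <reasoning> tags,
# followed by the generated program in <code> tags.
FORMAT_RE = re.compile(
    r"<reasoning>.*?</reasoning>\s*<code>(.*?)</code>", re.DOTALL
)


def extract_code(completion: str) -> Optional[str]:
    """Return the program text if the completion follows the assumed format."""
    match = FORMAT_RE.search(completion)
    return match.group(1).strip() if match else None


def reward(
    completion: str,
    expected: str,
    run_program: Callable[[str], str],  # hypothetical interpreter wrapper
    w_format: float = 0.2,   # illustrative weight, not taken from the paper
    w_correct: float = 1.0,  # illustrative weight, not taken from the paper
) -> float:
    """Score structural formatting first, then execution correctness."""
    code = extract_code(completion)
    if code is None:
        return 0.0  # malformed output earns nothing
    score = w_format
    try:
        output = run_program(code)
    except Exception:
        return score  # well-formed but the program failed to run
    if output is not None and output.strip() == expected.strip():
        score += w_correct
    return score


def group_relative_advantages(
    rewards: List[float], eps: float = 1e-6
) -> List[float]:
    """Standardize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; a sketch-level choice
    return [(r - mean) / (std + eps) for r in rewards]
```

In practice `run_program` would wrap a sandboxed interpreter call with a timeout. Passing it in as a parameter keeps the reward logic independent of any particular target language, which matches the language-agnostic framing of the approach.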