Bridge-Coder: Unlocking LLMs' Potential to Overcome Language Gaps in Low-Resource Code

Large Language Models (LLMs) demonstrate strong proficiency in generating code for high-resource programming languages (HRPLs) like Python but struggle significantly with low-resource programming languages (LRPLs) such as Racket or D. This performance gap deepens the digital divide, preventing developers using LRPLs from benefiting equally from LLM advancements and reinforcing disparities in innovation within underrepresented programming communities. While generating additional training data for LRPLs is promising, it faces two key challenges: manual annotation is labor-intensive and costly, and LLM-generated LRPL code is often of subpar quality. The underlying cause of this issue is the gap between natural language and programming language (the NL-PL Gap), which is especially pronounced in LRPLs due to limited aligned data. In this work, we introduce a novel approach called Bridge-Coder, which leverages LLMs' intrinsic capabilities to enhance performance on LRPLs. Our method consists of two key stages. In the first, Bridge Generation, we create a high-quality dataset by utilizing LLMs' general knowledge understanding, proficiency in HRPLs, and in-context learning abilities. In the second, Bridged Alignment, we progressively improve the alignment between NL instructions and LRPLs. Experimental results across multiple LRPLs show that Bridge-Coder significantly enhances model performance, demonstrating the effectiveness and generalizability of our approach. Furthermore, we offer a detailed analysis of the key components of our method, providing valuable insights for future work aimed at addressing the challenges associated with LRPLs.
