A key consideration when training an LLM is whether the target language is high- or low-resource, be it English versus Welsh or Python versus Excel. Typical training data for programming languages consist of real program demonstrations paired with human-written comments. Here we present novel approaches to creating such data for low-resource programming languages. Using a teacher model, we generate fully synthetic, textbook-quality demonstrations of common library functions in an example domain of Excel formulas. We then finetune an underperforming student model and show improvement on two question-answering datasets recast into the Excel domain. We also show the advantages of finetuning over standard off-the-shelf RAG approaches, which offer only modest improvement due to the unfamiliarity of the target domain.
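The generation step described above (prompting a teacher model for a textbook-style demonstration of each target function, then collecting the outputs as finetuning data) could look roughly like the sketch below. This is a minimal illustration under stated assumptions, not the paper's actual implementation: the prompt wording, the function list, the output file name, and the `teacher_generate` helper are all hypothetical stand-ins for whatever teacher-model API is used.

```python
# Minimal sketch of the synthetic-demonstration pipeline: prompt a teacher
# model once per Excel function, then write the results as JSONL finetuning
# data. `teacher_generate` is a hypothetical placeholder, not a real API.

import json

# Assumed target functions; the paper's actual coverage may differ.
EXCEL_FUNCTIONS = ["VLOOKUP", "SUMIF", "INDEX", "MATCH", "TEXTJOIN"]

# Assumed prompt wording aiming for "textbook-quality" demonstrations.
PROMPT_TEMPLATE = (
    "Write a short, textbook-quality tutorial for the Excel function {fn}. "
    "Explain each argument, then give a worked formula with sample cell "
    "values and the expected result."
)

def teacher_generate(prompt: str) -> str:
    """Hypothetical stand-in for a teacher-model completion call."""
    raise NotImplementedError("replace with your LLM provider's API call")

def build_corpus(functions: list[str], path: str = "excel_demos.jsonl") -> None:
    """Generate one demonstration per function and save them as JSONL."""
    with open(path, "w", encoding="utf-8") as f:
        for fn in functions:
            demo = teacher_generate(PROMPT_TEMPLATE.format(fn=fn))
            f.write(json.dumps({"function": fn, "text": demo}) + "\n")

if __name__ == "__main__":
    build_corpus(EXCEL_FUNCTIONS)
```

The resulting JSONL records would then serve as the finetuning set for the student model; the finetuning and RAG-comparison steps are out of scope for this sketch.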