COBOL remains a critical language for mainframe systems, yet existing large language models (LLMs) struggle to generate and translate COBOL code correctly. This paper reports our experience in developing and evaluating domain-adapted LLMs for COBOL and mainframe software engineering. We introduce (1) an automated data curation pipeline that combines compiler-guided validation with multi-stage similarity-based filtering to construct high-quality COBOL training data, and (2) COBOL-Coder, a COBOL-specialized LLM fine-tuned on the curated data. We evaluate COBOL-Coder on two tasks: code generation (on COBOLEval and COBOLCodeBench) and code translation (on COBOL-JavaTrans, our proposed benchmark for bidirectional COBOL-Java translation). In our experiments, COBOL-Coder achieves up to a 73.95% compilation success rate and a Pass@1 of 49.33 on COBOLEval, compared to 41.8% and 16.4 for GPT-4o, while most open-source baselines (e.g., CodeGemma, CodeLlama, StarCoder2) fail to produce compilable programs. For Java-to-COBOL translation, COBOL-Coder reaches a Pass@1 of 34.93, whereas general-purpose LLMs achieve near-zero scores. To assess the usability of LLM-generated code in real-world settings, we conduct a survey with experienced COBOL developers. Participants consistently report that COBOL-Coder exhibits stronger COBOL awareness, produces more reliable program structure, and is better aligned with enterprise practices than general-purpose LLMs.
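As an illustration of how compiler-guided validation might be combined with staged similarity filtering, the following is a minimal sketch and not the pipeline used in this work. Everything in it is an assumption made for illustration: the function names (`curate`, `compiles_with_gnucobol`, `shingles`, `jaccard`), the choice of GnuCOBOL's `cobc -fsyntax-only` as the validator, exact-match hashing as the first filtering stage, token-shingle Jaccard similarity as the second, and the 0.8 threshold.

```python
import hashlib
import re
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Iterable, List, Set, Tuple

Shingle = Tuple[str, ...]


def compiles_with_gnucobol(source: str) -> bool:
    """Assumed validator: syntax-check the program with GnuCOBOL (`cobc`)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "prog.cbl"
        path.write_text(source)
        try:
            result = subprocess.run(
                ["cobc", "-fsyntax-only", str(path)],
                capture_output=True,
                timeout=30,
            )
        except (FileNotFoundError, subprocess.TimeoutExpired):
            # Treat a missing compiler or a hang as "not validated".
            return False
        return result.returncode == 0


def normalize(source: str) -> str:
    """Collapse case and whitespace so trivially different copies match."""
    return " ".join(source.upper().split())


def shingles(source: str, k: int = 3) -> Set[Shingle]:
    """Return the set of k-token windows over COBOL-like word tokens."""
    tokens = re.findall(r"[A-Z0-9-]+", source.upper())
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Jaccard similarity of two shingle sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def curate(
    programs: Iterable[str],
    validator: Callable[[str], bool] = compiles_with_gnucobol,
    threshold: float = 0.8,
) -> List[str]:
    """Keep programs that are unique, not near-duplicates, and that validate."""
    kept: List[str] = []
    seen_hashes: Set[str] = set()
    kept_shingles: List[Set[Shingle]] = []
    for source in programs:
        # Stage 1: drop exact duplicates after normalization.
        digest = hashlib.sha256(normalize(source).encode()).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        # Stage 2: drop near-duplicates of programs already kept.
        sig = shingles(source)
        if any(jaccard(sig, other) >= threshold for other in kept_shingles):
            continue
        # Stage 3: run the (comparatively expensive) compiler check last.
        if not validator(source):
            continue
        kept.append(source)
        kept_shingles.append(sig)
    return kept
```

Calling `curate(programs)` with the default validator requires `cobc` on the `PATH`; passing a different callable as `validator` lets the same filtering logic be exercised without a compiler. The pairwise Jaccard comparison is quadratic in the number of kept programs, so a corpus of realistic size would likely need an approximate scheme such as MinHash instead.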