Exploring the Impact of the Output Format on the Evaluation of Large Language Models for Code Translation

Code translation between programming languages is a long-standing and critical task in software engineering: it facilitates the modernization of legacy systems, ensures cross-platform compatibility, and enhances software performance. With recent advances in large language models (LLMs) and their application to code translation, there is an increasing need for comprehensive evaluation of these models. In this study, we empirically analyze the outputs generated by eleven popular instruction-tuned LLMs, with parameter counts ranging from 1B to 46.7B, on 3,820 translation pairs across five languages: C, C++, Go, Java, and Python. We find that between 26.4% and 73.7% of the code translations produced by the evaluated LLMs require post-processing, because they often mix code, quotes, and text rather than consisting purely of source code. Overlooking the output format of these models can inadvertently lead to underestimating their actual performance, which is particularly evident when they are evaluated with execution-based metrics such as Computational Accuracy (CA). Our results demonstrate that a strategic combination of prompt engineering and regular expressions can effectively extract the source code from the model output. In particular, our method helps the eleven selected models achieve an average Code Extraction Success Rate (CSR) of 92.73%. Our findings shed light on this issue and motivate future research toward more reliable benchmarks of LLMs for code translation.
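
To make the extraction step concrete, the following is a minimal sketch of regex-based code extraction from a model response that mixes prose and code. It is not the authors' actual pattern or pipeline. The names `FENCE_RE`, `extract_code`, and `extraction_success_rate` are hypothetical, the sketch assumes the model wraps its code in a Markdown-style fenced block, and the success rate shown is only one plausible reading of an extraction rate (the share of outputs from which some code could be pulled), not necessarily the paper's definition of CSR.

```python
import re
from typing import Optional, Sequence

# Build the fence marker programmatically so this snippet can itself sit inside a fence.
FENCE = "`" * 3

# Assumed output shape: fence, optional language tag, newline, code, closing fence.
FENCE_RE = re.compile(
    FENCE + r"[ \t]*([\w+#-]*)[ \t]*\r?\n(.*?)" + FENCE,
    re.DOTALL,
)


def extract_code(generation: str) -> Optional[str]:
    """Return the body of the first fenced block, or None if no fence is found."""
    match = FENCE_RE.search(generation)
    if match is None:
        return None
    code = match.group(2).strip()
    return code or None  # treat an empty block as a failed extraction


def extraction_success_rate(generations: Sequence[str]) -> float:
    """Percentage of generations from which a non-empty code block was extracted."""
    if not generations:
        return 0.0
    hits = sum(1 for g in generations if extract_code(g) is not None)
    return 100.0 * hits / len(generations)
```

A response such as "Sure! Here is the translation:" followed by a fenced block and a closing remark would yield only the block's contents, while a response with no fence would count as a failed extraction. In practice, a prompt that instructs the model to answer with a single fenced block makes a pattern like this far more likely to match, which is the intuition behind pairing prompt engineering with regular expressions.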
