Code translation between programming languages (PLs) is a critical task in software engineering, facilitating the modernization of legacy systems, ensuring cross-platform compatibility, and enhancing software performance. Most existing studies instruct LLMs to perform code translation and evaluate their performance either by running the generated outputs against test suites or by comparing them to reference outputs (ground truth). These outputs, however, may contain not only executable source code but also additional non-code elements, such as natural language explanations or formatting tokens. We refer to the combination of source code and non-code elements as the output format. Understanding and addressing variations in output format is crucial, as non-code elements can interfere with evaluation metrics, leading to biased assessments of model performance and unfair comparisons between models. We conduct an empirical analysis of the outputs from eleven instruction-tuned open-source LLMs across five PLs: C, C++, Go, Java, and Python. The results show that between 26.4% and 73.7% of the outputs produced by the evaluated LLMs require post-processing. To mitigate output format bias, we propose a strategic combination of prompt engineering and regular expressions that effectively extracts source code from mixed-format outputs, enabling the eleven open-source models to achieve an average Code Extraction Success Rate (CSR) of 92.73%. Our empirical study confirms that output format bias affects widely used execution-based metrics, i.e., Computational Accuracy (CA), and text-based metrics, i.e., BLEU, CodeBLEU, and CrystalBLEU. Additionally, we test five closed-source LLMs and observe that they also generate varying distributions of output formats, which could likewise lead to output format bias. Our results highlight the need to mitigate output format bias to enable reliable evaluations of LLMs for code translation.
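To illustrate the kind of extraction step described above, the following is a minimal Python sketch, not the authors' exact implementation: the fence pattern and the `extract_code` helper are assumptions. It strips a Markdown code fence (with an optional language tag) and any surrounding natural-language explanation from a mixed-format LLM output, and falls back to the raw output when no fence is present.

```python
import re

# Illustrative sketch (assumed helper, not the paper's exact regex):
# match a Markdown fence with an optional language tag, e.g. ```java ... ```
FENCE_PATTERN = re.compile(r"```[a-zA-Z0-9+#]*\n(.*?)```", re.DOTALL)

def extract_code(output: str) -> str:
    """Return the first fenced code block if present, else the raw output."""
    match = FENCE_PATTERN.search(output)
    if match:
        return match.group(1).strip()
    # No fence found: assume the whole output is code (may still need cleanup).
    return output.strip()

if __name__ == "__main__":
    sample = (
        "Here is the translated Java code:\n"
        "```java\n"
        "public class Add { static int add(int a, int b) { return a + b; } }\n"
        "```\n"
        "This version preserves the original semantics."
    )
    print(extract_code(sample))  # prints only the Java source, no explanation
```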