Can Emulating Semantic Translation Help LLMs with Code Translation? A Study Based on Pseudocode

Large language models (LLMs) show great potential in code translation. However, the commonly adopted direct code-to-code approach, which converts a program into the target programming language (PL) in a single step, still struggles to translate accurately. Inspired by the success of intermediate steps in guiding LLMs through challenging tasks, we explore pseudocode-based code translation, which emulates human semantic translation by first interpreting the program's intent and logic into pseudocode and then implementing that pseudocode in the target PL. We find that pseudocode-based translation helps translate programs that direct translation struggles to handle. Nonetheless, the effectiveness, advantages, and limitations of this approach remain underexplored. To bridge this gap, we present an empirical study of pseudocode-based code translation that investigates how well it enhances direct translation, illuminates how to use it effectively, and identifies the limitations that hinder its potential benefits. By comparing the direct and pseudocode-based approaches on 9,690 translation tasks across six PLs with five popular LLMs, we demonstrate that pseudocode-based translation can effectively complement direct translation, particularly when translating from flexible to rigid PLs or when dealing with Rust, a low-resource language. Based on these findings, we suggest adopting strategies that combine the complementary strengths of both approaches to improve code translation accuracy. We also reveal the advantages of pseudocode-based translation in disentangling the translation of complicated programs and in mitigating distractions from implementation details in the original programs, as well as its limitations, which stem from incorrect, incomplete, or ambiguous pseudocode.
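The two approaches compared above, and one possible way of combining them, can be sketched as follows. This is a minimal illustration and not the study's implementation. The `llm` callable, the prompt wording, the `passes_tests` check, and the fallback order in `translate_combined` are all assumptions introduced here.

```python
from typing import Callable, Optional

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]


def translate_direct(llm: LLM, code: str, src: str, tgt: str) -> str:
    """Direct translation: source program to target PL in a single step."""
    return llm(f"Translate this {src} program to {tgt}:\n{code}")


def translate_via_pseudocode(llm: LLM, code: str, src: str, tgt: str) -> str:
    """Pseudocode-based translation: source -> pseudocode -> target PL."""
    # Step 1: interpret the program's intent and logic as pseudocode.
    pseudocode = llm(
        f"Describe the intent and logic of this {src} program as pseudocode:\n{code}"
    )
    # Step 2: implement the pseudocode in the target PL.
    # The original source is deliberately not passed along here.
    return llm(f"Implement this pseudocode in {tgt}:\n{pseudocode}")


def translate_combined(
    llm: LLM,
    code: str,
    src: str,
    tgt: str,
    passes_tests: Callable[[str], bool],
) -> Optional[str]:
    """Assumed combination: try direct first, then fall back to pseudocode."""
    for strategy in (translate_direct, translate_via_pseudocode):
        candidate = strategy(llm, code, src, tgt)
        if passes_tests(candidate):
            return candidate
    return None  # neither approach produced a passing translation
```

Withholding the original source in the second step is one way to keep the model from being distracted by its implementation details, which matches the advantage described in the abstract. Whether the fallback order used here is the best combination is left open.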
