Can Emulating Semantic Translation Help LLMs with Code Translation? A Study Based on Pseudocode

Although large language models (LLMs) show promise in code translation, they still struggle to produce accurate translations with the commonly adopted direct code-to-code approach, which converts an original program into the target programming language (PL) in a single step. Inspired by the success of intermediate steps in guiding LLMs through challenging tasks, we explore pseudocode-based code translation. This approach emulates human semantic translation by first expressing the original program's intent and logic as pseudocode and then implementing that pseudocode in the target PL. To understand the effectiveness of this underexplored approach, we present a systematic empirical study of pseudocode-based code translation that investigates how much it helps beyond direct translation, clarifies how to use it effectively, and identifies its limitations. Comparing direct and pseudocode-based translation on 9,690 translation tasks across six PLs with five popular LLMs, we find that pseudocode-based translation effectively complements direct translation, particularly when translating from flexible to rigid PLs and when handling a low-training-resource PL. Based on these findings, we suggest generating translations with both approaches and selecting among them with tests, so as to leverage their complementary strengths. We also show that pseudocode-based translation has two advantages: it decouples the code understanding burden from the code generation burden on complicated programs, and it mitigates distractions from PL-specific implementations in the original program. Its limitations stem from incorrect, incomplete, or ambiguous pseudocode. Our study sheds light on the effective use of pseudocode-based translation and provides evidence to help enhance LLMs in code translation.
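
The two translation approaches and the suggested test-based selection can be illustrated with a minimal Python sketch. This is not the study's implementation: the `LLM` callable type, the prompt wording, the function names, and the `passes_tests` callback are all hypothetical and introduced here only for illustration.

```python
from typing import Callable, List, Optional

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]


def translate_direct(llm: LLM, source: str, src_lang: str, tgt_lang: str) -> str:
    """Direct approach: one code-to-code step."""
    return llm(f"Translate the following {src_lang} program into {tgt_lang}:\n{source}")


def translate_via_pseudocode(llm: LLM, source: str, src_lang: str, tgt_lang: str) -> str:
    """Pseudocode-based approach: source -> pseudocode -> target."""
    # Step 1: capture intent and logic, independent of the source language.
    pseudocode = llm(
        f"Describe the following {src_lang} program as language-neutral pseudocode:\n{source}"
    )
    # Step 2: implement the pseudocode in the target language.
    return llm(f"Implement the following pseudocode in {tgt_lang}:\n{pseudocode}")


def translate_with_selection(
    llm: LLM,
    source: str,
    src_lang: str,
    tgt_lang: str,
    passes_tests: Callable[[str], bool],
) -> Optional[str]:
    """Generate both candidates and return the first that passes the tests."""
    candidates: List[str] = [
        translate_direct(llm, source, src_lang, tgt_lang),
        translate_via_pseudocode(llm, source, src_lang, tgt_lang),
    ]
    for candidate in candidates:
        if passes_tests(candidate):
            return candidate
    return None  # neither approach produced a passing translation
```

In this sketch, `passes_tests` stands in for whatever harness compiles and runs a candidate against the available test cases. Returning `None` when both candidates fail keeps that outcome explicit for the caller.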
