Algorithm-Based Pipeline for Reliable and Intent-Preserving Code Translation with LLMs

Code translation, the automatic conversion of programs between languages, is a growing use case for Large Language Models (LLMs). However, direct one-shot translation often fails to preserve program intent, leading to errors in control flow, type handling, and I/O behavior. We propose an algorithm-based pipeline that introduces a language-neutral intermediate specification to capture these details before code generation. This study empirically evaluates the extent to which structured planning can improve translation accuracy and reliability relative to direct translation. We conduct an automated paired experiment that compares the two approaches, direct and algorithm-based, on translation between Python and Java in both directions, using five widely used LLMs on the Avatar and CodeNet datasets. For each combination of model, dataset, approach, and direction, we compile and execute the translated program and run the provided tests. We record compilation results, runtime behavior, timeouts (e.g., infinite loops), and test outcomes. We compute accuracy from these tests, counting a translation as correct only if it compiles, runs without exceptions or timeouts, and passes all tests. We then map every compile-time and runtime failure to a unified, language-aware taxonomy and compare subtype frequencies between the direct and algorithm-based approaches. Overall, the algorithm-based approach increases micro-average accuracy from 67.7% to 78.5%, a gain of 10.8 percentage points. It eliminates lexical and token errors entirely (a 100% reduction), reduces incomplete constructs by 72.7%, and reduces structural and declaration issues by 61.1%. It also lowers runtime dependency and entry-point failures by 78.4%. These results demonstrate that algorithm-based pipelines enable more reliable, intent-preserving code translation, providing a foundation for robust multilingual programming assistants.
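The correctness criterion and the micro-average described above can be illustrated with a minimal Python sketch. This is not the study's evaluation harness: the `TranslationResult` record, its field names, and the helper functions are hypothetical, and the sketch assumes that "micro-average" means pooling every translation across all model, dataset, and direction combinations before dividing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranslationResult:
    """Outcome of one translated program (hypothetical record layout)."""
    compiled: bool
    raised_exception: bool
    timed_out: bool
    tests_passed: int
    tests_total: int


def is_correct(result: TranslationResult) -> bool:
    """Correct only if it compiles, runs cleanly, and passes every test."""
    return (
        result.compiled
        and not result.raised_exception
        and not result.timed_out
        and result.tests_passed == result.tests_total
    )


def micro_average_accuracy(groups: dict) -> float:
    """Pool all results across groups, then divide correct by total.

    `groups` maps an arbitrary key, e.g. (model, dataset, direction),
    to a list of TranslationResult objects (assumed layout).
    """
    total = sum(len(results) for results in groups.values())
    if total == 0:
        raise ValueError("no results to score")
    correct = sum(is_correct(r) for results in groups.values() for r in results)
    return correct / total


def relative_reduction(before: int, after: int) -> float:
    """Fraction by which an error count dropped, e.g. 0.5 for a 50% drop."""
    if before <= 0:
        raise ValueError("baseline count must be positive")
    return (before - after) / before


# Illustrative data only; these are not figures from the study.
ok = TranslationResult(True, False, False, 5, 5)
compile_fail = TranslationResult(False, False, False, 0, 5)
groups = {
    ("model-a", "avatar", "py2java"): [ok, ok, compile_fail],
    ("model-a", "codenet", "java2py"): [ok],
}
accuracy = micro_average_accuracy(groups)      # 3 correct out of 4 pooled
reduction = relative_reduction(before=10, after=4)
```

Under this pooling assumption, larger groups weigh more than smaller ones, which is what distinguishes a micro-average from an average of per-group accuracies. The error-category figures in the abstract are relative reductions of the kind `relative_reduction` computes, whereas the accuracy gain is a difference between two percentages: 78.5 minus 67.7 gives 10.8 percentage points.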
