On the Evaluation of Neural Code Translation: Taxonomy and Benchmark

In recent years, neural code translation has gained increasing attention. While most research focuses on improving model architectures and training processes, we observe that the evaluation process and benchmarks for code translation models are severely limited: they primarily treat source code as natural language and report a single holistic accuracy score, disregarding the full spectrum of model capabilities across translation types and levels of complexity. In this paper, we present a comprehensive investigation of four state-of-the-art models and analyze in depth the advantages and limitations of three existing benchmarks. Based on the empirical results, we develop a taxonomy that categorizes code translation tasks into four primary types according to their complexity and knowledge dependence: token level (type 1), syntactic level (type 2), library level (type 3), and algorithm level (type 4). We then conduct a thorough analysis of how existing approaches perform across these four categories. Our findings indicate that while state-of-the-art code translation models excel at type-1 and type-2 translations, they struggle with knowledge-dependent ones such as type-3 and type-4. Existing benchmarks, moreover, are biased towards trivial translations such as keyword mapping. To overcome these limitations, we construct G-TransEval, a new benchmark, by manually curating type-3 and type-4 translation pairs and unit test cases. Results on the new benchmark suggest that G-TransEval reveals the capabilities of code translation models more comprehensively and at a finer granularity, and thus provides a more rigorous evaluation. Our study also yields findings and suggestions for future research, such as building type-3 and type-4 training data and ensembling multiple pretraining approaches.
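The contrast between a single holistic score and a per-type breakdown can be made concrete with a small sketch. The four type labels below come from the taxonomy above; everything else (the `TranslationType` enum, the `pass_rate_by_type` helper, and the toy outcomes) is hypothetical and invented for illustration. It is not the evaluation code of G-TransEval, and the numbers are not results from the paper.

```python
from collections import defaultdict
from enum import IntEnum


class TranslationType(IntEnum):
    """The four taxonomy levels named in the abstract."""
    TOKEN = 1      # type 1: token level
    SYNTACTIC = 2  # type 2: syntactic level
    LIBRARY = 3    # type 3: library level
    ALGORITHM = 4  # type 4: algorithm level


def pass_rate_by_type(results):
    """Group (type, passed) records and return the pass rate per type."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for translation_type, ok in results:
        totals[translation_type] += 1
        passed[translation_type] += int(ok)
    return {t: passed[t] / totals[t] for t in sorted(totals)}


# Hypothetical unit-test outcomes, made up for this sketch only.
toy_results = [
    (TranslationType.TOKEN, True),
    (TranslationType.TOKEN, True),
    (TranslationType.SYNTACTIC, True),
    (TranslationType.SYNTACTIC, False),
    (TranslationType.LIBRARY, False),
    (TranslationType.LIBRARY, True),
    (TranslationType.ALGORITHM, False),
    (TranslationType.ALGORITHM, False),
]

# Per-type view: one pass rate for each taxonomy level.
per_type = pass_rate_by_type(toy_results)

# Holistic view: a single number that hides the per-type spread.
overall = sum(ok for _, ok in toy_results) / len(toy_results)

for translation_type, rate in per_type.items():
    print(f"type {int(translation_type)} ({translation_type.name.lower()}): {rate:.2f}")
print(f"overall: {overall:.2f}")
```

In this toy data the overall score sits in the middle of the range while the per-type rates span from all-pass to all-fail, which is the kind of information a holistic accuracy score cannot convey.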
