In recent years, Large Language Models (LLMs) have dramatically advanced automated code translation, pushing computational accuracy above 80% on many existing benchmarks. However, most code samples in these benchmarks are short, standalone, statement- or method-level, and algorithmic, which does not align with practical coding tasks. The actual capability of LLMs in translating code written for daily development therefore remains unknown. To fill this gap, we construct a class-level code translation benchmark, ClassEval-T, and make the first attempt to extensively assess recent LLMs' performance on class-level code translation. ClassEval-T extends ClassEval, a well-known class-level Python code generation benchmark covering practical coding topics, such as database operation and game design, as well as diverse contextual dependencies (e.g., fields, methods, and libraries). The manual migration to Java and C++, yielding complete code samples and associated test suites, took 360 person-hours. Subsequently, we design three translation strategies (i.e., holistic, min-dependency, and standalone) for class-level code translation and evaluate eight recent LLMs, spanning commercial, general-purpose, and code-specialized models of diverse families and sizes, on ClassEval-T. Experimental results show a remarkable performance drop compared with the most widely studied method-level code translation benchmarks, along with clear discrepancies among LLMs, demonstrating ClassEval-T's effectiveness in measuring recent LLMs. We further discuss the usage scenarios of the three translation strategies and LLMs' dependency awareness when translating class samples. Finally, we analyze and categorize 1,243 failure cases made by the best-performing LLM under test, providing practical guidance and insights for future research.