In recent years, Large Language Models (LLMs) have significantly improved automated code translation, often achieving over 80% accuracy on existing benchmarks. However, most of these benchmarks consist of short, standalone, algorithmic samples that do not reflect practical coding tasks. To address this gap, we introduce ClassEval-T, a class-level code translation benchmark designed to assess LLM performance on real-world coding scenarios. Built upon ClassEval, a class-level Python code generation benchmark covering topics such as database operations and game design, ClassEval-T extends into Java and C++ with complete code samples and test suites, requiring 360 person-hours of manual migration. We propose three translation strategies (holistic, min-dependency, and standalone) and evaluate six recent LLMs of various families and sizes on ClassEval-T. Results reveal a significant performance drop compared to method-level benchmarks, highlighting discrepancies among LLMs and demonstrating ClassEval-T's effectiveness. We further analyze LLMs' dependency awareness when translating class samples and categorize 1,397 failure cases made by the best-performing LLM, yielding practical insights and directions for future improvement.