Recent advances in large language models (LLMs) have demonstrated impressive capabilities in code translation, typically evaluated on benchmarks such as CodeTransOcean and RepoTransBench. However, dependency-free benchmarks focus primarily on simple function-level translations and overlook repository-level context (e.g., dependencies), failing to capture real-world complexity. Full-repository translation benchmarks, in turn, far exceed the capabilities of current models, producing performance bottlenecks that yield little actionable insight for guiding model development. Furthermore, existing benchmarks do not account for the scenario of incrementally translating new or modified modules from the source language to the target language, which demands careful handling of repository-level context such as dependencies, cross-module references, and architectural divergence. Moreover, the effectiveness of LLMs in translating into newer, low-resource languages such as Rust remains largely underexplored. To address these gaps, we introduce RustRepoTrans, the first repository-level-context code translation benchmark targeting incremental translation, comprising 375 tasks translating into Rust from C, Java, and Python. Using this benchmark, we evaluate seven representative LLMs and analyze their errors to assess their limitations in complex translation scenarios. Among them, DeepSeek-R1 performs best with 51.5% Pass@1, excelling in both basic functionality and additional translation abilities such as noise robustness and identification of syntactic differences. However, even DeepSeek-R1 suffers a 22.2 percentage-point drop in Pass@1 (from 73.7% to 51.5%) when handling repository-level context, compared with previous benchmarks that lack such context.