TransLibEval: Demystify Large Language Models' Capability in Third-party Library-targeted Code Translation

Pengyu Xue, Kunwu Zheng, Zhen Yang, Yifei Pei, Linhao Wu, Jiahui Dong, Xiapu Luo, Yan Xiao, Fei Liu, Yuxuan Zhang, Xiran Lyu, Xianhang Li, Xuanyu Zhu, Chengyi Wang

From arXiv; 24 pages, 5 figures; accepted by FSE 2026 (The ACM International Conference on the Foundations of Software Engineering)

In recent years, Large Language Models (LLMs) have been widely studied for code translation at the method, class, and even repository levels. However, most existing benchmarks are limited in the categories and scale of the Third-Party Libraries (TPLs) they cover, which makes TPL-related errors hard to expose and hinders the development of targeted solutions. Given the heavy dependence (over 90%) on TPLs in practical programming, demystifying and analyzing LLMs' code translation performance across various TPLs is imperative. To address this gap, we construct TransLibEval, the first benchmark dedicated to library-centric code translation. It consists of 200 real-world tasks across Python, Java, and C++, each explicitly involving TPLs from diverse categories such as data processing, machine learning, and web development, with comprehensive dependency coverage and high-coverage test suites. We evaluate seven recent LLMs from commercial, general, and code-specialized families under six translation strategies in three categories: Direct, IR-guided, and Retrieval-augmented. Experimental results show a dramatic performance drop compared with library-free settings (average CA decline of over 60%), while the strategies show heterogeneous advantages. Furthermore, we analyze 4,831 failed cases from GPT-4o, one of the state-of-the-art (SOTA) LLMs, revealing numerous third-party reference errors that were previously obscured. These findings highlight the unique challenges of library-centric translation and provide practical guidance for improving TPL-aware code intelligence.
