The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks. Existing resources often suffer from semantic drift and context loss, which can lead to misleading performance metrics. In this work, we present a fully automated framework designed to address these challenges by enabling scalable, high-quality translation of datasets and benchmarks. We demonstrate that adapting test-time compute scaling strategies, specifically Universal Self-Improvement (USI) and our proposed multi-round ranking method, T-RANK, yields significantly higher-quality outputs than traditional pipelines. Our framework ensures that benchmarks preserve their original task structure and linguistic nuances during localization. We apply this approach to translate popular benchmarks and datasets into eight Eastern and Southern European languages (Ukrainian, Bulgarian, Slovak, Romanian, Lithuanian, Estonian, Turkish, and Greek). Evaluations using both reference-based metrics and LLM-as-a-judge show that our translations surpass existing resources, resulting in more accurate downstream model assessment. We release both the framework and the improved benchmarks to facilitate robust and reproducible multilingual AI development.