We present ACADATA, a high-quality parallel dataset for academic translation that consists of two subsets: ACAD-TRAIN, which contains approximately 1.5 million author-generated paragraph pairs across 96 language directions, and ACAD-BENCH, a curated evaluation set of almost 6,000 translations covering 12 directions. To validate its utility, we fine-tune two Large Language Models (LLMs) on ACAD-TRAIN and benchmark them on ACAD-BENCH against specialized machine-translation systems, general-purpose open-weight LLMs, and several large-scale proprietary models. Experimental results demonstrate that fine-tuning on ACAD-TRAIN improves academic translation quality by +6.1 and +12.4 d-BLEU points on average for 7B and 2B models, respectively, while also improving general-domain long-context translation by up to 24.9% when translating out of English. The top-performing fine-tuned model surpasses the best proprietary and open-weight models on the academic translation domain. By releasing ACAD-TRAIN, ACAD-BENCH, and the fine-tuned models, we provide the community with a valuable resource to advance research in academic-domain and long-context translation.