We present ACADATA, a high-quality parallel dataset for academic translation that consists of two subsets: ACAD-TRAIN, which contains approximately 1.5 million author-generated paragraph pairs across 96 language directions, and ACAD-BENCH, a curated evaluation set of almost 6,000 translations covering 12 directions. To validate its utility, we fine-tune two Large Language Models (LLMs) on ACAD-TRAIN and benchmark them on ACAD-BENCH against specialized machine-translation systems, general-purpose open-weight LLMs, and several large-scale proprietary models. Experimental results demonstrate that fine-tuning on ACAD-TRAIN improves academic translation quality by +6.1 and +12.4 d-BLEU points on average for 7B and 2B models, respectively, while also improving general-domain long-context translation by up to 24.9% when translating out of English. The top-performing fine-tuned model surpasses the best proprietary and open-weight models on the academic translation domain. By releasing ACAD-TRAIN, ACAD-BENCH, and the fine-tuned models, we provide the community with a valuable resource to advance research in academic-domain and long-context translation.