FAME-MT Dataset: Formality Awareness Made Easy for Machine Translation Purposes

People use language for various purposes. Apart from sharing information, individuals may use it to express emotions or to show respect for another person. In this paper, we focus on the formality level of machine-generated translations and present FAME-MT, a dataset of 11.2 million translations between 15 European source languages and 8 European target languages, each classified as formal or informal according to the formality of the target sentence. The dataset can be used to fine-tune machine translation models so that they produce a given formality level in each of the European target languages considered. We describe how the dataset was created, analyze its quality to show that FAME-MT is a reliable source of language register information, and present a publicly available proof-of-concept machine translation model that uses the dataset to steer the formality level of its translations. Currently, it is the largest dataset of formality annotations, with examples in 112 European language pairs. The dataset is published online: https://github.com/laniqo-public/fame-mt/
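As a rough illustration of how formality-labelled translation pairs could feed a fine-tuning step, the sketch below prepends a control tag to each source sentence. This tagging scheme is a common way to steer MT output, but it is an assumption here, not necessarily the method used by the FAME-MT authors. The tag strings, the `Example` record, and the sample sentences are all hypothetical and do not reflect the dataset's actual file format.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical control tokens; the real model may use a different mechanism.
FORMALITY_TAGS = {"formal": "<formal>", "informal": "<informal>"}


@dataclass
class Example:
    """One labelled translation pair (illustrative schema, not the dataset's)."""
    source: str
    target: str
    formality: str  # "formal" or "informal", judged on the target sentence


def tag_source(example: Example) -> Tuple[str, str]:
    """Return a (tagged source, target) pair for fine-tuning."""
    tag = FORMALITY_TAGS[example.formality]
    return f"{tag} {example.source}", example.target


# Made-up English-to-German examples differing only in target formality.
examples: List[Example] = [
    Example("How are you?", "Wie geht es Ihnen?", "formal"),
    Example("How are you?", "Wie geht es dir?", "informal"),
]

tagged = [tag_source(e) for e in examples]
for src, tgt in tagged:
    print(src, "->", tgt)
```

At inference time, under the same assumption, the desired tag would be prepended to the input sentence to request a formal or informal translation.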

