Task-oriented dialogue research has mainly focused on a few popular languages such as English and Chinese, because creating a dataset for a new language is expensive. To reduce this cost, we manually post-edit automatically translated data. We create a new multilingual benchmark, X-RiSAWOZ, by translating the Chinese RiSAWOZ dataset into four languages (English, French, Hindi, and Korean) and a code-mixed English-Hindi variety. X-RiSAWOZ has more than 18,000 human-verified dialogue utterances for each language and, unlike most prior multilingual work, is an end-to-end dataset for building fully functioning agents. The many difficulties we encountered in creating X-RiSAWOZ led us to develop a toolset that accelerates the post-editing of a newly translated dataset. This toolset improves machine translation with a hybrid entity alignment technique that combines neural and dictionary-based methods, together with many automated and semi-automated validation checks. We establish strong baselines for X-RiSAWOZ by training dialogue agents in zero- and few-shot settings, where limited gold data is available in the target language. Our results suggest that our translation and post-editing methodology and toolset can be used to create new high-quality multilingual dialogue agents cost-effectively. Our dataset, code, and toolkit are released as open source.