It remains an open question how simultaneous interpretation (SI) data affects simultaneous machine translation (SiMT). Research has been limited by the lack of a large-scale training corpus. In this work, we aim to fill this gap by introducing NAIST-SIC-Aligned, an automatically aligned parallel English-Japanese SI dataset. Starting from the non-aligned corpus NAIST-SIC, we propose a two-stage alignment approach that makes the corpus parallel and thus suitable for model training. The first stage is coarse alignment, in which we perform a many-to-many mapping between source and target sentences; the second stage is fine-grained alignment, in which we perform intra- and inter-sentence filtering to improve the quality of the aligned pairs. To ensure the quality of the corpus, each step has been validated either quantitatively or qualitatively. This is the first open-source, large-scale parallel SI dataset in the literature. We also manually curated a small test set for evaluation. We hope our work advances research on SI corpus construction and SiMT. Our data is available at \url{https://github.com/mingzi151/AHC-SI}.