ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications

Intelligibility · Speech recognition · Controller · Automatic speech recognition · Institute for AI Industry Research, Tsinghua University

June 15, 2023


Juan Zuluaga-Gomez, Karel Veselý, Igor Szöke, Alexander Blatt, Petr Motlicek, Martin Kocour, Mickael Rigault, Khalid Choukri, Amrutha Prasad, Seyyed Saeed Sarfjoo, Iuliia Nigmatulina, Claudia Cevenini, Pavel Kolčárek, Allan Tart, Jan Černocký, Dietrich Klakow

From arXiv; manuscript under review. The code is available at: https://github.com/idiap/atco2-corpus

Personal assistants, automatic speech recognizers and dialogue understanding systems are becoming more critical in our interconnected digital world. A clear example is air traffic control (ATC) communications. ATC aims at guiding aircraft and controlling the airspace in a safe and optimal manner. These voice-based dialogues are carried out between an air traffic controller (ATCO) and pilots via very-high-frequency radio channels. To incorporate these novel technologies into ATC, a low-resource domain, large-scale annotated datasets are required to develop data-driven AI systems. Two examples are automatic speech recognition (ASR) and natural language understanding (NLU). In this paper, we introduce the ATCO2 corpus, a dataset that aims at fostering research on the challenging ATC field, which has lagged behind due to a lack of annotated data. The ATCO2 corpus covers 1) data collection and pre-processing, 2) pseudo-annotations of speech data, and 3) extraction of ATC-related named entities. The ATCO2 corpus is split into three subsets. 1) The ATCO2-test-set corpus contains 4 hours of ATC speech with manual transcripts and a subset with gold annotations for named-entity recognition (callsign, command, value). 2) The ATCO2-PL-set corpus consists of 5281 hours of unlabeled ATC data enriched with automatic transcripts from an in-domain speech recognizer, contextual information, speaker turn information, a signal-to-noise ratio estimate and an English language detection score per sample. Both are available for purchase through ELDA at http://catalog.elra.info/en-us/repository/browse/ELRA-S0484. 3) The ATCO2-test-set-1h corpus is a one-hour subset of the original test set corpus, which we offer for free at https://www.atco2.org/data. We expect the ATCO2 corpus will foster research on robust ASR and NLU not only in the field of ATC communications but also in the general research community.
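The abstract notes that each pseudo-labeled sample carries a signal-to-noise ratio estimate and an English language detection score. One plausible use of such metadata is to keep only the cleaner, more likely English utterances before training. The sketch below is illustrative only: the `Sample` record, its field names, the assumed score range and the threshold values are all hypothetical and are not taken from the corpus or its code release.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Sample:
    """One pseudo-labeled utterance. All field names here are hypothetical."""
    utterance_id: str
    transcript: str       # automatic transcript produced by an ASR system
    snr_db: float         # signal-to-noise ratio estimate, assumed to be in dB
    english_score: float  # English detection score, assumed to lie in [0, 1]


def select_samples(
    samples: Iterable[Sample],
    min_snr_db: float = 10.0,        # illustrative threshold, not from the paper
    min_english_score: float = 0.5,  # illustrative threshold, not from the paper
) -> List[Sample]:
    """Keep samples whose SNR and English score both meet the thresholds."""
    return [
        s for s in samples
        if s.snr_db >= min_snr_db and s.english_score >= min_english_score
    ]
```

Raising either threshold trades data volume for quality, so suitable values would need to be tuned on held-out data rather than copied from this sketch.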
