In this paper, we present a transcribed corpus of meetings of the LIBE committee of the European Parliament, totalling 3.6 million running words. The meetings of EU parliamentary committees are a potentially valuable source of information for political scientists, but the data is not readily available because it is disclosed only as speech recordings with limited metadata. The meetings are in English, spoken partly by non-native speakers and partly by interpreters. We investigated which Automatic Speech Recognition (ASR) model is most appropriate for producing accurate text transcriptions of the audio recordings, in order to make their content available for research and analysis. We focused on unsupervised domain adaptation of the ASR pipeline. Building on the transformer-based Wav2vec2.0 model, we experimented with multiple acoustic models, language models, and the addition of domain-specific terms. We found that a domain-specific acoustic model and a domain-specific language model substantially improve the ASR output, reducing the word error rate (WER) from 28.22 to 17.95. Using domain-specific terms in the decoding stage did not improve ASR quality in terms of WER. Initial topic modelling results indicate that the corpus is useful for downstream analysis tasks. We release the resulting corpus and our analysis pipeline for future research.
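Since the results above are reported as WER, a minimal sketch of how this metric is commonly computed may be a useful reference. It assumes whitespace tokenization, no text normalization, and scores expressed as percentages; the evaluation setup actually used for the corpus may differ. The function names `word_error_rate` and `relative_reduction` are illustrative and are not taken from the released pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length.

    Assumption: words are separated by whitespace and compared exactly
    (no lowercasing or punctuation stripping).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + substitution_cost,   # match or substitution
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def relative_reduction(baseline: float, adapted: float) -> float:
    """Return the relative WER reduction in percent."""
    return 100.0 * (baseline - adapted) / baseline


if __name__ == "__main__":
    # Toy sentence pair, invented for illustration only.
    print(word_error_rate("the committee meets today", "the committee meet today"))
    # Relative reduction implied by the two WER values reported above.
    print(round(relative_reduction(28.22, 17.95), 2))
```

Under these assumptions, the reported drop from 28.22 to 17.95 corresponds to a relative WER reduction of roughly 36%.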