Beyond the Utterance: An Empirical Study of Very Long Context Speech Recognition

Automatic speech recognition (ASR) models are normally trained to operate over single utterances, typically shorter than 30 seconds. This choice is due in part to computational constraints, but it also reflects a common, and often inaccurate, modelling assumption that treats utterances as independent and identically distributed samples. To use such systems on long-format audio recordings, the recordings must first be segmented into short utterances, which are then processed independently. In this work, we show that, owing to recent algorithmic and hardware advances, this is no longer necessary: current attention-based approaches can be used to train ASR systems that operate on sequences of over an hour in length. To gain a better understanding of the relationship between training/evaluation sequence length and performance, we train ASR models on large-scale data using 10 different sequence lengths, from 10 seconds up to 1 hour. The results show a benefit from using up to 21.8 minutes of context, with up to a 14.2% relative improvement over a short-context baseline in our primary experiments. By modifying various architectural components, we find that the method of encoding positional information and the model's width/depth are important factors when working with long sequences. Finally, we construct a series of evaluations using synthetic data to help analyse the model's use of context. These results show that the model uses both linguistic and acoustic aspects of the distant context.
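To make the notion of sequence length concrete, the following is a minimal sketch of how a long recording could be cut into consecutive fixed-length context windows. It is an illustration only and not the segmentation procedure used in this work: the function name `split_into_windows`, the non-overlapping layout, and the shorter final window are all assumptions.

```python
import math
from typing import List, Tuple


def split_into_windows(total_seconds: float,
                       window_seconds: float) -> List[Tuple[float, float]]:
    """Split a recording into consecutive, non-overlapping (start, end) windows.

    Hypothetical helper for illustration: the final window is simply
    shorter when the recording length is not a multiple of the window.
    """
    if total_seconds <= 0 or window_seconds <= 0:
        raise ValueError("durations must be positive")

    # Number of windows needed to cover the whole recording.
    n_windows = math.ceil(total_seconds / window_seconds)

    windows = []
    for i in range(n_windows):
        start = i * window_seconds
        end = min(start + window_seconds, total_seconds)
        windows.append((start, end))
    return windows


# A one-hour recording seen through three different context lengths.
one_hour = 3600
short_context = split_into_windows(one_hour, 30)         # utterance-scale windows
medium_context = split_into_windows(one_hour, 21.8 * 60)  # 21.8-minute windows
full_context = split_into_windows(one_hour, one_hour)     # the whole recording
print(len(short_context), len(medium_context), len(full_context))
```

With 30-second windows the model sees the hour as many independent pieces, whereas with a one-hour window the whole recording is a single sequence; the sequence lengths studied here lie between these two extremes.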
