In speech recognition, training with more than 30 seconds of acoustic context is uncommon and under-investigated in the literature. In this work, we conduct an empirical study of how scaling the sequence length used to train and evaluate (dense-attention-based) acoustic models affects speech recognition performance. For these experiments, we use a dataset of roughly 100,000 pseudo-labelled Spotify podcasts and explore context lengths from 5 seconds to 1 hour. Zero-shot evaluations are presented on the long-format datasets Earnings-22, Tedlium and Rev16. Results demonstrate a benefit from training with up to 21.8 minutes of acoustic context, with up to a 14.5% relative improvement over a baseline trained with 10 seconds of context. We find that the model's width/depth, positional encoding scheme and number of attention heads impact its ability to use longer contexts.