We introduce LRS-VoxMM, an in-the-wild benchmark for audio-visual speech recognition (AVSR). The benchmark is derived from VoxMM, a dataset of diverse real-world spoken conversations with human-annotated transcriptions. We select AVSR-suitable samples and preprocess them in an LRS-style format for direct use in existing AVSR pipelines. Compared with commonly used benchmarks, LRS-VoxMM covers a more diverse range of scenarios and acoustic conditions. We also release distorted evaluation sets with additive noise, reverberation, and bandwidth limitation to support evaluation under severe acoustic degradation. Experimental results show that LRS-VoxMM is considerably harder than LRS3 and that the contribution of visual information becomes more evident as the audio signal degrades. LRS-VoxMM supports more realistic AVSR benchmarking and encourages further research on the role of visual information in challenging real-world conditions.
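The three degradations named above (additive noise, reverberation, and bandwidth limitation) can be illustrated with a minimal NumPy sketch. This is not the released LRS-VoxMM pipeline: the function names `add_noise`, `add_reverb`, and `limit_bandwidth` are hypothetical, and the SNR level, impulse response, and cutoff frequency are assumptions chosen for illustration, since the text does not specify them.

```python
import numpy as np


def add_noise(speech, noise, snr_db):
    """Mix noise into speech at a target signal-to-noise ratio in dB."""
    # Repeat or trim the noise so it matches the speech length.
    noise = np.resize(noise, speech.shape)
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise


def add_reverb(speech, rir):
    """Convolve speech with a room impulse response, keeping the input length."""
    return np.convolve(speech, rir)[: len(speech)]


def limit_bandwidth(speech, sample_rate, cutoff_hz):
    """Remove all frequency content above cutoff_hz (ideal low-pass filter)."""
    spectrum = np.fft.rfft(speech)
    freqs = np.fft.rfftfreq(len(speech), d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(speech))


# Illustrative input: one second of a two-tone signal standing in for speech.
sample_rate = 16000  # assumed sampling rate
t = np.arange(sample_rate) / sample_rate
speech = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 6000 * t)

rng = np.random.default_rng(0)
noise = rng.standard_normal(sample_rate // 2)
# Synthetic exponentially decaying impulse response (assumed, not measured).
rir = np.exp(-np.arange(800) / 100.0) * rng.standard_normal(800)
rir[0] = 1.0

noisy = add_noise(speech, noise, snr_db=5.0)
reverberant = add_reverb(speech, rir)
narrowband = limit_bandwidth(speech, sample_rate, cutoff_hz=4000.0)
```

In practice the noise would come from recorded noise corpora and the impulse responses from measured or simulated rooms. The ideal frequency-domain low-pass is used here only because it is easy to verify; a real pipeline might instead resample the audio or apply a codec.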