Despite recent advances in speech recognition, accurately transcribing conversational and emotional speech in noisy and reverberant acoustic environments remains difficult. This is a particular challenge in the search and rescue (SAR) domain, where transcribing conversations among rescue team members is crucial for supporting real-time decision-making. The scarcity of speech data and the background noise typical of SAR scenarios make it difficult to deploy robust speech recognition systems. To address this issue, we have created and publicly released a German speech dataset called RescueSpeech, which contains real speech recordings from simulated rescue exercises. We have also released competitive training recipes and pre-trained models. Our study shows that the performance of state-of-the-art methods in this challenging scenario is still far from an acceptable level.