We present HebDB, a weakly supervised dataset for spoken language processing in Hebrew. HebDB offers roughly 2500 hours of natural, spontaneous Hebrew speech recordings covering a large variety of speakers and topics. We provide the raw recordings together with a pre-processed, weakly supervised, and filtered version. The goal of HebDB is to advance research and development of spoken language processing tools for Hebrew. To this end, we additionally provide two baseline systems for Automatic Speech Recognition (ASR): (i) a self-supervised model; and (ii) a fully supervised model. We report the performance of these two systems optimized on HebDB and compare them to current multilingual ASR alternatives. Results suggest that the proposed method outperforms the evaluated baselines at similar model sizes. Dataset, code, and models are publicly available at https://pages.cs.huji.ac.il/adiyoss-lab/HebDB/.