Progress in automatic speech recognition (ASR) has been driven largely by extensive datasets in high-resource languages, while languages such as Hungarian remain underrepresented because spontaneous and conversational corpora are scarce. To address this gap, we introduce two new datasets, BEA-Large and BEA-Dialogue, constructed from the previously unprocessed portions of the Hungarian speech corpus BEA. BEA-Large extends BEA-Base with 255 hours of spontaneous speech from 433 speakers, enriched with detailed segment-level metadata. BEA-Dialogue comprises 85 hours of spontaneous, natural conversations partitioned into speaker-independent subsets, supporting research in conversational ASR and speaker diarization. We establish reproducible baselines on these datasets using publicly available ASR models; the fine-tuned Fast Conformer model achieves word error rates as low as 14.18% on spontaneous speech and 4.8% on repeated speech. Diarization experiments yield diarization error rates between 12.46% and 17.40%, providing reference points for future improvements. The results highlight the persistent difficulty of conversational ASR, particularly due to disfluencies, overlaps, and informal speech patterns. By releasing these datasets and baselines, we aim to advance Hungarian speech technology and offer a methodological framework for developing spontaneous and conversational benchmarks in other languages.
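For readers unfamiliar with the metric, word error rate is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a minimal illustration of that definition, assuming plain whitespace tokenization and no text normalization. It is not the scoring pipeline used for the reported baselines, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly
    (no casing, punctuation, or disfluency normalization).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

Under this definition, a hypothesis that drops one word from a three-word reference scores 1/3. Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.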