End-to-end speech Named Entity Recognition (NER) aims to extract entities directly from speech. Prior work has shown that end-to-end (E2E) approaches can outperform cascaded pipelines for English, French, and Chinese, but Arabic remains under-explored because of its morphological complexity, the absence of short vowels, and limited annotated resources. We introduce CV-18 NER, the first publicly available dataset for NER from Arabic speech, created by augmenting the Arabic Common Voice 18 corpus with manual NER annotations that follow the fine-grained Wojood schema (21 entity types). We benchmark both pipeline systems (ASR followed by text NER) and E2E models based on Whisper and AraBEST-RQ. E2E systems substantially outperform the best pipeline configuration on the test set, reaching a Concept Error Rate (CoER) of 37.0% with AraBEST-RQ 300M and a Concept-Value Error Rate (CVER) of 38.0% with Whisper-medium. Further analysis shows that Arabic-specific self-supervised pretraining yields strong ASR performance, whereas multilingual weak supervision transfers more effectively to joint speech-to-entity learning, and that larger models may be harder to adapt in this low-resource setting. Our dataset and models are publicly released, providing the first open benchmark for end-to-end named entity recognition from Arabic speech: https://huggingface.co/datasets/Elyadata/CV18-NER