Recent large-scale Spoken Language Understanding (SLU) datasets focus predominantly on English and do not account for language-specific phenomena such as particular phonemes or words in different lects. We introduce ITALIC, the first large-scale speech dataset designed for intent classification in Italian. The dataset comprises 16,521 crowdsourced audio samples recorded by 70 speakers from various Italian regions and annotated with intent labels and additional metadata. We explore the versatility of ITALIC by evaluating current state-of-the-art speech and text models. Results on intent classification suggest that increasing scale and running language adaptation yield better speech models, that monolingual text models outscore multilingual ones, and that speech recognition on ITALIC is more challenging than on existing Italian benchmarks. We release both the dataset and the annotation scheme to streamline the development of new Italian SLU models and language-specific datasets.