We present Quran-MD, a comprehensive multimodal dataset of the Quran that integrates textual, linguistic, and audio dimensions at both the verse and word levels. For each verse (ayah), the dataset provides the original Arabic text, an English translation, and a phonetic transliteration. To capture the rich oral tradition of Quranic recitation, it includes verse-level audio from 32 distinct reciters, reflecting diverse recitation styles and dialectal nuances. At the word level, each token is paired with its Arabic script, English translation, transliteration, and an aligned audio recording, allowing fine-grained analysis of pronunciation, phonology, and semantic context. The dataset supports applications including natural language processing, speech recognition, text-to-speech synthesis, linguistic analysis, and digital Islamic studies. By bridging text and audio modalities across multiple reciters, it provides a unique resource for advancing computational approaches to Quranic recitation and study. Beyond enabling tasks such as automatic speech recognition (ASR), tajweed detection, and Quranic TTS, it lays the foundation for multimodal embeddings, semantic retrieval, style transfer, and personalized tutoring systems that can serve both research and community applications. The dataset is available at https://huggingface.co/datasets/Buraaq/quran-audio-text-dataset
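To make the two-level organization described above concrete, the following is a minimal Python sketch of how one verse record and its word-level entries might be represented in memory. The class names, field names, reciter keys, and file paths are all hypothetical and are not taken from the published dataset schema; the English translation shown is one common rendering and may differ from the one distributed with the dataset.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class WordEntry:
    """One word-level token (hypothetical field names)."""
    arabic: str
    translation: str
    transliteration: str
    audio_path: str  # assumed path to the aligned word-level clip


@dataclass
class VerseEntry:
    """One verse (ayah) with text, per-reciter audio, and its words."""
    surah: int
    ayah: int
    arabic: str
    translation: str
    transliteration: str
    # Maps a reciter identifier to a verse-level audio path (assumed layout).
    recitations: Dict[str, str] = field(default_factory=dict)
    words: List[WordEntry] = field(default_factory=list)


def asr_pairs(verse: VerseEntry) -> List[Tuple[str, str]]:
    """Return (audio_path, arabic_text) pairs, one per reciter, sorted by reciter id."""
    return [(path, verse.arabic) for _, path in sorted(verse.recitations.items())]


# Illustrative record for verse 1:1; reciter ids and paths are placeholders.
example = VerseEntry(
    surah=1,
    ayah=1,
    arabic="بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ",
    translation="In the name of Allah, the Most Gracious, the Most Merciful",
    transliteration="Bismi Allahi alrrahmani alrraheemi",
    recitations={
        "reciter_b": "audio/reciter_b/001001.mp3",
        "reciter_a": "audio/reciter_a/001001.mp3",
    },
    words=[
        WordEntry(
            arabic="بِسْمِ",
            translation="In (the) name",
            transliteration="bis'mi",
            audio_path="audio/words/001_001_001.mp3",
        )
    ],
)

pairs = asr_pairs(example)
```

Because every reciter shares the same verse text, a structure like this yields one text-audio pair per reciter for each verse, which is the kind of pairing that ASR or TTS training would consume.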