Quran-MD: A Fine-Grained Multilingual Multimodal Dataset of the Quran

We present Quran-MD, a comprehensive multimodal dataset of the Quran that integrates textual, linguistic, and audio dimensions at the verse and word levels. For each verse (ayah), the dataset provides the original Arabic text, an English translation, and a phonetic transliteration. To capture the rich oral tradition of Quranic recitation, we include verse-level audio from 32 distinct reciters, reflecting diverse recitation styles and dialectal nuances. At the word level, each token is paired with its Arabic script, English translation, transliteration, and an aligned audio recording, allowing fine-grained analysis of pronunciation, phonology, and semantic context. The dataset supports applications including natural language processing, speech recognition, text-to-speech synthesis, linguistic analysis, and digital Islamic studies. By bridging text and audio modalities across multiple reciters, it provides a unique resource for advancing computational approaches to Quranic recitation and study. Beyond enabling tasks such as automatic speech recognition (ASR), tajweed detection, and Quranic TTS, it lays the foundation for multimodal embeddings, semantic retrieval, style transfer, and personalized tutoring systems that can support both research and community applications. The dataset is available at https://huggingface.co/datasets/Buraaq/quran-audio-text-dataset
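
The two-level organization described in the abstract (verses carrying per-reciter audio, each verse holding word tokens with their own aligned audio) can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the class names, field names, reciter keys, and file paths below are hypothetical and are not taken from the published dataset, whose actual schema should be checked on its Hugging Face page. The sample values are likewise illustrative rather than copied from the dataset.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Word:
    """One word-level token (all field names are hypothetical)."""
    arabic: str
    translation: str
    transliteration: str
    audio_path: str  # assumed location of the aligned word-level clip


@dataclass
class Verse:
    """One verse (ayah) with text, translation, transliteration, and audio."""
    surah: int
    ayah: int
    arabic: str
    translation: str
    transliteration: str
    # Assumed mapping: reciter identifier -> path of the verse-level recording.
    reciter_audio: Dict[str, str] = field(default_factory=dict)
    words: List[Word] = field(default_factory=list)


def asr_pairs(verse: Verse) -> List[Tuple[str, str]]:
    """Return (audio_path, arabic_text) pairs, one per reciter, sorted by reciter id."""
    return [(path, verse.arabic) for _, path in sorted(verse.reciter_audio.items())]


# Illustrative record; values are placeholders, not dataset contents.
verse = Verse(
    surah=1,
    ayah=1,
    arabic="بسم الله الرحمن الرحيم",
    translation="In the name of Allah, the Most Gracious, the Most Merciful",
    transliteration="Bismillahi r-rahmani r-rahim",
    reciter_audio={
        "reciter_b": "audio/reciter_b/001001.mp3",
        "reciter_a": "audio/reciter_a/001001.mp3",
    },
    words=[
        Word("بسم", "In the name", "bismi", "audio/words/001001_1.mp3"),
        Word("الله", "of Allah", "allahi", "audio/words/001001_2.mp3"),
        Word("الرحمن", "the Most Gracious", "ar-rahmani", "audio/words/001001_3.mp3"),
        Word("الرحيم", "the Most Merciful", "ar-rahim", "audio/words/001001_4.mp3"),
    ],
)

for audio_path, text in asr_pairs(verse):
    print(audio_path, text)
```

Pairing every reciter's recording with the same verse text, as `asr_pairs` does, is one plausible way to build speech-recognition training examples from such records; the word-level entries could be used in the same manner for pronunciation-focused tasks.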
