MURAD: A Large-Scale Multi-Domain Unified Reverse Arabic Dictionary Dataset

Arabic is a linguistically and culturally rich language with a vast vocabulary that spans scientific, religious, and literary domains. Yet large-scale lexical datasets linking Arabic words to precise definitions remain limited. We present MURAD (Multi-domain Unified Reverse Arabic Dictionary), an open lexical dataset with 96,243 word-definition pairs. The data come from trusted reference works and educational sources. Extraction used a hybrid pipeline that integrates direct text parsing, optical character recognition, and automated reconstruction, a combination chosen to keep the extracted entries accurate and clear. Each record aligns a target word with its standardized Arabic definition and with metadata that identifies the source domain. The dataset covers terms from linguistics, Islamic studies, mathematics, physics, psychology, and engineering. It supports computational linguistics and lexicographic research, with applications that include reverse dictionary modeling, semantic retrieval, and educational tools. By releasing this resource, we aim to advance Arabic natural language processing and promote reproducible research on Arabic lexical semantics.
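To make the record structure and the reverse dictionary task concrete, the following is a minimal sketch, not the dataset's actual schema or the authors' method. The field names `word`, `definition`, and `domain` are assumptions based on the description above, the three entries are invented toy examples rather than rows from MURAD, and token-overlap (Jaccard) ranking stands in for whatever retrieval model one would really use.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One word-definition pair. Field names are hypothetical."""
    word: str        # target word
    definition: str  # standardized Arabic definition
    domain: str      # source-domain metadata


# Invented toy entries for illustration only; not taken from MURAD.
ENTRIES = [
    Entry("مثلث", "شكل هندسي له ثلاثة أضلاع", "mathematics"),
    Entry("سرعة", "المسافة المقطوعة في وحدة الزمن", "physics"),
    Entry("ذاكرة", "قدرة العقل على حفظ المعلومات واسترجاعها", "psychology"),
]

# A free-text description for which we want the matching word.
QUERY = "شكل له ثلاثة أضلاع"


def tokenize(text: str) -> set[str]:
    """Whitespace tokenization; real Arabic text would need normalization."""
    return set(text.split())


def reverse_lookup(query: str, entries: list[Entry], top_k: int = 3) -> list[Entry]:
    """Rank entries by Jaccard overlap between the query and each definition."""
    query_tokens = tokenize(query)
    scored = []
    for entry in entries:
        definition_tokens = tokenize(entry.definition)
        union = query_tokens | definition_tokens
        score = len(query_tokens & definition_tokens) / len(union) if union else 0.0
        scored.append((score, entry))
    # Highest score first; drop entries that share no token with the query.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for score, entry in scored[:top_k] if score > 0]


if __name__ == "__main__":
    for match in reverse_lookup(QUERY, ENTRIES):
        print(match.word, match.domain)
```

A reverse dictionary maps a description to a word, so the lookup scores definitions against the query instead of matching on the headword. Lexical overlap is only a baseline: it misses paraphrases that share no surface tokens, which is where semantic retrieval would come in.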
