Arabic is a linguistically and culturally rich language whose vast vocabulary spans scientific, religious, and literary domains. Yet large-scale lexical datasets that link Arabic words to precise definitions remain scarce. We present MURAD (Multi-domain Unified Reverse Arabic Dictionary), an open lexical dataset of 96,243 word-definition pairs drawn from trusted reference works and educational sources. The pairs were extracted with a hybrid pipeline that combines direct text parsing, optical character recognition, and automated reconstruction, a design intended to keep the extracted definitions accurate and clear. Each record aligns a target word with its standardized Arabic definition and with metadata identifying the source domain. The dataset covers terms from linguistics, Islamic studies, mathematics, physics, psychology, and engineering, and it supports research in computational linguistics and lexicography. Applications include reverse dictionary modeling, semantic retrieval, and educational tools. By releasing this resource, we aim to advance Arabic natural language processing and promote reproducible research on Arabic lexical semantics.
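To make the record structure and the reverse dictionary application more concrete, the following minimal sketch models a record as a word, a definition, and a domain label, and ranks records by token overlap between a query description and each definition. The field names (`word`, `definition`, `domain`), the `Entry` class, the `reverse_lookup` function, and the two sample entries are assumptions made for illustration only; they are not taken from the dataset or its release format, and the Jaccard scoring is a simple stand-in rather than the modeling approach the dataset is meant to support.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One word-definition pair. Field names are hypothetical."""
    word: str        # target word
    definition: str  # standardized Arabic definition
    domain: str      # source-domain label (metadata)


def tokenize(text):
    """Split on whitespace; a real system would normalize Arabic text first."""
    return set(text.split())


def reverse_lookup(query, entries, top_k=1):
    """Return up to top_k entries whose definitions best overlap the query.

    Scoring is plain Jaccard similarity over whitespace tokens, used here
    only as a simple baseline for illustration.
    """
    query_tokens = tokenize(query)
    scored = []
    for entry in entries:
        definition_tokens = tokenize(entry.definition)
        union = query_tokens | definition_tokens
        score = len(query_tokens & definition_tokens) / len(union) if union else 0.0
        scored.append((score, entry))
    # Highest score first; drop entries with no overlap at all.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for score, entry in scored[:top_k] if score > 0]


# Invented sample entries, not drawn from MURAD.
SAMPLE = [
    Entry("مثلث", "شكل هندسي له ثلاثة أضلاع", "mathematics"),
    Entry("فعل", "كلمة تدل على حدث مقترن بزمن", "linguistics"),
]

if __name__ == "__main__":
    for hit in reverse_lookup("شكل له ثلاثة أضلاع", SAMPLE):
        print(hit.word, hit.domain)
```

Token overlap ignores morphology and synonymy, so it serves only as a baseline; embedding-based semantic retrieval would be the more natural fit for the applications named in the abstract.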