Instruction tuning has emerged as a prominent methodology for teaching Large Language Models (LLMs) to follow instructions. However, current instruction datasets predominantly cater to English or are derived from English-dominated LLMs, resulting in inherent biases toward Western culture. This bias significantly impacts the linguistic structures of non-English languages such as Arabic, which has a distinct grammar reflective of the diverse cultures across the Arab region. This paper addresses this limitation by introducing CIDAR (https://hf.co/datasets/arbml/CIDAR), the first open Arabic instruction-tuning dataset culturally aligned by human reviewers. CIDAR contains 10,000 instruction and output pairs that represent the Arab region. We discuss the cultural relevance of CIDAR through analysis and comparison with models fine-tuned on other datasets. Our experiments show that CIDAR can help enrich research efforts in aligning LLMs with Arabic culture. All the code is available at https://github.com/ARBML/CIDAR.