Retrieval-Augmented Generation (RAG) has recently emerged as a powerful technique in natural language processing, combining the strengths of retrieval-based and generation-based models to improve text generation. However, its application to Arabic, a language with unique characteristics and resource constraints, remains underexplored. This paper presents a comprehensive case study on the implementation and evaluation of RAG for Arabic text. The work explores various semantic embedding models in the retrieval stage and several large language models (LLMs) in the generation stage, in order to investigate what works and what does not in the context of Arabic. It also touches on the issue of variation between the dialect of the documents and the dialect of the queries in the retrieval stage. Results show that existing semantic embedding models and LLMs can be effectively employed to build Arabic RAG pipelines.