Retrieval-augmented generation (RAG) models, which combine large-scale pre-trained generative models with external retrieval mechanisms, have shown significant success in a variety of natural language processing (NLP) tasks. However, applying RAG models to Persian, a low-resource language, poses distinct challenges. These challenges arise primarily in the preprocessing, embedding, retrieval, prompt construction, language modeling, and response evaluation stages of the system. In this paper, we address the challenges of implementing a real-world RAG system for the Persian language, called PersianRAG. We propose novel solutions to overcome these obstacles and evaluate our approach on several Persian benchmark datasets. Our experimental results demonstrate that the PersianRAG framework can improve question answering in Persian.