This paper examines the specific obstacles to building Retrieval-Augmented Generation (RAG) systems for low-resource languages, with a focus on Persian's complex morphology and flexible syntax. To improve retrieval and generation accuracy, it introduces two Persian-specific models, MatinaRoberta (a masked language model) and MatinaSRoberta (a fine-tuned Sentence-BERT), along with a comprehensive benchmarking framework. The models were trained on a varied corpus of 73.11 billion Persian tokens and then evaluated on three datasets: general knowledge (PQuad), specialized scientific texts, and organizational reports. The methodology involved extensive pretraining, fine-tuning with tailored loss functions, and systematic evaluation using both traditional metrics and the Retrieval-Augmented Generation Assessment (RAGAS) framework. The results show that MatinaSRoberta outperformed previous embedding models, achieving better contextual relevance and retrieval accuracy across all three datasets. Temperature tuning, chunk-size adjustments, and document summary indexing were explored as ways to improve RAG configurations. Larger models such as Llama-3.1 (70B) consistently achieved the highest generation accuracy, while smaller models struggled with domain-specific and formal contexts. The findings underscore the potential of building Persian RAG systems with customized embeddings and tuned retrieval-generation settings, and they point to improvements in NLP applications such as search engines and legal document analysis for low-resource languages.
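As a rough illustration of the retrieval step that chunk size and embedding quality affect, the following minimal sketch splits a document into fixed-size chunks and ranks them against a query by cosine similarity. It is not the paper's implementation: `chunk_text`, `toy_embed`, `cosine`, and `retrieve` are hypothetical names, and the bag-of-words `toy_embed` is only a stand-in for a real sentence encoder such as MatinaSRoberta.

```python
import math
from collections import Counter


def chunk_text(text, chunk_size):
    """Split text into consecutive chunks of at most chunk_size whitespace tokens."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]


def toy_embed(text):
    """Hypothetical stand-in for a sentence encoder: a sparse bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, top_k=2):
    """Return the top_k (score, chunk) pairs, highest similarity first."""
    query_vec = toy_embed(query)
    scored = [(cosine(query_vec, toy_embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

In a real pipeline the toy encoder would be replaced by a trained embedding model, and the chunk size passed to `chunk_text` would be one of the settings varied during evaluation.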