Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization

This paper examines the specific obstacles to building Retrieval-Augmented Generation (RAG) systems for low-resource languages, focusing on Persian's complex morphology and flexible syntax. The research aims to improve retrieval and generation accuracy by introducing two Persian-specific models, MatinaRoberta (a masked language model) and MatinaSRoberta (a fine-tuned Sentence-BERT), along with a comprehensive benchmarking framework. The models were trained on a varied corpus of 73.11 billion Persian tokens and then evaluated on three datasets: general knowledge (PQuad), specialized scientific texts, and organizational reports. The methodology involved extensive pretraining, fine-tuning with tailored loss functions, and systematic evaluation using both traditional metrics and the Retrieval-Augmented Generation Assessment framework. The results show that MatinaSRoberta outperformed previous embeddings, achieving better contextual relevance and retrieval accuracy on all three datasets. Temperature tuning, chunk size adjustments, and document summary indexing were explored as ways to improve RAG configurations. Larger models such as Llama-3.1 (70B) consistently achieved the highest generation accuracy, while smaller models struggled with domain-specific and formal contexts. The findings underscore the potential of building Persian RAG systems with customized embeddings and tuned retrieval-generation settings, and point to improvements in NLP applications such as search engines and legal document analysis for low-resource languages.
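To make the chunk-size and retrieval settings mentioned in the abstract concrete, the following is a minimal sketch of the retrieval half of a RAG pipeline, not the authors' implementation. The function names (`chunk_words`, `toy_embed`, `cosine`, `retrieve`) are hypothetical, and `toy_embed` is a plain word-count stand-in for a real sentence encoder such as MatinaSRoberta, whose API is not described in the abstract.

```python
import math
from collections import Counter


def chunk_words(text, chunk_size, overlap=0):
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words. Both values are the kind of
    setting one would tune; the defaults here are illustrative assumptions.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def toy_embed(text):
    """Placeholder embedding: lowercase word counts (NOT a sentence encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, top_k=2, embed=toy_embed):
    """Return the `top_k` (score, chunk) pairs most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

In a real system the `embed` argument would be replaced by a call to a trained sentence-embedding model, and the retrieved chunks would be passed to the generator as context. Smaller chunks tend to give more precise matches but less context per hit, which is why chunk size is treated as a tunable setting.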

