Retrieval-Augmented Generation (RAG) is a powerful approach for enhancing the factual grounding of language models by integrating external knowledge. While widely studied for large language models, the optimization of RAG for Small Language Models (SLMs) remains a critical research gap, particularly in complex, multi-hop question-answering tasks that require sophisticated reasoning. In these systems, prompt template design is a crucial yet under-explored factor influencing performance. This paper presents a large-scale empirical study of this factor, evaluating 24 prompt templates on the HotpotQA dataset. The set comprises a standard RAG prompt, nine well-established techniques from the literature, and 14 novel hybrid variants, all tested on two prominent SLMs: Qwen2.5-3B Instruct and Gemma3-4B-It. Our findings, based on a test set of 18,720 instances, reveal peak performance of 83% on Qwen2.5-3B Instruct and 84.5% on Gemma3-4B-It, an improvement of up to 6% for both models over the standard RAG prompt. This research also offers concrete analysis and actionable recommendations for designing effective and efficient prompts for SLM-based RAG systems, particularly for deployment in resource-constrained environments.
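To make the baseline concrete, the following is a minimal sketch of the kind of standard RAG prompt template such a study takes as its reference point; the wording and structure here are illustrative assumptions, not the paper's exact template.

```python
# Illustrative "standard RAG" prompt template: retrieved passages are
# concatenated into a context block, followed by the question.
# The exact phrasing is an assumption for illustration only.

def build_rag_prompt(question: str, passages: list[str]) -> str:
    """Assemble retrieved passages and a question into one prompt string."""
    context = "\n\n".join(
        f"[Passage {i + 1}] {p}" for i, p in enumerate(passages)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Example with a multi-hop question in the style of HotpotQA.
prompt = build_rag_prompt(
    "Which magazine was started first, Arthur's Magazine or First for Women?",
    [
        "Arthur's Magazine (1844-1846) was an American literary periodical.",
        "First for Women is a woman's magazine launched in 1989.",
    ],
)
print(prompt)
```

Hybrid variants of the kind the study evaluates would typically modify only this template text (for example, adding reasoning instructions) while keeping the retrieval and generation pipeline fixed, which is what isolates the prompt template as the experimental variable.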