Evaluations of Retrieval-Augmented Generation (RAG) systems typically examine retrieval quality and generation parameters such as temperature in isolation, overlooking how the two interact. This work systematically investigates how text perturbations, which simulate noisy retrieval, interact with temperature settings across multiple LLM runs. We propose a RAG Perturbation-Temperature Analysis Framework that applies three distinct perturbation types to retrieved documents and evaluates each one at a range of temperature settings. In experiments on HotpotQA with both open-source and proprietary LLMs, we show that performance degradation follows distinct patterns: high-temperature settings consistently amplify vulnerability to perturbations, while certain perturbation types exhibit non-linear sensitivity across the temperature range. Our work makes three contributions: (1) a diagnostic benchmark for assessing RAG robustness, (2) an analytical framework for quantifying perturbation-temperature interactions, and (3) practical guidelines for model selection and parameter tuning under noisy retrieval conditions.
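The perturbation-by-temperature design described above can be pictured as a grid sweep. The following is only a minimal sketch under stated assumptions, not the framework itself: the abstract does not name its three perturbation types, so `drop_words`, `shuffle_sentences`, and `swap_chars` are placeholder perturbations, and `generate` is a hypothetical callable standing in for an LLM call. Scoring by substring match is likewise an assumption made to keep the example short.

```python
import random
from itertools import product
from typing import Callable, Dict, List, Tuple

# Placeholder perturbations (assumptions; the abstract does not name its three types).
def drop_words(text: str, rng: random.Random, rate: float = 0.2) -> str:
    """Randomly delete roughly `rate` of the words."""
    kept = [w for w in text.split() if rng.random() >= rate]
    return " ".join(kept) if kept else text

def shuffle_sentences(text: str, rng: random.Random) -> str:
    """Randomly reorder the sentences of the document."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."

def swap_chars(text: str, rng: random.Random, rate: float = 0.1) -> str:
    """Swap adjacent characters with probability `rate` to mimic typos."""
    chars = list(text)
    for i in range(len(chars) - 1):
        if rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

PERTURBATIONS: Dict[str, Callable[[str, random.Random], str]] = {
    "none": lambda text, rng: text,  # unperturbed baseline
    "drop_words": drop_words,
    "shuffle_sentences": shuffle_sentences,
    "swap_chars": swap_chars,
}

# Hypothetical model interface: (question, context, temperature) -> answer.
Generate = Callable[[str, str, float], str]

def run_grid(question: str, document: str, gold: str, generate: Generate,
             temperatures: List[float], runs: int = 3,
             seed: int = 0) -> Dict[Tuple[str, float], float]:
    """Return accuracy for every (perturbation, temperature) cell of the grid."""
    results: Dict[Tuple[str, float], float] = {}
    for (name, perturb), temp in product(PERTURBATIONS.items(), temperatures):
        correct = 0
        for run in range(runs):
            rng = random.Random(seed + run)  # one reproducible stream per run
            answer = generate(question, perturb(document, rng), temp)
            correct += int(gold.lower() in answer.lower())
        results[(name, temp)] = correct / runs
    return results

def echo_model(question: str, context: str, temperature: float) -> str:
    """Stub standing in for an LLM: it ignores temperature and echoes the context."""
    return context

if __name__ == "__main__":
    doc = "Paris is the capital of France. It lies on the Seine."
    grid = run_grid("What is the capital of France?", doc, "Paris",
                    echo_model, temperatures=[0.0, 0.7, 1.2])
    for cell, accuracy in sorted(grid.items()):
        print(cell, accuracy)
```

Repeating each cell over several runs matters because sampling at non-zero temperature is stochastic; a real study would replace `echo_model` with an actual model call and the substring check with the benchmark's own answer metric.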