We consider the problem of quantifying how an input perturbation impacts the outputs of large language models (LLMs), a fundamental task for model reliability and post-hoc interpretability. A key obstacle in this domain is disentangling meaningful changes in model responses from the intrinsic stochasticity of LLM outputs. To overcome this, we introduce Distribution-Based Perturbation Analysis (DBPA), a framework that reformulates LLM perturbation analysis as a frequentist hypothesis testing problem. DBPA constructs empirical null and alternative output distributions within a low-dimensional semantic similarity space via Monte Carlo sampling. Comparing Monte Carlo estimates in this reduced-dimensionality space enables tractable frequentist inference without relying on restrictive distributional assumptions. The framework is model-agnostic, supports the evaluation of arbitrary input perturbations on any black-box LLM, yields interpretable p-values, supports testing of multiple perturbations with controlled error rates, and provides scalar effect sizes for any chosen similarity or distance metric. We demonstrate the effectiveness of DBPA in evaluating perturbation impacts and show its versatility for perturbation analysis.
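To make the recipe concrete, below is a minimal sketch of a DBPA-style test, assuming responses to the original and perturbed prompts have already been sampled and embedded with some sentence-embedding model. The null distribution is built from similarities between independent resamples of the original prompt (capturing intrinsic output stochasticity), the alternative from original-versus-perturbed similarities, and a one-sided permutation test supplies the p-value. All names and the synthetic-embedding demo are illustrative assumptions, not the paper's released implementation.

```python
import numpy as np

def cosine_sims(A, B):
    """Row-aligned cosine similarities between two (n, d) embedding arrays."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return np.sum(A * B, axis=1)

def dbpa_test(null_sims, alt_sims, n_perm=10_000, seed=0):
    """Permutation test: are perturbed responses less similar to the originals
    than fresh resamples of the same prompt are to each other?"""
    rng = np.random.default_rng(seed)
    observed = null_sims.mean() - alt_sims.mean()  # scalar effect size
    pooled = np.concatenate([null_sims, alt_sims])
    n = len(null_sims)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        if perm[:n].mean() - perm[n:].mean() >= observed:
            count += 1
    p_value = (count + 1) / (n_perm + 1)  # add-one smoothing keeps p > 0
    return observed, p_value

# Demo with synthetic embeddings standing in for embedded LLM responses.
rng = np.random.default_rng(0)
orig = rng.normal(size=(40, 16)) + 1.0   # responses to the original prompt
orig2 = rng.normal(size=(40, 16)) + 1.0  # fresh resamples, same prompt
pert = rng.normal(size=(40, 16)) - 1.0   # a perturbation that shifts responses

effect, p = dbpa_test(cosine_sims(orig, orig2), cosine_sims(orig, pert))
print(f"effect size = {effect:.3f}, p = {p:.4f}")
```

Because the test is a permutation test over similarity scores, it makes no parametric assumption about the output distribution, matching the abstract's claim of inference without restrictive distributional assumptions; the cosine metric could be swapped for any similarity or distance measure. When several perturbations are tested, the resulting p-values can be corrected with a standard multiple-testing procedure (e.g. Bonferroni or Benjamini-Hochberg) to control the error rate.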