Explaining closed-source Large Language Model (LLM) outputs is challenging because API-only access precludes gradient-based attribution, while perturbation methods are costly and noisy when they depend on regenerated text. We introduce \textbf{Rotary Positional Embedding Linear Local Interpretable Model-agnostic Explanations (RoPE-LIME)}, an open-source extension of gSMILE that decouples reasoning from explanation: given a fixed output from a closed model, a smaller open-source surrogate computes token-level attributions from probability-based objectives (negative log-likelihood and divergence targets) under input perturbations. RoPE-LIME incorporates (i) a locality kernel based on Relaxed Word Mover's Distance computed in \textbf{RoPE embedding space}, which keeps similarity estimates stable under masking, and (ii) \textbf{Sparse-$K$} sampling, an efficient perturbation strategy that improves interaction coverage under limited budgets. Experiments on HotpotQA (sentence features) and a hand-labeled MMLU subset (word features) show that RoPE-LIME produces more informative attributions than leave-one-out sampling and outperforms gSMILE while substantially reducing closed-model API calls.
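To make the Sparse-$K$ idea concrete, the following is a minimal sketch of one plausible realization; the function name, budget handling, and the choice to exhaust low-order masks before random sampling are our assumptions, not the paper's exact algorithm. The sampler spends a fixed perturbation budget on masks that drop at most $K$ features (singletons first, then pairs), so low-order interactions are covered before any budget goes to random higher-order masks.

```python
import itertools
import random

def sparse_k_masks(n_features, budget, k_max=2, seed=0):
    """Hypothetical Sparse-K perturbation sampler (illustrative only).

    Enumerates binary masks that drop exactly 1, then exactly 2, ...,
    up to k_max features, and fills any remaining budget with uniform
    random masks. 1 = feature kept, 0 = feature masked out.
    """
    rng = random.Random(seed)
    masks = [[1] * n_features]  # all-ones reference (nothing removed)
    # Low-order masks: systematic coverage of single and pairwise drops.
    for k in range(1, k_max + 1):
        for combo in itertools.combinations(range(n_features), k):
            if len(masks) >= budget:
                return masks
            mask = [1] * n_features
            for i in combo:
                mask[i] = 0
            masks.append(mask)
    # Leftover budget: random masks for higher-order interactions.
    while len(masks) < budget:
        masks.append([rng.randint(0, 1) for _ in range(n_features)])
    return masks
```

Under a budget of 10 masks over 5 features, the sampler emits the reference mask, all 5 single-feature drops, and 4 of the 10 pairwise drops, whereas uniform random sampling of the same budget would leave most singleton effects unidentified.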