Heteroskedastic Signals in Budgeted LLM Verification: Structural Heterogeneity Limits Optimization Gains

Large language model (LLM) systems increasingly use uncertainty signals to allocate limited computation across verification, test-time scaling, tool execution, and other selective-compute decisions. Such policies rely on a \emph{global signal comparability assumption}: equal scores should carry comparable decision value across inputs. Using budgeted verification as a controlled diagnostic setting, we identify a failure mode of this assumption: uncertainty quality is heteroskedastic across cost strata, with some regions exhibiting near-random discriminability despite concentrating many errors. Under an explicit local model, we characterize the resulting distortion of global allocation and derive an upper bound on it that scales with cross-stratum signal-quality dispersion. We separate weak signals, optimization instability, and structural heterogeneity through a controlled intervention hierarchy: Threshold, MP-Adapt, MP-Strat, and a deliberately simple cost-stratified thresholding intervention (CST). On MBPP and MATH with Qwen3-8B, LLaMA3-8B, and GPT-4o-mini, global online adaptation yields inconsistent gains over static thresholding; MP-Strat partially recovers performance, while CST improves hit rate by up to 17 percentage points in strongly heterogeneous settings without gradient updates. These results identify structural heterogeneity, rather than optimizer weakness alone, as the primary bottleneck in the observed settings. More broadly, misaligned feedback structure cannot always be repaired by stronger optimization.
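The failure mode described above can be illustrated with a minimal sketch. It measures how well an uncertainty score separates errors from non-errors within each cost stratum, then compares one global threshold against per-stratum thresholds at the same verification budget. Everything here is an assumption made for illustration: the stratum names, the toy scores and labels, the hand-picked thresholds, the use of AUROC as the signal-quality measure, and the definition of hit rate as the fraction of verified items that are actual errors. It is not the paper's CST procedure or its data.

```python
from collections import defaultdict


def auroc(scores, labels):
    """Probability that a random error (label 1) outscores a random
    non-error (label 0); ties count as 0.5. None if a class is missing."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def per_stratum_auroc(records):
    """records: iterable of (stratum, uncertainty_score, is_error)."""
    groups = defaultdict(lambda: ([], []))
    for stratum, score, is_error in records:
        groups[stratum][0].append(score)
        groups[stratum][1].append(is_error)
    return {k: auroc(s, y) for k, (s, y) in groups.items()}


def hit_rate(records, thresholds):
    """Verify every item whose score reaches its stratum's threshold;
    return the fraction of verified items that are errors (assumed metric)."""
    chosen = [r for r in records if r[1] >= thresholds[r[0]]]
    return sum(r[2] for r in chosen) / len(chosen) if chosen else 0.0


# Toy data (invented): scores are informative in "cheap", near-random in
# "costly", yet "costly" scores are uniformly higher.
RECORDS = [
    ("cheap", 0.10, 0), ("cheap", 0.20, 0), ("cheap", 0.30, 0),
    ("cheap", 0.60, 1), ("cheap", 0.70, 1),
    ("costly", 0.90, 1), ("costly", 0.80, 1),
    ("costly", 0.85, 0), ("costly", 0.95, 0), ("costly", 0.75, 0),
]

AUROCS = per_stratum_auroc(RECORDS)
DISPERSION = max(AUROCS.values()) - min(AUROCS.values())

# Both policies verify exactly three items.
GLOBAL_HIT = hit_rate(RECORDS, {"cheap": 0.84, "costly": 0.84})
STRATIFIED_HIT = hit_rate(RECORDS, {"cheap": 0.50, "costly": 0.92})

print(AUROCS, DISPERSION, GLOBAL_HIT, STRATIFIED_HIT)
```

In this toy setup the global threshold spends the whole budget in the stratum where the score carries no ranking information, while the per-stratum thresholds redirect most of it to the stratum where equal scores are actually comparable.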
