When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference

Majority voting over sampled answers is the dominant unsupervised aggregator for multi-sample LLM inference. We show that a delegation-based aggregator (Propagational Proxy Voting, PPV; Sakai et al., 2025) yields an unsupervised consensus rule that outperforms majority voting on MMLU-Pro by 1.5 pp overall and by 2.24 pp on the non-trivial subset (paired McNemar test, p ≈ 1.0e-14, n = 8,099). Majority voting discards two signals that every sample carries: within-group letter entropy and between-group reasoning geometry. PPV exposes two per-voter levers that consume exactly these signals: When (how much weight a voter keeps on its own pick) and Whom (how it splits the remainder across peers). We drive When with letter entropy and Whom with the cosine similarity of per-question-centered embeddings. The method needs no gold labels and no auxiliary training: for each question, we partition 128 sampled generations into 16 groups, compute each group's letter-level semantic entropy and reasoning-embedding centroid, and feed both into a stochastic delegation matrix whose stationary distribution selects the consensus answer. We walk through an example in which PPV overturns a clear 10-6 majority for the wrong letter: the 10-voter majority cluster is geometrically incoherent (mean within-cluster cosine -0.02), whereas the 6-voter minority is tight (+0.26), so propagated delegation mass concentrates on the minority's answer even though entropy alone would keep the majority ahead. We further report negative results for several delegation strategies; these constrain the design space for unsupervised LLM aggregation. No within-question ensemble of confidence modes closes the oracle gap.
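The pipeline summarized above (groups, letter entropy, embedding centroids, delegation matrix, stationary distribution) can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify the exact mappings, so the following are assumptions made only for illustration: the function name `ppv_consensus`, the linear self-weight `1 - H / log(num_options)`, clipping negative cosines to zero before normalizing, taking each group's modal letter as its pick, and approximating the stationary distribution by power iteration from a uniform start.

```python
import math
from collections import Counter


def letter_entropy(letters):
    """Shannon entropy (in nats) of the answer letters within one group."""
    n = len(letters)
    return -sum((c / n) * math.log(c / n) for c in Counter(letters).values())


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def ppv_consensus(group_letters, group_embeddings, num_options=10, iters=200):
    """Hypothetical delegation-based aggregator for one question.

    group_letters:    one list of sampled answer letters per group
    group_embeddings: one list of reasoning-embedding vectors per group
    num_options:      assumed size of the answer alphabet (sets max entropy)
    """
    g = len(group_letters)
    # Each group's pick is its modal letter (assumption).
    picks = [Counter(ls).most_common(1)[0][0] for ls in group_letters]

    # "When": self-weight shrinks linearly with letter entropy (assumption).
    h_max = math.log(num_options)
    keep = [1.0 - letter_entropy(ls) / h_max for ls in group_letters]

    # "Whom": cosine between per-question-centered group centroids.
    cents = [centroid(vs) for vs in group_embeddings]
    mean = centroid(cents)
    centered = [[x - m for x, m in zip(c, mean)] for c in cents]

    # Build a row-stochastic delegation matrix P.
    P = []
    for i in range(g):
        sims = [0.0 if j == i else max(cosine(centered[i], centered[j]), 0.0)
                for j in range(g)]
        total = sum(sims)
        if total == 0.0:
            # No positively aligned peer: the voter keeps all of its weight.
            row = [1.0 if j == i else 0.0 for j in range(g)]
        else:
            row = [(1.0 - keep[i]) * s / total for s in sims]
            row[i] = keep[i]
        P.append(row)

    # Approximate the stationary distribution with a fixed number of
    # power-iteration steps, starting from uniform mass.
    pi = [1.0 / g] * g
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(g)) for j in range(g)]

    # Sum the propagated mass per answer letter and pick the heaviest.
    mass = {}
    for weight, letter in zip(pi, picks):
        mass[letter] = mass.get(letter, 0.0) + weight
    return max(mass, key=mass.get), mass
```

In the setting described above, `group_letters` would hold 16 lists of 8 letters each. Under these assumed mappings, a group with zero letter entropy keeps all of its mass, so confident groups act as sinks for the mass delegated by uncertain, geometrically aligned peers. This is one way a minority answer can overtake a larger but less coherent majority.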
