Asymptotically Optimal Sequential Testing with Heterogeneous LLMs

We study a Bayesian binary sequential hypothesis testing problem with multiple large language models (LLMs). Each LLM $j$ has a per-query cost $c_j>0$, a random waiting time with mean $μ_j>0$ and sub-Gaussian tails, and \emph{asymmetric} accuracies: the probability of returning the correct label depends on the true hypothesis $θ\in\{A,B\}$ and need not be the same under $A$ and $B$. This asymmetry induces two distinct information rates $(I_{j,A}, I_{j,B})$ per LLM, one under each hypothesis. The decision-maker chooses LLMs sequentially, observes their noisy binary answers, and stops when the posterior probability of one hypothesis exceeds $1-α$. The objective is to minimize the sum of expected query cost and expected waiting cost, $\mathbb{E}[C_π] + \mathbb{E}[g(W_π)]$, where $C_π$ is the total query cost, $W_π$ is the total waiting time, and $g$ is a polynomial function (e.g., $g(x)=x^ρ$ with $ρ\ge 1$). We prove that as the error tolerance $α\to 0$, the optimal policy is asymptotically equivalent to one that uses at most two LLMs. In this asymmetric setting, a single-LLM policy is \emph{not} generically optimal: optimality requires exploiting a two-dimensional tradeoff between information under $A$ and information under $B$. Any admissible policy induces an expected information-allocation vector in $\mathbb{R}_+^2$, and we show that the optimal allocation lies at an extreme point of the associated convex set when $α$ is sufficiently small, and hence uses at most two LLMs. We construct belief-dependent policies that first mix between two LLMs when the posterior is ambiguous, and then switch to a single "specialist" LLM when the posterior is sufficiently close to one of the hypotheses. These policies match the universal lower bound up to a $(1+o(1))$ factor as $α\to 0$.
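The stopping rule and the two per-hypothesis information rates can be illustrated with a minimal Python sketch. It rests on assumptions that the abstract does not state: the information rates are taken to be Kullback-Leibler divergences between the answer distributions under $A$ and $B$, only a single LLM is queried, and costs and waiting times are ignored. The names `info_rates`, `run_test`, `p_a`, and `p_b` are hypothetical and chosen for illustration only.

```python
import math
import random


def info_rates(p_a, p_b):
    """Return (I_A, I_B) for one binary answer of a single LLM.

    p_a: probability the answer is "A" when the truth is A.
    p_b: probability the answer is "B" when the truth is B.
    Assumption: the rates are KL divergences between the two
    answer distributions (not confirmed by the abstract).
    """
    def kl(p, q):
        return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

    # Under A the answer is "A" w.p. p_a; under B it is "A" w.p. 1 - p_b.
    return kl(p_a, 1 - p_b), kl(1 - p_b, p_a)


def run_test(truth, p_a, p_b, alpha, prior_a=0.5, rng=None):
    """Query one LLM until some posterior exceeds 1 - alpha.

    Returns (decision, number_of_queries). Costs and waiting
    times are deliberately left out of this sketch.
    """
    rng = rng or random.Random(0)
    log_odds = math.log(prior_a / (1 - prior_a))  # log P(A) / P(B)
    threshold = math.log((1 - alpha) / alpha)     # posterior > 1 - alpha
    n = 0
    while abs(log_odds) <= threshold:
        p_says_a = p_a if truth == "A" else 1 - p_b
        if rng.random() < p_says_a:
            # Answer "A": Bayes update with likelihood ratio p_a / (1 - p_b).
            log_odds += math.log(p_a / (1 - p_b))
        else:
            # Answer "B": likelihood ratio (1 - p_a) / p_b.
            log_odds += math.log((1 - p_a) / p_b)
        n += 1
    return ("A" if log_odds > 0 else "B"), n


if __name__ == "__main__":
    print(info_rates(0.9, 0.6))
    print(run_test("A", 0.9, 0.6, alpha=0.01))
```

With asymmetric accuracies such as `p_a = 0.9` and `p_b = 0.6`, the two rates returned by `info_rates` differ, which is the source of the two-dimensional tradeoff described above; with equal accuracies they coincide.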
