Suppose Alice has a distribution $P$ and Bob has a distribution $Q$. Alice wants to generate a sample $a \sim P$ and Bob a sample $b \sim Q$ such that $a = b$ with as high a probability as possible. It is well known that, by sampling from an optimal coupling between the distributions, Alice and Bob can achieve $\Pr[a = b] = 1 - D_{TV}(P,Q)$, where $D_{TV}(P,Q)$ is the total variation distance. What if Alice and Bob must solve this same problem without communicating at all? Perhaps surprisingly, with access to public randomness, they can still achieve $\Pr[a = b] \geq \frac{1 - D_{TV}(P,Q)}{1 + D_{TV}(P,Q)} \geq 1 - 2D_{TV}(P,Q)$. In fact, this bound can be obtained using a simple protocol based on the Weighted MinHash algorithm. In this work, we explore communication-free coupling in greater depth. First, we show that an equally simple protocol based on Gumbel sampling matches the worst-case guarantees of the Weighted MinHash approach, but tends to perform better in practice. Conversely, we prove that both approaches are actually sharp: no communication-free protocol can achieve $\Pr[a = b] > \frac{1 - D_{TV}(P,Q)}{1 + D_{TV}(P,Q)}$ in the worst case. Finally, we prove that, for distributions over $n$ items, there exists a scheme that uses just $O(\log(n/\epsilon))$ bits of communication to achieve $\Pr[a = b] = 1 - D_{TV}(P,Q) - \epsilon$, i.e., to essentially match optimal coupling. Beyond our theoretical results, we demonstrate an application of communication-free coupling to speculative decoding, a recent method for accelerating autoregressive large language models [Leviathan, Kalman, Matias, ICML 2023]. We show that communication-free protocols yield a variant of speculative decoding that we call Drafter-Invariant Speculative Decoding, which has the desirable property that the output of the method is fixed given a fixed random seed, regardless of what drafter is used for speculation.
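To make the communication-free setting concrete, the following is a minimal sketch of one way a Gumbel-sampling protocol could look. It is an illustration under stated assumptions, not the authors' implementation: the function names (`total_variation`, `gumbel_sample`, `estimate_agreement`), the example distributions, and the use of a seeded generator as a stand-in for public randomness are all hypothetical. Both parties see the same exponential noise values $E_i$ and each returns $\arg\min_i E_i / p_i$ for their own distribution, which is equivalent to the Gumbel-max rule $\arg\max_i (\log p_i + G_i)$ with $G_i = -\log E_i$.

```python
import random


def total_variation(p, q):
    """Total variation distance between two distributions on the same items."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def gumbel_sample(probs, noise):
    """Return argmin_i noise[i] / probs[i], skipping zero-probability items.

    If noise[i] are i.i.d. Exp(1), the winning index is distributed
    according to probs (the exponential-race form of the Gumbel-max trick).
    """
    best_index, best_value = None, float("inf")
    for i, (p, e) in enumerate(zip(probs, noise)):
        if p > 0 and e / p < best_value:
            best_index, best_value = i, e / p
    return best_index


def estimate_agreement(p, q, trials=20000, seed=0):
    """Monte Carlo estimate of Pr[a = b] when both parties share the noise."""
    rng = random.Random(seed)  # assumption: stands in for public randomness
    hits = 0
    for _ in range(trials):
        # Shared noise: one Exp(1) draw per item, visible to both parties.
        noise = [rng.expovariate(1.0) for _ in p]
        a = gumbel_sample(p, noise)  # Alice uses only P and the shared noise
        b = gumbel_sample(q, noise)  # Bob uses only Q and the shared noise
        hits += (a == b)
    return hits / trials


if __name__ == "__main__":
    # Hypothetical example distributions over three items.
    P = [0.5, 0.3, 0.2]
    Q = [0.4, 0.4, 0.2]
    d = total_variation(P, Q)
    print("D_TV(P, Q):", d)
    print("communication-free bound (1 - d) / (1 + d):", (1 - d) / (1 + d))
    print("optimal coupling 1 - d:", 1 - d)
    print("empirical Pr[a = b]:", estimate_agreement(P, Q))
```

Neither party reads the other's distribution, so no communication takes place. The empirical agreement rate should fall between the communication-free bound and the optimal-coupling value, up to sampling error.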