Matching Rates and Optimal Allocation for Federated Probe-Logit Distillation under Heterogeneous Bandwidth Budgets

In federated language modeling, $K$ nodes each hold $n$ samples but cannot pool data or exchange full-precision gradients or weights. We study the minimax rate at which a conditional distribution over $V$ tokens can be estimated when each node may upload at most $B$ bits per query in a public probe set. In federated probe-logit distillation (FPLD), each node transmits a scalar-quantized logit vector on the probe set, and an aggregator distills a global parametric student. Prior work (Dubey and Huo, 2026) establishes a high-probability KL rate $O(d/(Kn) + ρ\sqrt{V \log V / m} + K^{-1} \cdot 2^{-2B/V})$ plus optimization slack, with the bandwidth term in its trace-sharpened form. Whether this bandwidth-term rate is tight, and how the upper bound generalizes to heterogeneous per-node bandwidths, are left open. We close both gaps. First, the dithered FPLD construction has a matching single-round lower bound $Ω(K^{-1} \cdot 2^{-2B/V})$ under non-degeneracy, pinning the bandwidth-axis rate at $Θ(K^{-1} \cdot 2^{-2B/V})$. $T$-round sequential refinement with nested/scaled residual quantizers achieves $O(K^{-1} \cdot 2^{-2TB/V})$; vanilla FPLD's $T$-independent bandwidth term is suboptimal for every $T > 1$. Second, we establish a heterogeneous-bandwidth upper bound for per-node budgets $B_i$, paired with a closed-form optimal allocation $B_i^* = B_{\mathrm{tot}}/K + (V/2) \log_2(w_i / \bar{w}_g)$, a log-tilted water-filling rule that is the per-node analogue of reverse water-filling for distortion-rate optimization. A plug-in adaptive variant estimates the weights from a short warm-up phase and attains $1 + O(\sqrt{\log(K/δ)/(m T_0)})$ relative suboptimality. Synthetic n-gram simulations confirm that empirical KL is bracketed by the upper and lower bounds and that the optimal allocation strictly dominates uniform and inverse-weighted baselines under heterogeneous clipping.

翻译：在联邦语言建模中，K 个节点各自持有 n 个样本，但无法合并数据或交换全精度梯度与权重。我们研究了当每个节点在公共探针集上每次查询最多可上传 B 比特时，估计 V 个词符上的条件分布所能达到的极小极大速率。在联邦探针-对数几率蒸馏（FPLD）中，每个节点在探针集上传输一个标量量化的对数几率向量，聚合器则蒸馏出一个全局参数化学生模型。先前的工作（Dubey and Huo, 2026）建立了高概率下的 KL 散度速率 O(d/(Kn) + ρ√(V log V / m) + K^{-1}·2^{-2B/V}) 及优化松弛项，其中带宽项采用了迹锐化形式。该带宽项速率是否紧确，以及上界如何推广到异构的每节点带宽，这些问题尚未解决。我们填补了这两个空白。首先，在非退化条件下，抖动 FPLD 构造具有匹配的单轮下界 Ω(K^{-1}·2^{-2B/V})，从而将带宽轴的速率确定为 Θ(K^{-1}·2^{-2B/V})。采用嵌套/缩放残差量化器的 T 轮顺序细化可实现 O(K^{-1}·2^{-2TB/V}) 的速率；而原始 FPLD 的与 T 无关的带宽项对于任意 T > 1 均非最优。其次，我们针对每节点预算 B_i 建立了异构带宽上界，并给出了闭式最优分配 B_i^* = B_{tot}/K + (V/2) log_2(w_i / \bar{w}_g)，这是一种对数偏斜的注水准则，类似于失真率优化中反向注水的每节点版本。一种即插即用的自适应变体通过短预热阶段估计权重，并达到了 1 + O(√(log(K/δ)/(m T_0))) 的相对次优性。基于合成 n-gram 的仿真证实，经验 KL 散度被上、下界所界定，且在异构限幅条件下，最优分配严格优于均匀分配和逆加权基线。