Federated Language Models Under Bandwidth Budgets: Distillation Rates and Conformal Coverage

Training a language model on data that is scattered across bandwidth-limited nodes and cannot be centralized is a setting that arises in clinical networks, enterprise knowledge bases, and scientific consortia. We ask what statistical guarantees are achievable in principle in this regime under explicit bandwidth budgets; our aim is to characterize what is provably possible, not to demonstrate a deployment-ready system. Existing theory treats training-time consistency and inference-time calibration in isolation, and neither line of work makes bandwidth a first-class statistical parameter. We analyze two protocols as the analytical vehicles for our results: Federated Probe-Logit Distillation (FPLD) for training and Federated Conformal RAG (FC-RAG) for inference. Our first main result is an explicit high-probability KL-consistency rate for FPLD that depends simultaneously on the node count $K$, per-node sample size $n$, quantization budget $B$, probe-set size $m$, and vocabulary size $V$; bandwidth enters only through an exponentially vanishing quantization term. Our second main result is a distribution-free marginal-coverage bound for FC-RAG whose novel retrieval-bandwidth slack $\Delta_{\mathrm{RAG}} = f_{\max}\sqrt{K^{-2}\sum_i v(B_i)}$ makes per-node retrieval bandwidth a first-class statistical parameter; arithmetic aggregation across the $K$ nodes shrinks this slack as $K^{-1/2}$ in the per-node-uniform regime. A Pinsker-type corollary composes the two bounds into an end-to-end coverage guarantee. Synthetic experiments verify the predicted scaling in the bounds' parameters, and small-scale experiments on a GPT-2 testbed illustrate that the qualitative bandwidth-accuracy tradeoff survives on a real language model. A deployment-scale empirical evaluation is out of scope.
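
To make the $K^{-1/2}$ claim concrete, the slack formula can be evaluated numerically. The following is a minimal sketch, not the paper's implementation: the function names `rag_slack` and `toy_v` are hypothetical, the abstract does not define $v(\cdot)$, and the form $v(B) = 2^{-2B}$ used here is purely an illustrative assumption. We also assume that "per-node-uniform" means $B_i = B$ for every node, in which case $\Delta_{\mathrm{RAG}} = f_{\max}\sqrt{v(B)/K}$.

```python
import math
from typing import Callable, Sequence


def rag_slack(f_max: float,
              budgets: Sequence[float],
              v: Callable[[float], float]) -> float:
    """Evaluate Delta_RAG = f_max * sqrt(K^-2 * sum_i v(B_i)).

    `budgets` holds the per-node retrieval budgets B_i; K is its length.
    `v` is supplied by the caller because the abstract does not define it.
    """
    K = len(budgets)
    if K == 0:
        raise ValueError("need at least one node")
    return f_max * math.sqrt(sum(v(b) for b in budgets)) / K


def toy_v(bits: float) -> float:
    # ASSUMPTION (not from the paper): a variance-like term that
    # decays as 2^(-2 * bits). Any other v could be passed instead.
    return 2.0 ** (-2.0 * bits)


# Uniform regime: every node has the same budget B = 8.
slack_4 = rag_slack(1.0, [8] * 4, toy_v)
slack_16 = rag_slack(1.0, [8] * 16, toy_v)

# Quadrupling K should halve the slack if it scales as K^(-1/2).
ratio = slack_4 / slack_16
```

Under these assumptions `ratio` comes out as 2, matching the $K^{-1/2}$ scaling. With non-uniform budgets the same function applies, and the slack is then dominated by the nodes with the largest $v(B_i)$.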
