This paper studies a hybrid language model (HLM) architecture that integrates a small language model (SLM) operating on a mobile device with a large language model (LLM) hosted at the base station (BS) of a wireless network. The HLM token generation process follows the speculative inference principle: the SLM's vocabulary distribution is uploaded to the LLM, which either accepts or rejects each draft token, with rejected tokens resampled by the LLM. While this approach ensures alignment between the vocabulary distributions of the SLM and LLM, it suffers from low token throughput due to uplink transmission and the computation costs of running both language models. To address this, we propose a novel HLM structure coined Uncertainty-aware opportunistic HLM (U-HLM), wherein the SLM locally measures its output uncertainty and skips both uplink transmissions and LLM operations for tokens that are likely to be accepted. This opportunistic skipping is enabled by our empirical finding of a linear correlation between the SLM's uncertainty and the LLM's rejection probability. We analytically derive the uncertainty threshold and evaluate its expected risk of rejection. Simulations show that U-HLM reduces uplink transmissions and LLM computations by 45.93%, while achieving up to 97.54% of the LLM's inference accuracy and 2.54$\times$ faster token throughput than HLM without skipping.
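The token-generation rule described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it uses Shannon entropy as a stand-in uncertainty measure (the paper's exact measure and threshold derivation may differ), and `theta` is a hypothetical threshold parameter. The accept/resample step follows the standard speculative-sampling rule of accepting a draft token with probability $\min(1, p_{\text{LLM}}/p_{\text{SLM}})$ and otherwise resampling from the normalized residual distribution.

```python
import numpy as np

def slm_uncertainty(p_slm):
    # Shannon entropy of the SLM's vocabulary distribution, used here
    # as a simple uncertainty proxy (an assumption for illustration).
    p = p_slm[p_slm > 0]
    return -np.sum(p * np.log(p))

def u_hlm_step(p_slm, p_llm, theta, rng):
    """One U-HLM token step: skip the uplink and LLM verification when
    the SLM's uncertainty falls below the threshold theta; otherwise
    apply the speculative accept/reject-and-resample rule."""
    draft = rng.choice(len(p_slm), p=p_slm)           # SLM draft token
    if slm_uncertainty(p_slm) < theta:
        return draft, True                            # opportunistic skip
    if rng.random() < min(1.0, p_llm[draft] / p_slm[draft]):
        return draft, False                           # LLM accepts draft
    residual = np.maximum(p_llm - p_slm, 0.0)         # LLM resamples from
    residual /= residual.sum()                        # the residual dist.
    return rng.choice(len(p_slm), p=residual), False
```

With `theta = 0` every token incurs the uplink and LLM check (plain HLM); raising `theta` trades a bounded rejection risk for fewer transmissions, which is the throughput gain the abstract quantifies.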