To support latency-sensitive AI applications ranging from autonomous driving to industrial robot manipulation, 6G envisions distributed ML with computational resources in mobile, edge, and cloud connected over hyper-reliable low-latency communication (HRLLC). In this setting, speculative decoding can facilitate collaborative inference of distributively deployed models: a lightweight on-device model locally generates drafts while a more capable remote target model on a server verifies and corrects them in parallel with speculative sampling, resulting in lower latency without compromising accuracy. However, unlike autoregressive text generation, behavior cloning policies, typically used for embodied AI applications, cannot parallelize verification and correction across multiple drafts, because each generated action depends on the observation updated by the preceding action. To this end, we propose Action Deviation-Aware Hybrid Inference (ADAHI), wherein drafts are selectively transmitted and verified based on action deviation, which correlates strongly with an action's rejection probability under the target model. By invoking the server only when necessary, ADAHI reduces communication and computational overhead while preserving the accuracy gain from speculative sampling. Experiments on our testbed show that ADAHI reduces transmissions and server operations by approximately 40%, lowers end-to-end latency by 39.2%, and attains up to 97.2% of the task-success rate of a baseline that invokes speculative sampling for every draft embedding vector.
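The deviation-gated decision described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the Euclidean deviation metric, the threshold, and the helper names (`action_deviation`, `adahi_step`, `verify_on_server`) are all assumptions chosen for clarity.

```python
import numpy as np

def action_deviation(draft, reference):
    # Assumed metric: Euclidean distance between the draft action and a
    # reference action (e.g., the previously accepted action).
    return float(np.linalg.norm(np.asarray(draft) - np.asarray(reference)))

def adahi_step(draft, reference, threshold, verify_on_server):
    """One hybrid-inference step (illustrative sketch).

    If the draft's deviation stays below `threshold`, accept it on-device
    and skip transmission; otherwise send it to the server for
    verification/correction via speculative sampling.
    Returns (action, transmitted_flag).
    """
    if action_deviation(draft, reference) < threshold:
        return draft, False                 # accepted locally, no uplink
    return verify_on_server(draft), True    # server verifies/corrects
```

Under this gating, server load scales with the fraction of drafts whose deviation exceeds the threshold, which is how the method trades a small accuracy loss for the reported reductions in transmissions and latency.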