Models for low-latency, streaming applications could benefit from the knowledge capacity of larger models, but edge devices cannot run these models due to resource constraints. A possible solution is to transfer hints during inference from a large model running remotely to a small model running on-device. However, this incurs a communication delay that breaks real-time requirements and does not guarantee that both models will operate on the same data at the same time. We propose knowledge boosting, a novel technique that allows a large model to operate on time-delayed input during inference while still boosting small model performance. Using a streaming neural network that processes 8 ms chunks, we evaluate different speech separation and enhancement tasks with communication delays of up to six chunks, or 48 ms. Our results show larger gains where the performance gap between the small and large models is wide, demonstrating a promising method for large-small model collaboration in low-latency applications. Code, dataset, and audio samples are available at https://knowledgeboosting.cs.washington.edu/.
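The delayed-hint setup described above can be sketched as a simple streaming loop: the small model emits output for every chunk in real time, while each hint from the large model only becomes available a fixed number of chunks later. Everything below is a minimal illustrative sketch, not the paper's actual architecture; `small_model`, `large_model`, and the additive fusion are hypothetical stand-ins.

```python
from collections import deque

CHUNK_MS = 8       # chunk duration used in the paper
DELAY_CHUNKS = 6   # maximum evaluated delay: six chunks (48 ms)

def small_model(chunk, hint):
    # Hypothetical on-device model: the delayed hint (None until the
    # first hint arrives) simply modulates the output additively.
    return chunk + (hint if hint is not None else 0.0)

def large_model(chunk):
    # Hypothetical remote model producing a hint for this chunk.
    return chunk * 0.1

def stream(chunks, delay=DELAY_CHUNKS):
    """Run the small model in real time while hints from the large
    model arrive `delay` chunks late, mimicking the communication
    delay in the knowledge-boosting setup."""
    in_flight = deque([None] * delay)  # hints still in transit
    outputs = []
    for chunk in chunks:
        in_flight.append(large_model(chunk))   # hint computed remotely now
        delayed_hint = in_flight.popleft()     # hint from `delay` chunks ago
        outputs.append(small_model(chunk, delayed_hint))
    return outputs
```

With a delay of two chunks, the small model processes the first two chunks unaided and only then begins receiving (stale) hints, which is the mismatch the proposed technique trains the models to tolerate.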