Models for low-latency streaming applications could benefit from the knowledge capacity of larger models, but edge devices cannot run these models due to resource constraints. A possible solution is to transfer hints during inference from a large model running remotely to a small model running on-device. However, this incurs a communication delay that breaks real-time requirements and does not guarantee that both models will operate on the same data at the same time. We propose knowledge boosting, a novel technique that allows a large model to operate on time-delayed input during inference while still boosting small model performance. Using a streaming neural network that processes 8 ms chunks, we evaluate different speech separation and enhancement tasks with communication delays of up to six chunks (48 ms). Our results show larger gains where the performance gap between the small and large models is wide, demonstrating a promising method for large-small model collaboration in low-latency applications. Code, dataset, and audio samples are available at https://knowledgeboosting.cs.washington.edu/.
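The timing relationship described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `delay_in_ms` and `pair_chunks_with_delayed_hints` are hypothetical, and the only values taken from the abstract are the 8 ms chunk size and the six-chunk (48 ms) maximum delay. The sketch assumes a fixed communication delay and shows which earlier chunk's hint is available when the small model processes each new chunk.

```python
from collections import deque
from typing import List, Optional, Tuple

CHUNK_MS = 8  # chunk duration stated in the abstract


def delay_in_ms(delay_chunks: int, chunk_ms: int = CHUNK_MS) -> int:
    """Convert a communication delay measured in chunks to milliseconds."""
    return delay_chunks * chunk_ms


def pair_chunks_with_delayed_hints(
    num_chunks: int, delay_chunks: int
) -> List[Tuple[int, Optional[int]]]:
    """Pair each chunk index t seen by the small model with the index of the
    chunk whose large-model hint has arrived by then (t - delay_chunks).

    The hint index is None while no hint has arrived yet. A fixed delay is an
    assumption made for this illustration.
    """
    in_flight: deque = deque()  # chunks sent to the remote model, oldest first
    pairs: List[Tuple[int, Optional[int]]] = []
    for t in range(num_chunks):
        in_flight.append(t)  # chunk t is sent to the remote large model
        # A hint is available once more than delay_chunks chunks are queued.
        hint = in_flight.popleft() if len(in_flight) > delay_chunks else None
        pairs.append((t, hint))
    return pairs
```

For example, `delay_in_ms(6)` gives the 48 ms figure quoted above, and with a delay of two chunks the small model processing chunk 2 would receive the hint computed from chunk 0.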