The combination of Large Language Models (LLMs) and Automatic Speech Recognition (ASR), when deployed on edge devices (termed edge ASR-LLM), can serve as a powerful personalized assistant that enables audio-based interaction for users. Compared to text-based interaction, edge ASR-LLM offers more accessible and natural audio interaction. Unfortunately, existing ASR-LLM models are mainly trained in high-performance computing environments and have substantial model weights, making them difficult to deploy on edge devices. More importantly, to better serve users' personalized needs, the ASR-LLM must be able to learn from each distinct user, since audio input often contains highly personalized characteristics that necessitate personalized on-device training. Because individually fine-tuning the ASR or the LLM often yields suboptimal results due to modality-specific limitations, end-to-end training ensures seamless integration of audio features and language understanding (cross-modal alignment), ultimately enabling more personalized and efficient adaptation on edge devices. However, due to the complex training requirements and substantial computational demands of existing approaches, cross-modal alignment between ASR audio and LLMs can be challenging on edge devices. In this work, we propose a resource-efficient cross-modal alignment framework that bridges ASR and LLMs on edge devices to handle personalized audio input. Our framework enables efficient ASR-LLM alignment on resource-constrained devices such as the NVIDIA Jetson Orin (8GB RAM), achieving a 50x training-time speedup while improving alignment quality by more than 50\%. To the best of our knowledge, this is the first work to study efficient ASR-LLM alignment on resource-constrained edge devices.