The combination of Large Language Models (LLMs) and Automatic Speech Recognition (ASR), when deployed on edge devices (termed edge ASR-LLM), can serve as a powerful personalized assistant that enables audio-based interaction for users. Compared to text-based interaction, edge ASR-LLM offers more accessible and natural audio interaction. Unfortunately, existing ASR-LLM models are mainly trained in high-performance computing environments and have substantial model weights, making them difficult to deploy on edge devices. More importantly, to better serve users' personalized needs, the ASR-LLM must be able to learn from each distinct user, since audio input often contains highly personalized characteristics that necessitate personalized on-device training. Because individually fine-tuning the ASR or the LLM often yields suboptimal results due to modality-specific limitations, end-to-end training ensures seamless integration of audio features and language understanding (cross-modal alignment), ultimately enabling more personalized and efficient adaptation on edge devices. However, due to the complex training requirements and substantial computational demands of existing approaches, cross-modal alignment between ASR audio and LLMs can be challenging on edge devices. In this work, we propose a resource-efficient cross-modal alignment framework that bridges ASR and LLMs on edge devices to handle personalized audio input. Our framework enables efficient ASR-LLM alignment on resource-constrained devices such as the NVIDIA Jetson Orin (8GB RAM), achieving a 50x training-time speedup while improving alignment quality by more than 50\%. To the best of our knowledge, this is the first work to study efficient ASR-LLM alignment on resource-constrained edge devices.