Speech Large Language Models (Speech-LLMs) have advanced significantly in recent years, greatly enhancing multimodal interaction capabilities. However, applying them in low-resource, dialect-diverse settings remains challenging. Tibetan is a prime example: its data are severely scarce, and its major dialects (Ü-Tsang, Amdo, and Kham) differ phonetically. This paper proposes Ti-Audio, the first multi-dialectal end-to-end Speech-LLM for Tibetan. To align speech and text efficiently, we introduce a Dynamic Q-Former Adapter that extracts essential acoustic features from variable-length speech, ensuring stable cross-modal alignment even with limited data. At the data level, we leverage mutual assistance among related dialects to alleviate data scarcity and employ a temperature-based sampling strategy to maximize this synergy. Experimental results demonstrate that Ti-Audio achieves state-of-the-art performance on Tibetan benchmarks for automatic speech recognition and speech translation. Our work validates the effectiveness of cross-dialectal cooperation and provides a scalable paradigm for developing Speech-LLMs in low-resource scenarios.
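The abstract does not give the exact form of the temperature-based sampling strategy. As a minimal sketch, assuming the formulation commonly used in multilingual training, where the probability of drawing from dialect i is proportional to (n_i / N)^(1/T), the effect on dialect balance can be illustrated as follows. The function name `temperature_sampling_probs` and the per-dialect sample counts are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def temperature_sampling_probs(sizes, temperature):
    """Return per-dialect sampling probabilities p_i proportional to (n_i / N) ** (1 / T).

    Assumed formulation (not confirmed by the abstract):
    T = 1 keeps the natural data proportions; larger T flattens the
    distribution toward uniform, up-sampling low-resource dialects.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total = sum(sizes.values())
    # Raise each dialect's data share to the power 1/T ...
    scaled = {name: (n / total) ** (1.0 / temperature) for name, n in sizes.items()}
    # ... then renormalize so the probabilities sum to 1.
    norm = sum(scaled.values())
    return {name: value / norm for name, value in scaled.items()}


# Hypothetical sample counts per dialect (illustrative only).
dialect_sizes = {"U-Tsang": 800, "Amdo": 150, "Kham": 50}

proportional = temperature_sampling_probs(dialect_sizes, temperature=1.0)
smoothed = temperature_sampling_probs(dialect_sizes, temperature=5.0)

for name in dialect_sizes:
    print(f"{name}: T=1 -> {proportional[name]:.3f}, T=5 -> {smoothed[name]:.3f}")
```

Under this assumed form, raising the temperature shifts sampling mass from the largest dialect toward the smaller ones while preserving their relative order, which is one way a sampling strategy could let related dialects support each other during training.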