With the rise of Speech Large Language Models (Speech LLMs), there has been growing interest in discrete speech tokens for their ability to integrate seamlessly with text-based tokens. Although discrete-token-based LLMs have shown promising results on certain tasks, the performance gap between the two paradigms is rarely explored, as most studies focus on continuous speech features. In this paper, we present a fair and thorough comparison between discrete and continuous features across a variety of semantic-related tasks using a lightweight LLM (Qwen1.5-0.5B). Our findings reveal that continuous features generally outperform discrete tokens, particularly in tasks requiring fine-grained semantic understanding. Moreover, this study goes beyond a surface-level comparison by identifying key factors behind the under-performance of discrete tokens, such as limited token granularity and inefficient information retention. Based on this analysis, we explore potential directions for enhancing the performance of discrete tokens. We hope our results offer new insights into opportunities for advancing discrete speech tokens in Speech LLMs.