Recent work on Speech-to-Text Translation (S2TT) has focused on LLM-based models, introducing the increasingly adopted Chain-of-Thought (CoT) prompting, in which the model is guided to first transcribe the speech and then translate it. CoT typically outperforms direct prompting primarily because it can exploit abundant Automatic Speech Recognition (ASR) and Text-to-Text Translation (T2TT) datasets to explicitly supervise each of its intermediate steps. In this paper, we systematically compare CoT and Direct prompting under increasing amounts of S2TT data. To this end, we pseudo-label an ASR corpus by translating its transcriptions into six European languages, and train LLM-based S2TT systems with both prompting strategies at different data scales. Our results show that Direct prompting improves more consistently as the amount of data increases, suggesting that it may become the more effective approach as larger S2TT resources become available.
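To make the contrast between the two strategies concrete, the following is a minimal sketch of what Direct and CoT prompts for S2TT might look like. The template wording and the `<audio>` placeholder are illustrative assumptions, not the paper's exact prompts.

```python
# Hypothetical prompt templates contrasting the two S2TT prompting
# strategies compared in the paper. The exact wording and the <audio>
# placeholder are illustrative assumptions.

def direct_prompt(audio_placeholder: str, tgt_lang: str) -> str:
    """Direct: the model produces the translation in a single step."""
    return (
        f"{audio_placeholder}\n"
        f"Translate the speech above into {tgt_lang}."
    )

def cot_prompt(audio_placeholder: str, tgt_lang: str) -> str:
    """Chain-of-Thought: the model first transcribes the speech (an ASR
    step), then translates the transcription (a T2TT step)."""
    return (
        f"{audio_placeholder}\n"
        f"First transcribe the speech above, "
        f"then translate the transcription into {tgt_lang}."
    )

if __name__ == "__main__":
    print(direct_prompt("<audio>", "German"))
    print(cot_prompt("<audio>", "German"))
```

Because the CoT template decomposes the task, each step can be trained on abundant ASR and T2TT data, whereas the Direct template requires paired S2TT supervision end to end.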