The field of natural language processing (NLP) has recently witnessed a transformative shift with the emergence of foundation models, particularly Large Language Models (LLMs) that have revolutionized text-based NLP. This paradigm has extended to other modalities, including speech, where researchers are actively exploring the combination of Speech Foundation Models (SFMs) and LLMs into single, unified models capable of addressing multimodal tasks. Among such tasks, this paper focuses on speech-to-text translation (ST). By examining the published papers on the topic, we propose a unified view of the architectural solutions and training strategies presented so far, highlighting similarities and differences among them. Based on this examination, we not only organize the lessons learned but also show how diverse settings and evaluation approaches hinder the identification of the best-performing solution for each architectural building block and training choice. Lastly, we outline recommendations for future work on the topic, aimed at better understanding the strengths and weaknesses of the SFM+LLM solutions for ST.