The field of natural language processing (NLP) has recently witnessed a transformative shift with the emergence of foundation models, particularly Large Language Models (LLMs), which have revolutionized text-based NLP. This paradigm has extended to other modalities, including speech, where researchers are actively exploring the combination of Speech Foundation Models (SFMs) and LLMs into single, unified models capable of addressing multimodal tasks. Among such tasks, this paper focuses on speech-to-text translation (ST). By examining the published papers on the topic, we propose a unified view of the architectural solutions and training strategies presented so far, highlighting similarities and differences among them. Based on this examination, we not only organize the lessons learned but also show how diverse settings and evaluation approaches hinder the identification of the best-performing solution for each architectural building block and training choice. Lastly, we outline recommendations for future work on the topic, aimed at better understanding the strengths and weaknesses of SFM+LLM solutions for ST.