Although Large Language Models (LLMs) excel at many tasks, their application to Speech-to-Speech Translation (S2ST) remains underexplored and is hindered by data scarcity. To bridge this gap, we propose PROST-LLM (PROgressive Speech-to-speech Translation) to progressively enhance the S2ST capabilities of LLMs. First, we fine-tune the LLMs on the CVSS corpus, employing our designed tri-task learning and chain-of-modality methods to boost initial performance. Then, leveraging the fine-tuned model, we generate preference pairs through self-sampling and back-translation, without human evaluation. Finally, these preference pairs are used for preference optimization to further enhance the model's S2ST capability. Extensive experiments confirm the effectiveness of the proposed PROST-LLM in improving the S2ST capability of LLMs.
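The second stage above, generating preference pairs via self-sampling and back-translation without human evaluation, can be sketched as follows. This is a minimal, hypothetical illustration: the token-overlap scorer, the toy back-translation function, and all names are our assumptions, standing in for the paper's actual model outputs and quality metric (which would typically be BLEU-like).

```python
# Hypothetical sketch of preference-pair construction via self-sampling
# and back-translation. The scorer and back-translator below are toy
# stand-ins for illustration, not the paper's actual implementation.

def overlap_score(source: str, back_translation: str) -> float:
    """Toy quality proxy: fraction of source tokens recovered in the
    back-translated candidate (a BLEU-like metric in practice)."""
    src = set(source.lower().split())
    hyp = set(back_translation.lower().split())
    if not src:
        return 0.0
    return len(src & hyp) / len(src)

def build_preference_pair(source_text, candidates, back_translate):
    """Rank self-sampled translations by how well their back-translations
    match the source transcript; the best becomes 'chosen' and the worst
    'rejected', yielding a preference pair for optimization (e.g., DPO)."""
    ranked = sorted(
        candidates,
        key=lambda c: overlap_score(source_text, back_translate(c)),
        reverse=True,
    )
    return {"chosen": ranked[0], "rejected": ranked[-1]}

# Toy dictionary-based back-translation, purely for illustration.
toy_dict = {"bonjour": "hello", "le": "the", "monde": "world", "chat": "cat"}

def toy_back_translate(fr_text: str) -> str:
    return " ".join(toy_dict.get(w, w) for w in fr_text.lower().split())

pair = build_preference_pair(
    "hello world",
    ["bonjour monde", "bonjour chat", "le chat"],  # self-sampled outputs
    toy_back_translate,
)
```

Here `pair["chosen"]` is the candidate whose back-translation best recovers the source, and `pair["rejected"]` the worst; such pairs feed the preference-optimization stage directly.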