Simultaneous speech translation (SST) aims to provide real-time translation of spoken language, even before the speaker finishes their sentence. Traditionally, SST has been addressed primarily by cascaded systems that decompose the task into subtasks, including speech recognition, segmentation, and machine translation. However, the advent of deep learning has sparked significant interest in end-to-end (E2E) systems. A major limitation of most E2E SST approaches reported in the current literature is that they assume the source speech is pre-segmented into sentences, which is a significant obstacle for practical, real-world applications. This thesis proposal addresses end-to-end simultaneous speech translation, particularly in the long-form setting, i.e., without pre-segmentation. We present a survey of the latest advancements in E2E SST, assess the primary obstacles in SST and their relevance to long-form scenarios, and suggest approaches to tackle these challenges.