Streaming speech-to-text translation (StreamST) is the task of automatically translating speech while incrementally receiving an audio stream. Unlike simultaneous ST (SimulST), which deals with pre-segmented speech, StreamST must handle continuous and unbounded audio streams. This requires additional decisions about what to retain of the previous history, since keeping it in its entirety is impractical due to latency and computational constraints. Despite the real-world demand for real-time ST, research on streaming translation remains limited, with existing work focusing solely on SimulST. To fill this gap, we introduce StreamAtt, the first StreamST policy, and propose StreamLAAL, the first StreamST latency metric designed to be comparable with existing metrics for SimulST. Extensive experiments across all 8 languages of MuST-C v1.0 show the effectiveness of StreamAtt compared to a naive streaming baseline and the related state-of-the-art SimulST policy, providing a first step in StreamST research.