We introduce STAR (Stream Transduction with Anchor Representations), a novel Transformer-based model designed for efficient sequence-to-sequence transduction over streams. STAR dynamically segments input streams to create compressed anchor representations, achieving nearly lossless 12x compression in Automatic Speech Recognition (ASR) and outperforming existing methods. In simultaneous speech-to-text tasks, STAR also demonstrates superior segmentation and latency-quality trade-offs, jointly optimizing latency, memory footprint, and quality.