We consider the generative modeling of speech over multiple minutes, a requirement for long-form multimedia generation and audio-native voice assistants. However, current spoken language models struggle to generate plausible speech past tens of seconds, due to the high temporal resolution of speech tokens causing loss of coherence, architectural issues with long-sequence training or extrapolation, and memory costs at inference time. With these considerations in mind, we propose SpeechSSM, the first speech language model to learn from and sample long-form spoken audio (e.g., 16 minutes of read or extemporaneous speech) in a single decoding session without text intermediates, based on recent advances in linear-time sequence modeling. Furthermore, to address growing challenges in spoken language evaluation, especially in this new long-form setting, we propose: new embedding-based and LLM-judged metrics; quality measurements over length and time; and a new benchmark for long-form speech processing and generation, LibriSpeech-Long. Speech samples and the dataset are released at https://google.github.io/tacotron/publications/speechssm/