End-to-end speech summarization has been shown to outperform cascade baselines. However, such models are difficult to train on very long inputs (dozens of minutes or hours) because of compute constraints, and are therefore trained on truncated inputs. Truncation degrades model quality; one remedy is block-wise modeling, i.e., processing a portion of the input frames at a time. In this paper, we develop a method for training summarization models on very long sequences incrementally. Speech summarization is realized as a streaming process in which the hypothesis summary is updated after every block based on the new acoustic information. We devise and test strategies for passing semantic context across blocks. Experiments on the How2 dataset demonstrate that the proposed block-wise training method improves ROUGE-L by 3 absolute points over a truncated-input baseline.
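The block-wise, streaming idea described above can be illustrated with a minimal sketch. This is not the model proposed in the paper: `iter_blocks`, `toy_block_step`, `stream_hypotheses`, and the `decay` parameter are hypothetical names introduced only for illustration, and the running-average "context" is an assumed stand-in for whatever semantic context a real encoder-decoder would carry across blocks. The sketch only shows the control flow: the input is consumed one block at a time, a context is passed from block to block, and a hypothesis is emitted after each block.

```python
from typing import Iterator, List, Sequence

Frame = List[float]  # one acoustic feature vector (assumed representation)


def iter_blocks(frames: Sequence[Frame], block_size: int) -> Iterator[Sequence[Frame]]:
    """Yield consecutive, non-overlapping blocks of at most block_size frames."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    for start in range(0, len(frames), block_size):
        yield frames[start:start + block_size]


def toy_block_step(block: Sequence[Frame], context: List[float], decay: float = 0.5) -> List[float]:
    """Hypothetical stand-in for one block-level model step.

    Mixes the context carried from earlier blocks with the mean of the
    current block. A real system would run an encoder/decoder here.
    """
    dim = len(context)
    block_mean = [sum(frame[d] for frame in block) / len(block) for d in range(dim)]
    return [decay * c + (1.0 - decay) * m for c, m in zip(context, block_mean)]


def stream_hypotheses(frames: Sequence[Frame], block_size: int, dim: int) -> List[List[float]]:
    """Process the input block by block, emitting one hypothesis per block.

    Only one block is handled at a time, so the work per step depends on
    block_size rather than on the full input length.
    """
    context = [0.0] * dim  # nothing has been heard yet
    hypotheses = []
    for block in iter_blocks(frames, block_size):
        context = toy_block_step(block, context)
        hypotheses.append(list(context))  # toy "summary" after this block
    return hypotheses


if __name__ == "__main__":
    frames = [[1.0], [1.0], [3.0], [3.0], [5.0]]
    for i, hyp in enumerate(stream_hypotheses(frames, block_size=2, dim=1)):
        print(f"after block {i}: {hyp}")
```

In this toy setting the last hypothesis still reflects the earliest frames through the carried context, which is the property a truncated-input baseline lacks.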