This work proposes FS-EEND, a frame-wise online (streaming) end-to-end neural diarization method that operates in a frame-in-frame-out fashion. To detect a flexible number of speakers frame by frame and to extract and update their corresponding attractors, we combine a causal speaker embedding encoder with an online non-autoregressive self-attention-based attractor decoder. A look-ahead mechanism lets the model use some future frames to detect new speakers in real time and to adaptively update speaker attractors. The proposed method processes the audio stream frame by frame, and its inference latency is low, arising from the look-ahead frames. Experiments show that, compared with recently proposed block-wise online methods, FS-EEND achieves state-of-the-art diarization results with low inference latency and computational cost.
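To make the relationship between look-ahead and latency concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes the look-ahead is expressed as an attention mask that lets frame t see frames up to t + k; the abstract does not say how the look-ahead is actually realized. The function names `lookahead_mask` and `emit_schedule` and the parameter `lookahead` are hypothetical.

```python
from __future__ import annotations


def lookahead_mask(num_frames: int, lookahead: int) -> list[list[bool]]:
    """Return mask[t][s], True when frame t may use frame s.

    Assumption for illustration: frame t sees all past frames, itself,
    and at most `lookahead` future frames.
    """
    return [
        [s <= t + lookahead for s in range(num_frames)]
        for t in range(num_frames)
    ]


def emit_schedule(num_frames: int, lookahead: int) -> list[tuple[int, int]]:
    """Pair each arriving frame with the output frame it completes.

    The output for frame t needs frame t + lookahead, so it can only be
    emitted once that frame has arrived. Frames at the end of the stream
    that never receive their full look-ahead are not handled here.
    """
    schedule = []
    for arrived in range(num_frames):
        ready = arrived - lookahead  # output frame whose context is now complete
        if ready >= 0:
            schedule.append((arrived, ready))
    return schedule


if __name__ == "__main__":
    for row in lookahead_mask(4, 1):
        print(["x" if allowed else "." for allowed in row])
    print(emit_schedule(4, 1))
```

Under this assumption, the algorithmic delay per output is the number of look-ahead frames multiplied by the frame shift, which is why a small look-ahead keeps a frame-in-frame-out system low-latency.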