Frame-based models such as CTC and transducers are well suited to streaming automatic speech recognition, but their decoding uses no future context, which can lead to incorrect pruning. The label-based attention encoder-decoder mitigates this issue through soft attention over the input, but unlike CTC it tends to overestimate labels biased towards its training domain. We exploit these complementary attributes and propose integrating frame-synchronous and label-synchronous (F-Sync and L-Sync) decoding, performed alternately within a single beam-search scheme. F-Sync decoding leads the block-wise processing, while L-Sync decoding supplies prioritized hypotheses by looking ahead at future frames within a block. We retain the hypotheses from both decoding methods so that pruning is more effective. Experiments demonstrate that the proposed search algorithm achieves lower error rates than other search methods while remaining robust in out-of-domain situations.
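To make the alternation concrete, the following is a minimal toy sketch, not the algorithm as implemented in the paper. It assumes that each block is a list of per-frame log-probability tables, that F-Sync scoring is a simplified best-path step (no CTC prefix summation or repeat handling), and that L-Sync scoring comes from a hypothetical `toy_label_scorer` standing in for an attention decoder. All function names, the `<blank>` and `<stop>` symbols, and the direct comparison of scores from the two passes are assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List, Tuple

Hyp = Tuple[str, ...]        # a label sequence
Beam = Dict[Hyp, float]      # label sequence -> log score
Frame = Dict[str, float]     # label (or "<blank>") -> log-probability
LabelScorer = Callable[[Hyp, List[Frame]], Dict[str, float]]


def prune(beam: Beam, width: int) -> Beam:
    """Keep the `width` best-scoring hypotheses."""
    ranked = sorted(beam.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:width])


def merge(a: Beam, b: Beam) -> Beam:
    """Union of two hypothesis sets, keeping the better score per sequence."""
    out = dict(a)
    for hyp, score in b.items():
        out[hyp] = max(score, out.get(hyp, -math.inf))
    return out


def f_sync_step(beam: Beam, frame: Frame, width: int) -> Beam:
    """Advance every hypothesis by one frame (simplified best-path step)."""
    expanded: Beam = {}
    for hyp, score in beam.items():
        for label, logp in frame.items():
            new = hyp if label == "<blank>" else hyp + (label,)
            expanded[new] = max(score + logp, expanded.get(new, -math.inf))
    return prune(expanded, width)


def l_sync_decode(beam: Beam, block: List[Frame], scorer: LabelScorer,
                  width: int, max_labels: int) -> Beam:
    """Expand label by label; the scorer may inspect the whole block."""
    active, finished = dict(beam), {}
    for _ in range(max_labels):
        expanded: Beam = {}
        for hyp, score in active.items():
            for label, logp in scorer(hyp, block).items():
                if label == "<stop>":  # hypothesis emits nothing more here
                    finished[hyp] = max(score + logp,
                                        finished.get(hyp, -math.inf))
                else:
                    new = hyp + (label,)
                    expanded[new] = max(score + logp,
                                        expanded.get(new, -math.inf))
        active = prune(expanded, width)
        if not active:
            break
    return prune(finished, width)


def alternate_decode(blocks: List[List[Frame]], scorer: LabelScorer,
                     width: int = 3, max_labels: int = 2) -> Beam:
    """Per block: run L-Sync and F-Sync from the same beam, then merge."""
    beam: Beam = {(): 0.0}
    for block in blocks:
        l_beam = l_sync_decode(beam, block, scorer, width, max_labels)
        f_beam = beam
        for frame in block:  # F-Sync leads the block frame by frame
            f_beam = f_sync_step(f_beam, frame, width)
        # Keep hypotheses from both passes, then prune once.
        beam = prune(merge(f_beam, l_beam), width)
    return beam


def toy_label_scorer(hyp: Hyp, block: List[Frame]) -> Dict[str, float]:
    # Hypothetical stand-in: a real attention decoder would attend over `block`.
    if not hyp:
        return {"a": math.log(0.9), "<stop>": math.log(0.1)}
    return {"b": math.log(0.5), "<stop>": math.log(0.5)}


def _log_frame(**probs: float) -> Frame:
    return {("<blank>" if k == "blank" else k): math.log(p)
            for k, p in probs.items()}


toy_blocks = [
    [_log_frame(a=0.6, b=0.3, blank=0.1), _log_frame(a=0.1, b=0.1, blank=0.8)],
    [_log_frame(a=0.2, b=0.7, blank=0.1)],
]
final_beam = alternate_decode(toy_blocks, toy_label_scorer)
best_hyp = max(final_beam, key=final_beam.get)
```

In this sketch the merge step is what lets a hypothesis dropped by one pass survive through the other. A real system would need to put the two scores on a comparable scale before merging, which the toy does not attempt.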