Punctuation restoration plays an essential role in the post-processing procedure of automatic speech recognition, but model efficiency is a key requirement for this task. To that end, we present EfficientPunct, an ensemble method with a multimodal time-delay neural network that outperforms the current best model by 1.0 F1 points, using less than a tenth of its inference network parameters. We streamline a speech recognizer to efficiently output hidden layer acoustic embeddings for punctuation restoration, as well as BERT to extract meaningful text embeddings. By using forced alignment and temporal convolutions, we eliminate the need for attention-based fusion, greatly increasing computational efficiency and raising performance. EfficientPunct sets a new state of the art with an ensemble that weights BERT's purely language-based predictions slightly more than the multimodal network's predictions. Our code is available at https://github.com/lxy-peter/EfficientPunct.
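To make the ensemble step concrete, the following is a minimal sketch of blending per-token class probabilities from a text-only model and a multimodal model, with slightly more weight on the text-only side. The label set, the function name `ensemble_predict`, and the weight `alpha = 0.6` are illustrative assumptions; the abstract does not state the actual classes or weighting value, and this is not taken from the released code.

```python
from typing import List, Sequence

# Hypothetical label set; the abstract does not list the punctuation classes.
LABELS = ["none", "comma", "period", "question"]


def ensemble_predict(
    p_text: Sequence[Sequence[float]],
    p_multimodal: Sequence[Sequence[float]],
    alpha: float = 0.6,
) -> List[str]:
    """Blend per-token class probabilities and pick the most likely label.

    alpha is the weight on the text-only (BERT-style) predictions; the
    multimodal predictions get 1 - alpha. The default of 0.6 is an
    illustrative assumption, not a value reported in the abstract.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(p_text) != len(p_multimodal):
        raise ValueError("both models must score the same number of tokens")

    labels = []
    for text_row, mm_row in zip(p_text, p_multimodal):
        # Convex combination of the two distributions, class by class.
        blended = [alpha * t + (1.0 - alpha) * m for t, m in zip(text_row, mm_row)]
        # Index of the highest blended probability.
        best = max(range(len(blended)), key=blended.__getitem__)
        labels.append(LABELS[best])
    return labels


# Made-up probabilities for two tokens, for illustration only.
p_text = [[0.7, 0.2, 0.1, 0.0], [0.2, 0.1, 0.6, 0.1]]
p_multimodal = [[0.2, 0.7, 0.1, 0.0], [0.1, 0.1, 0.7, 0.1]]

print(ensemble_predict(p_text, p_multimodal, alpha=0.6))
print(ensemble_predict(p_text, p_multimodal, alpha=0.4))
```

In this toy input the two models disagree on the first token, so the choice of `alpha` decides the label: weighting the text-only predictions more keeps "none", while weighting the multimodal predictions more yields "comma". In practice such a weight would be tuned on held-out data.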