In clinical dictation, utterances produced by automatic speech recognition (ASR) lack explicit punctuation marks, which can lead to misinterpretation of the dictated reports. To produce precise and understandable clinical reports from ASR output, automatic punctuation restoration is required. Targeting this practical scenario, we propose a fast and lightweight pre-trained model for Chinese medical punctuation restoration based on the 'pretraining and fine-tuning' paradigm. In this work, we distill pre-trained models by incorporating supervised contrastive learning and a novel auxiliary pre-training task (Punctuation Mark Prediction) to make them well suited for punctuation restoration. Our experiments on various distilled models show that our model achieves 95% of the performance of state-of-the-art Chinese RoBERTa with only 10% of its model size.
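To make the supervised contrastive component concrete, the following is a minimal NumPy sketch of a supervised contrastive loss (in the style of Khosla et al., 2020) over token embeddings, where tokens sharing the same punctuation label are treated as positives. The function name, the token-level formulation, and the temperature value are illustrative assumptions, not the paper's released implementation.

```python
import numpy as np

def sup_con_loss(embeddings, labels, temperature=0.07):
    """Supervised contrastive loss over a batch of token embeddings.

    Tokens with the same punctuation label act as positives for each
    other; all other tokens in the batch act as negatives.
    Hypothetical sketch, not the authors' code.
    """
    # L2-normalize embeddings, then compute temperature-scaled similarities
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature

    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)

    # Numerically stable log-softmax over all non-self pairs
    sim_max = np.max(np.where(not_self, sim, -np.inf), axis=1, keepdims=True)
    exp_sim = np.exp(sim - sim_max) * not_self
    log_prob = sim - sim_max - np.log(exp_sim.sum(axis=1, keepdims=True))

    labels = np.asarray(labels)
    pos_mask = (labels[:, None] == labels[None, :]) & not_self

    # Average log-probability over each anchor's positives (skip anchors
    # with no positive in the batch), then negate for the loss
    pos_counts = pos_mask.sum(axis=1)
    valid = pos_counts > 0
    mean_log_prob = (log_prob * pos_mask).sum(axis=1)[valid] / pos_counts[valid]
    return -mean_log_prob.mean()
```

As expected for this loss, a batch in which same-label tokens are close in embedding space yields a lower value than one in which labels are assigned to distant tokens.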