In clinical dictation, transcripts produced by automatic speech recognition (ASR) lack explicit punctuation marks, which can lead to misreading of the dictated reports. Producing a precise and understandable clinical report with ASR therefore requires automatic punctuation restoration. With practical deployment in mind, we propose a fast and lightweight pre-trained model for Chinese medical punctuation restoration based on the 'pretraining and fine-tuning' paradigm. We distill pre-trained models while incorporating supervised contrastive learning and a novel auxiliary pre-training task (Punctuation Mark Prediction) to make them well suited to punctuation restoration. Experiments on various distilled models show that our model achieves 95% of the performance of the state-of-the-art Chinese RoBERTa at 10% of its model size.
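The abstract does not state how the restoration task is formulated. One common framing, assumed here purely for illustration, is per-character sequence labeling: punctuation is stripped from the text and each remaining character is tagged with the mark that follows it. The sketch below shows that data preparation step and its inverse; the label names, the punctuation set, and both function names are hypothetical and not taken from the paper.

```python
# Hypothetical label set: maps a Chinese punctuation mark to a tag name.
PUNCT_TO_LABEL = {",": "COMMA", "。": "PERIOD", "?": "QUESTION", "、": "PAUSE"}
LABEL_TO_PUNCT = {label: mark for mark, label in PUNCT_TO_LABEL.items()}


def to_labeled_sequence(text):
    """Strip punctuation and tag each character with the mark that followed it.

    Characters not followed by a mark get the tag "O". Punctuation at the very
    start of the text is dropped, and for consecutive marks the last one wins.
    """
    chars, labels = [], []
    for ch in text:
        if ch in PUNCT_TO_LABEL:
            if labels:
                labels[-1] = PUNCT_TO_LABEL[ch]
        else:
            chars.append(ch)
            labels.append("O")
    return chars, labels


def restore(chars, labels):
    """Rebuild punctuated text from characters and their predicted tags."""
    return "".join(ch + LABEL_TO_PUNCT.get(lab, "") for ch, lab in zip(chars, labels))


chars, labels = to_labeled_sequence("患者发热,咳嗽三天。")
print("".join(chars))          # unpunctuated input, as an ASR system would emit it
print(labels)                  # per-character targets for a tagging model
print(restore(chars, labels))  # round trip back to the punctuated text
```

Under this framing, a distilled encoder would only need a small classification head over the tag set, which is one reason a compact model can be practical for the task.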