This study presents three large de-identified medical text datasets derived from MIMIC-III, named DISCHARGE, ECHO and RADIOLOGY, which contain 50K, 16K and 378K report-summary pairs, respectively. We implement strong baselines for automated abstractive summarization on these datasets using pre-trained encoder-decoder language models, including BERT2BERT, T5-large and BART. Further, building on BART, we use summaries sampled from the training set as prior-knowledge guidance: the encoder produces additional contextual representations of the guidance, which are used to enhance the representations in the decoder. Experimental results show that the proposed method improves ROUGE scores and BERTScore, outperforming the larger T5-large model.
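For readers unfamiliar with the evaluation metric, the following is a minimal sketch of ROUGE-1 F1, the unigram-overlap variant of ROUGE. It assumes lowercased whitespace tokenization and no stemming, so its values may differ from those of the official ROUGE toolkit or of whichever implementation the study used; the function name `rouge1_f1` and the example strings are illustrative and are not taken from the datasets.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate summary and a reference summary.

    Assumption: lowercased whitespace tokenization, no stemming.
    """
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0

    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Illustrative strings only, not drawn from DISCHARGE, ECHO or RADIOLOGY.
    generated = "no acute process"
    reference = "no acute cardiopulmonary process"
    print(f"ROUGE-1 F1: {rouge1_f1(generated, reference):.3f}")
```

Here the generated summary recovers three of the four reference unigrams with perfect precision, so the score reflects only the missing word. BERTScore, by contrast, compares contextual embeddings rather than surface tokens and is not covered by this sketch.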