Recent advances in large language models have renewed interest in natural language processing for healthcare using the free text of clinical notes. A distinguishing characteristic of clinical notes is that they span a long period of time across multiple lengthy documents. This structure creates a new design choice: when the context length of a language model predictor is limited, which parts of the clinical notes should be chosen as input? Existing studies either select inputs using domain knowledge or simply truncate them. We propose a framework for analyzing which sections of the notes have high predictive power. Using MIMIC-III, we show that 1) the distribution of predictive power differs between nursing notes and discharge notes, and 2) combining different types of notes could improve performance when the context length is large. Our findings suggest that a carefully selected sampling function could enable more efficient information extraction from clinical notes.