Large language models (LLMs) for medicine have largely been evaluated on short texts, and their ability to handle longer sequences, such as a complete electronic health record (EHR), has not been systematically explored. Assessing these models on long sequences is crucial because prior work in the general domain has shown that LLM performance degrades on longer texts. Motivated by this, we introduce LongBoX, a collection of seven medical datasets in text-to-text format, designed to investigate model performance on long sequences. Preliminary experiments reveal that both medical LLMs (e.g., BioGPT) and strong general-domain LLMs (e.g., FLAN-T5) struggle on this benchmark. We further evaluate two techniques designed for long-sequence handling: (i) local-global attention and (ii) Fusion-in-Decoder (FiD). The results are mixed: while scores on some datasets increase, there is substantial room for improvement. We hope that LongBoX facilitates the development of more effective long-sequence techniques for the medical domain. Data and source code are available at https://github.com/Mihir3009/LongBoX.