MedOdyssey: A Medical Domain Benchmark for Long Context Evaluation Up to 200K Tokens

Numerous advanced Large Language Models (LLMs) now support context lengths of up to 128K tokens, and some extend to 200K. Several general-domain benchmarks have followed suit in evaluating long-context capabilities. Medical tasks, however, are distinctive: they involve domain-specific contexts and require domain expertise, so they call for separate evaluation. Yet although long texts are common in medical scenarios, benchmarks that evaluate the long-context capabilities of LLMs in this field remain rare. In this paper, we propose MedOdyssey, the first medical long-context benchmark, with seven length levels ranging from 4K to 200K tokens. MedOdyssey consists of two primary components: a medical-context "needles in a haystack" task and a series of tasks specific to medical applications, together comprising 10 datasets. The first component includes challenges such as counter-intuitive reasoning and the injection of novel (unknown) facts, which mitigate knowledge leakage and data contamination in LLMs. The second component addresses the challenge of requiring professional medical expertise. In particular, we design the ``Maximum Identical Context'' principle to improve fairness by guaranteeing that different LLMs observe as many identical contexts as possible. Our experiments evaluate advanced proprietary and open-source LLMs tailored for processing long contexts and present detailed performance analyses. The results show that LLMs still face challenges in this area and that further research is needed. Our code and data are released in the repository: \url{https://github.com/JOHNNY-fans/MedOdyssey}.
