Multimodal in-context learning (ICL) remains underexplored despite significant potential for domains such as medicine. Clinicians routinely encounter diverse, specialized tasks requiring adaptation from limited examples, such as drawing insights from a few relevant prior cases or considering a constrained set of differential diagnoses. While multimodal large language models (MLLMs) have shown advances in medical visual question answering (VQA), their ability to learn multimodal tasks from context is largely unknown. We introduce SMMILE, the first expert-driven multimodal ICL benchmark for medical tasks. Eleven medical experts curated problems, each including a multimodal query and multimodal in-context examples as task demonstrations. SMMILE encompasses 111 problems (517 question-image-answer triplets) covering 6 medical specialties and 13 imaging modalities. We further introduce SMMILE++, an augmented variant with 1038 permuted problems. A comprehensive evaluation of 15 MLLMs demonstrates that most models exhibit moderate to poor multimodal ICL ability in medical tasks. In open-ended evaluations, ICL contributes only an 8% average improvement over zero-shot on SMMILE and 9.4% on SMMILE++. We observe a susceptibility to irrelevant in-context examples: even a single noisy or irrelevant example can degrade performance by up to 9.5%. Moreover, we observe that MLLMs are affected by a recency bias, where placing the most relevant example last can lead to substantial performance improvements of up to 71%. Our findings highlight critical limitations and biases in current MLLMs when learning multimodal medical tasks from context. SMMILE is available at https://smmile-benchmark.github.io.
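To make the problem structure concrete, the following is a minimal sketch of how a benchmark problem of this shape, a multimodal query preceded by multimodal in-context demonstrations, might be assembled into an interleaved prompt. The field names (`image`, `question`, `answer`) and the `<image:...>` placeholder convention are illustrative assumptions, not SMMILE's actual schema or any model's API.

```python
# Hypothetical sketch: assemble an interleaved multimodal ICL prompt from
# (image, question, answer) demonstrations plus a final unanswered query.
# Field names and the <image:...> placeholder are assumptions for illustration.

def build_icl_prompt(examples, query):
    """Interleave in-context demonstrations, then append the query
    with its answer slot left empty for the model to complete."""
    parts = []
    for ex in examples:
        parts.append(
            f"<image:{ex['image']}>\nQ: {ex['question']}\nA: {ex['answer']}"
        )
    # The query follows the same format but omits the answer.
    parts.append(f"<image:{query['image']}>\nQ: {query['question']}\nA:")
    return "\n\n".join(parts)


# One hypothetical problem: a single demonstration and a held-out query.
problem = {
    "examples": [
        {
            "image": "cxr_001.png",
            "question": "What abnormality is shown?",
            "answer": "Pneumothorax",
        },
    ],
    "query": {"image": "cxr_002.png", "question": "What abnormality is shown?"},
}

prompt = build_icl_prompt(problem["examples"], problem["query"])
print(prompt)
```

Ordering the demonstrations within `examples` is where effects like the recency bias described above would come into play: moving the most relevant demonstration to the end of the list changes only the prompt order, not its content.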