We introduce MultiMedEval, an open-source toolkit for fair and reproducible evaluation of large medical vision-language models (VLMs). MultiMedEval comprehensively assesses model performance on six multi-modal tasks, conducted over 23 datasets and spanning over 11 medical domains. The tasks and performance metrics were chosen for their widespread adoption in the community and for their diversity, so that the evaluation gives a thorough picture of each model's overall generalizability. The Python toolkit (github.com/corentin-ryr/MultiMedEval) has a simple interface and setup process, enabling the evaluation of any VLM in just a few lines of code. Our goal is to simplify the intricate landscape of VLM evaluation, thus promoting fair and uniform benchmarking of future models.
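To illustrate what evaluating "any VLM in just a few lines of code" could look like, here is a minimal, self-contained sketch of a model-agnostic evaluation loop. It does not use the MultiMedEval API: the names `Sample`, `evaluate`, `always_yes`, and `toy_datasets` are hypothetical, the samples are text-only for brevity, and exact-match accuracy stands in for the toolkit's task-specific metrics.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sample:
    """One question/answer pair (hypothetical; images omitted for brevity)."""
    question: str
    answer: str


# Assumption: any model can be wrapped as a function that maps a batch of
# prompts to a batch of answer strings.
Model = Callable[[List[str]], List[str]]


def evaluate(model: Model,
             datasets: Dict[str, List[Sample]],
             batch_size: int = 2) -> Dict[str, float]:
    """Return case-insensitive exact-match accuracy per dataset."""
    scores: Dict[str, float] = {}
    for name, samples in datasets.items():
        correct = 0
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            predictions = model([s.question for s in batch])
            correct += sum(
                pred.strip().lower() == s.answer.strip().lower()
                for pred, s in zip(predictions, batch)
            )
        scores[name] = correct / len(samples) if samples else 0.0
    return scores


def always_yes(prompts: List[str]) -> List[str]:
    """Stand-in for a real VLM: answers "yes" to every prompt."""
    return ["yes" for _ in prompts]


# Toy data, invented purely for illustration.
toy_datasets = {
    "toy_vqa": [
        Sample("Is there a fracture?", "yes"),
        Sample("Is the heart enlarged?", "no"),
        Sample("Is a pleural effusion present?", "Yes"),
    ]
}

results = evaluate(always_yes, toy_datasets)
print(results)
```

Because the model is reduced to a single callable, the same loop can score any model on any set of datasets, which is the property that makes uniform benchmarking possible.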