How Good LLMs Are at Answering Bangla Medical Visual Questions? Dataset and Benchmarking

Recent advancements in Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have enabled general-purpose systems to demonstrate promising capabilities in complex reasoning tasks, including those in the medical domain. Medical Visual Question Answering (MedVQA) has particularly benefited from these developments. However, although Bangla is one of the most widely spoken languages in the world, no established MedVQA benchmark exists for it. To address this gap, we introduce BanglaMedVQA, a dataset of clinically validated image-question-answer pairs, along with a comprehensive evaluation of current foundation models on this resource. Prior work reports low performance of current models on English MedVQA benchmarks; our analysis shows that performance on Bangla is substantially lower still, reflecting the challenges inherent to low-resource languages. Even top-performing models such as Gemini and GPT-4.1 mini fail to accurately answer specialized diagnostic questions, indicating severe limitations in fine-grained medical reasoning. Although certain open-source models, such as Gemma-3, occasionally outperform these models in general categories, they too struggle with clinically complex questions, underscoring the urgent need for more rigorous evaluation methods.

