Recent advancements in Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have enabled general-purpose systems to demonstrate promising capabilities in complex reasoning tasks, including those in the medical domain. Medical Visual Question Answering (MedVQA) has particularly benefited from these developments. However, although Bangla is one of the most widely spoken languages in the world, no established MedVQA benchmark exists for it. To address this gap, we introduce BanglaMedVQA, a dataset of clinically validated image-question-answer pairs, along with a comprehensive evaluation of current foundation models on this resource. Prior work reports low performance of current models on English MedVQA benchmarks; our analysis shows that performance on Bangla is substantially lower still, reflecting the challenges inherent to low-resource languages. Even top-performing models such as Gemini and GPT-4.1 mini fail to accurately answer specialized diagnostic questions, indicating severe limitations in fine-grained medical reasoning. Although certain open-source models, such as Gemma-3, occasionally outperform these models in general categories, they too struggle with clinically complex questions, underscoring the urgent need for more rigorous evaluation methods.