Small vision-language models (2-8B) are well-suited for clinical deployment due to privacy constraints, limited connectivity, and low-latency requirements favouring on-device or on-premise inference. However, their limited capacity exacerbates the generation of plausible but incorrect outputs. We extend game-theoretic decoding, previously restricted to text-only, closed-ended NLP tasks, to vision-language models for open-ended Medical VQA. We introduce a semantically aware Wasserstein stopping criterion that replaces lexical order matching, enabling convergence based on semantic consensus among near-synonymous candidate answers and avoiding unnecessary iterations caused by clinically equivalent ranking swaps. On VQA-RAD and PathVQA, we obtain consistent, statistically significant improvements over greedy and discriminative baselines. On VQA-RAD, we improve Qwen3-VL-2B by +3.5 percentage points (p < 0.01), surpassing the greedy 4B model, with similar trends at larger scales. On PathVQA, Gemma-3-4B with BDG matches MedGemma-4B under greedy decoding despite no domain-specific fine-tuning. At accuracy parity with classic BDG, the Wasserstein criterion reduces average convergence iterations by approximately 20%, improving inference efficiency while preserving the game-theoretic equilibrium behaviour. Code is available at https://github.com/luca-hagen/Wasserstein-BDG-medical-VQA.
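The contrast between lexical order matching and a semantically aware Wasserstein stopping rule can be illustrated with a minimal sketch. The abstract does not specify which distributions are compared, how the ground cost is built, or what threshold is used, so everything below is an assumption: `p_gen` and `p_disc` stand for two hypothetical probability vectors over the same candidate answers, `cost` is a hypothetical semantic distance matrix (for example, cosine distance between answer embeddings), and `tau` is an illustrative threshold. The function names are invented for this example and are not taken from the released code.

```python
import math

def wasserstein(p, q, cost, eps=1e-12):
    """Exact optimal-transport cost between discrete distributions p and q.

    Solved as a min-cost flow (successive shortest paths with Bellman-Ford),
    which is adequate for the handful of candidates in a decoding game.
    """
    n, m = len(p), len(q)
    src, sink, size = n + m, n + m + 1, n + m + 2
    edges, graph = [], [[] for _ in range(size)]

    def add_edge(u, v, cap, c):
        # Edge k and its residual twin k ^ 1 are stored side by side.
        graph[u].append(len(edges)); edges.append([v, cap, c])
        graph[v].append(len(edges)); edges.append([u, 0.0, -c])

    for i in range(n):
        add_edge(src, i, p[i], 0.0)                   # supply of candidate i
        for j in range(m):
            add_edge(i, n + j, math.inf, cost[i][j])  # move mass i -> j
    for j in range(m):
        add_edge(n + j, sink, q[j], 0.0)              # demand of candidate j

    total = 0.0
    while True:
        # Bellman-Ford: cheapest residual path from src to sink.
        dist, prev = [math.inf] * size, [-1] * size
        dist[src] = 0.0
        for _ in range(size - 1):
            changed = False
            for u in range(size):
                if dist[u] == math.inf:
                    continue
                for k in graph[u]:
                    v, cap, c = edges[k]
                    if cap > eps and dist[u] + c < dist[v] - 1e-15:
                        dist[v], prev[v], changed = dist[u] + c, k, True
            if not changed:
                break
        if dist[sink] == math.inf:
            return total  # all mass has been transported
        # Bottleneck capacity along the path, then push that much flow.
        flow, v = math.inf, sink
        while v != src:
            flow = min(flow, edges[prev[v]][1])
            v = edges[prev[v] ^ 1][0]
        v = sink
        while v != src:
            k = prev[v]
            edges[k][1] -= flow
            edges[k ^ 1][1] += flow
            v = edges[k ^ 1][0]
        total += flow * dist[sink]

def same_ranking(p, q):
    """Lexical-style criterion: both players rank the candidates identically."""
    order = lambda d: sorted(range(len(d)), key=lambda i: -d[i])
    return order(p) == order(q)

def should_stop(p, q, cost, tau=0.05):
    """Semantic criterion (assumed form): stop once the transport cost is small."""
    return wasserstein(p, q, cost) <= tau

# Hypothetical candidates: "pneumonia", "lung infection", "fracture".
cost = [[0.0, 0.1, 0.9],
        [0.1, 0.0, 0.9],
        [0.9, 0.9, 0.0]]
p_gen = [0.5, 0.4, 0.1]
p_disc = [0.4, 0.5, 0.1]  # top two near-synonyms swapped
```

In this toy setting the two rankings differ, so an order-matching rule would keep iterating, whereas the transport cost is only about 0.01 (0.1 of probability mass moved across a semantic distance of 0.1), so `should_stop(p_gen, p_disc, cost)` would end the game. A disagreement between semantically distant answers, such as `[0.1, 0.4, 0.5]` against `p_gen`, yields a much larger cost and does not trigger the stop.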