This paper introduces Multiple Choice Reasoning via Process of Elimination using Multi-Modal Models, hereafter referred to as Multi-Modal Process of Elimination (MM-PoE). The method is designed to improve the performance of Vision-Language Models (VLMs) on multiple-choice visual reasoning tasks. Unlike conventional approaches that evaluate each option independently, MM-PoE uses a two-step scoring scheme: it first identifies and excludes implausible choices, then concentrates on the most probable remaining options. This mirrors a common human test-taking strategy, in which clearly incorrect answers are eliminated before the best response is selected. Our empirical evaluations, conducted across three benchmark datasets, show that MM-PoE significantly improves both the zero-shot and few-shot performance of contemporary state-of-the-art VLMs. The approach extends the process of elimination to multi-modal contexts and supports few-shot experiments, thereby addressing two principal limitations of prior PoE work, which was restricted to zero-shot settings and to language-only models. As a result, MM-PoE not only refines the reasoning capabilities of VLMs but also broadens their applicability to complex visual question-answering scenarios. All code and documentation supporting our work are available at https://pypi.org/project/mm-poe/, enabling researchers and practitioners to integrate and further develop these techniques.
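The two-step scheme described above can be illustrated with a minimal sketch. It is not the API of the mm-poe package: the `Scorer` callable, the function names, and the toy scorer are hypothetical stand-ins for a VLM that scores each option given the image and question. The elimination rule used here (drop options scoring below the mean) is likewise an assumption, since the abstract does not state the threshold.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical scorer: takes the option texts (eliminated options are None)
# and returns one score per option; higher means more plausible.
# In practice this would wrap a VLM conditioned on the image and question.
Scorer = Callable[[Sequence[Optional[str]]], List[float]]


def eliminate(scores: Sequence[float]) -> List[bool]:
    """Step 1: flag which options survive elimination.

    Assumed rule: keep options scoring at or above the mean. The top-scoring
    option is always kept, which guards against floating-point rounding
    pushing the mean above every score when all scores are equal.
    """
    mean = sum(scores) / len(scores)
    best = max(scores)
    return [s >= mean or s == best for s in scores]


def mm_poe_predict(options: Sequence[str], scorer: Scorer) -> int:
    """Return the index of the chosen option after elimination and re-scoring."""
    first_pass = scorer(list(options))
    keep = eliminate(first_pass)
    # Step 2: mask eliminated options and score again on the reduced set.
    masked = [opt if k else None for opt, k in zip(options, keep)]
    second_pass = scorer(masked)
    survivors = [i for i, k in enumerate(keep) if k]
    return max(survivors, key=lambda i: second_pass[i])


def toy_scorer(options: Sequence[Optional[str]]) -> List[float]:
    """Toy stand-in for a VLM: fixed scores, with 'dog' favoured once
    distractors have been masked out."""
    table = {"cat": -1.0, "dog": -1.2, "car": -5.0, "tree": -4.0}
    narrowed = any(opt is None for opt in options)
    scores = []
    for opt in options:
        if opt is None:
            scores.append(float("-inf"))  # eliminated options cannot win
        elif narrowed and opt == "dog":
            scores.append(-0.5)
        else:
            scores.append(table[opt])
    return scores


if __name__ == "__main__":
    print(mm_poe_predict(["cat", "dog", "car", "tree"], toy_scorer))  # prints 1
```

In this toy run, independent scoring would pick "cat" (index 0), whereas the second pass over the reduced option set picks "dog" (index 1). This shows how re-scoring after elimination can change the prediction.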