Test-time compute (TTC) strategies have emerged as a lightweight approach to boost reasoning in large language models (LLMs). However, their application and benefits for vision-language models (VLMs) remain underexplored. We present a systematic study of TTC across seven VLMs and six benchmarks, specifically analyzing feature-based scoring and majority voting methods. We find that feature heuristics fail and voting yields only modest gains in single-model settings. We theoretically show that this limitation stems from a lack of prediction diversity: when outputs are highly correlated, voting provides little benefit. In contrast, multi-model ensembles offer richer diversity, yet standard majority voting fails to account for varying model capabilities. To address this, we propose Entropy-based TTC (ETTC), which selects the most confident prediction based on predictive entropy. Our method reduces to majority voting in the single-model case, but in model ensembles, it leverages confidence disparities to prioritize stronger models. We prove that ETTC outperforms majority voting under mild assumptions and empirically demonstrate that it consistently surpasses both voting and the best individual model. Crucially, our results show that smaller models can synergistically enhance larger ones, unlocking ensembling gains not achievable with standard strategies.
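To make the selection rule concrete, the following is a minimal sketch of entropy-based selection, not the authors' implementation. It rests on assumptions the abstract does not confirm: each model is sampled several times on the same question, predictive entropy is estimated from the empirical distribution of the sampled answers, and the final answer is the majority answer of the lowest-entropy model. The function names (`answer_entropy`, `entropy_select`, `pooled_majority_vote`) and the model labels are hypothetical.

```python
import math
from collections import Counter


def answer_entropy(samples):
    """Shannon entropy (in nats) of the empirical answer distribution."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def majority_vote(samples):
    """Most frequent answer among the samples."""
    return Counter(samples).most_common(1)[0][0]


def pooled_majority_vote(samples_by_model):
    """Baseline: pool every model's samples and take a plain majority vote."""
    pooled = [a for samples in samples_by_model.values() for a in samples]
    return majority_vote(pooled)


def entropy_select(samples_by_model):
    """Assumed selection rule: trust the model whose sampled answers have
    the lowest entropy, and return that model's majority answer."""
    best_model = min(
        samples_by_model,
        key=lambda name: answer_entropy(samples_by_model[name]),
    )
    return best_model, majority_vote(samples_by_model[best_model])


# Hypothetical sampled answers for one multiple-choice question.
samples_by_model = {
    "strong": ["A", "A", "A", "A", "B", "A"],   # confident, low entropy
    "weak_1": ["B", "B", "B", "C", "C", "D"],   # scattered, high entropy
    "weak_2": ["B", "B", "B", "C", "D", "D"],   # scattered, high entropy
}

chosen_model, chosen_answer = entropy_select(samples_by_model)
pooled_answer = pooled_majority_vote(samples_by_model)
```

In this toy input, pooled voting is swayed by the two less confident models, while the entropy rule follows the single confident one. With only one model in the dictionary, `entropy_select` returns that model's majority answer, which matches the reduction to majority voting stated in the abstract.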