As large multimodal models (LMMs) are increasingly deployed across diverse applications, the need for adaptable, real-world model ranking has become paramount. Traditional evaluation methods are largely dataset-centric: they rely on fixed, labeled datasets and supervised metrics, which are resource-intensive to build and may not generalize to novel scenarios. This motivates unsupervised ranking. In this work, we explore unsupervised model ranking for LMMs by leveraging their uncertainty signals, such as softmax probabilities. We evaluate state-of-the-art LMMs (e.g., LLaVA) across visual question answering benchmarks, analyzing how well uncertainty-based metrics reflect model performance. Our findings show that uncertainty scores derived from softmax distributions provide a robust, consistent basis for ranking models across varied tasks. This enables ranking LMMs on real-world, unlabeled data for visual question answering, providing a practical approach to selecting models across diverse domains without manual annotation.
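The core idea can be illustrated with a minimal sketch: score each model by an uncertainty statistic computed from its softmax outputs on unlabeled data, then rank models by that score. The function names (`avg_confidence`, `rank_models`) and the use of mean maximum-softmax probability as the uncertainty signal are illustrative assumptions, not the paper's exact metric.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def avg_confidence(logits):
    # Mean maximum-softmax probability over unlabeled examples;
    # higher values indicate the model is, on average, more confident.
    # (One of several possible uncertainty signals; an assumption here.)
    probs = softmax(logits)
    return probs.max(axis=-1).mean()

def rank_models(model_logits):
    # model_logits: dict mapping model name -> (n_examples, n_classes) logit array.
    # Returns model names sorted from most to least confident.
    scores = {name: avg_confidence(l) for name, l in model_logits.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

In this sketch, a model whose softmax distributions are consistently sharper ranks above one whose distributions are flatter, with no labels required. Other uncertainty signals (e.g., predictive entropy) can be substituted into `avg_confidence` without changing the ranking procedure.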