The ability to compare objects, scenes, or situations is crucial for effective decision-making and problem-solving in everyday life. For instance, comparing the freshness of apples enables better choices during grocery shopping, while comparing sofa designs helps optimize the aesthetics of our living space. Despite its significance, this comparative capability remains largely unexplored in artificial general intelligence (AGI). In this paper, we introduce CompBench, a benchmark designed to evaluate the comparative reasoning capability of multimodal large language models (MLLMs). CompBench mines and pairs images through visually oriented questions covering eight dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. We curate a collection of around 40K image pairs using metadata from diverse vision datasets and CLIP similarity scores. These image pairs span a broad array of visual domains, including animals, fashion, sports, and both outdoor and indoor scenes. The questions are carefully crafted to discern relative characteristics between two images and are labeled by human annotators for accuracy and relevance. We use CompBench to evaluate recent MLLMs, including GPT-4V(ision), Gemini-Pro, and LLaVA-1.6. Our results reveal notable shortcomings in their comparative abilities. We believe CompBench not only sheds light on these limitations but also establishes a solid foundation for future enhancements in the comparative capability of MLLMs.
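The abstract states that image pairs are curated using CLIP similarity scores. The paper does not specify the pairing procedure, but one plausible reading is a similarity-band filter: pair images that are similar enough to be comparable yet not near-duplicates. The sketch below illustrates this idea under stated assumptions; the function name, the thresholds, and the use of plain NumPy vectors in place of real CLIP embeddings are all hypothetical.

```python
import numpy as np

def pair_by_similarity_band(embeddings, ids, lo=0.8, hi=0.95):
    """Hypothetical sketch: pair items whose cosine similarity falls in [lo, hi).

    `embeddings` is an (N, D) array; in practice these would be CLIP image
    features, but any row vectors work. Pairs below `lo` are too dissimilar
    to compare; pairs at or above `hi` are treated as near-duplicates.
    """
    # L2-normalize rows so the dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    sim = unit @ unit.T

    pairs = []
    n = len(ids)
    for i in range(n):
        for j in range(i + 1, n):
            if lo <= sim[i, j] < hi:
                pairs.append((ids[i], ids[j], float(sim[i, j])))
    return pairs

# Toy example with hand-built unit vectors of known cosine similarity:
# cos(a, b) = 0.9 (kept), cos(a, c) = 0.0 and cos(b, c) ≈ 0.44 (dropped).
embs = np.array([
    [1.0, 0.0],
    [0.9, np.sqrt(0.19)],
    [0.0, 1.0],
])
pairs = pair_by_similarity_band(embs, ["a", "b", "c"])
```

This is only a sketch of one way CLIP scores could drive pairing; the actual CompBench pipeline also uses dataset metadata, which is not modeled here.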