Benchmarks for large multimodal language models (MLMs) now assess a model's general capabilities all at once rather than evaluating one specific capability. As a result, developers choosing a model for their application are overwhelmed by the number of benchmarks and remain uncertain about which benchmark's results best reflect their specific use case. This paper introduces Task-Me-Anything, a benchmark generation engine that produces a benchmark tailored to a user's needs. Task-Me-Anything maintains an extendable taxonomy of visual assets and can programmatically generate a vast number of task instances. It also answers user queries about MLM performance algorithmically and efficiently within a computational budget. It contains 113K images, 10K videos, 2K 3D object assets, over 365 object categories, 655 attributes, and 335 relationships, and can generate 750M image/video question-answering pairs that focus on evaluating MLM perceptual capabilities. Task-Me-Anything reveals critical insights: open-source MLMs excel at object and attribute recognition but lack spatial and temporal understanding; each model exhibits unique strengths and weaknesses; larger models generally perform better, though exceptions exist; and GPT4o struggles to recognize rotating/moving objects and to distinguish colors.
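To make the idea of programmatic task generation concrete, the following is a minimal sketch, not the Task-Me-Anything implementation. It assumes a toy, hypothetical taxonomy (`TAXONOMY`) that maps object categories to attribute values, and it enumerates one multiple-choice question per object/attribute pair, drawing distractors from the other objects. The names `TaskInstance` and `generate_tasks` are illustrative assumptions; the real system also renders images and videos, which this sketch omits.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInstance:
    """One multiple-choice question about a single object attribute."""
    question: str
    options: tuple
    answer: str


# Hypothetical toy taxonomy: object category -> {attribute name: value}.
TAXONOMY = {
    "apple": {"color": "red", "shape": "round"},
    "banana": {"color": "yellow", "shape": "curved"},
    "brick": {"color": "brown", "shape": "rectangular"},
}


def generate_tasks(taxonomy, seed=0):
    """Enumerate one question per (object, attribute) pair in the taxonomy."""
    rng = random.Random(seed)  # seeded so the generated benchmark is reproducible
    tasks = []
    for obj, attrs in sorted(taxonomy.items()):
        for attr_name, value in sorted(attrs.items()):
            # Distractors: values of the same attribute taken from other objects.
            distractors = sorted(
                {
                    other_attrs[attr_name]
                    for other_obj, other_attrs in taxonomy.items()
                    if other_obj != obj and attr_name in other_attrs
                }
                - {value}
            )
            options = distractors + [value]
            rng.shuffle(options)
            tasks.append(
                TaskInstance(
                    question=f"What is the {attr_name} of the {obj}?",
                    options=tuple(options),
                    answer=value,
                )
            )
    return tasks


if __name__ == "__main__":
    for task in generate_tasks(TAXONOMY):
        print(task.question, task.options, "->", task.answer)
```

Because the number of instances grows with the product of objects, attributes, and templates, even a modest taxonomy yields a large task space, which is consistent with the scale the abstract reports for the full asset collection.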