Task Me Anything

Benchmarks for large multimodal language models (MLMs) now serve to simultaneously assess the general capabilities of models instead of evaluating for a specific capability. As a result, when a developer wants to identify which models to use for their application, they are overwhelmed by the number of benchmarks and remain uncertain about which benchmark's results are most reflective of their specific use case. This paper introduces Task-Me-Anything, a benchmark generation engine which produces a benchmark tailored to a user's needs. Task-Me-Anything maintains an extendable taxonomy of visual assets and can programmatically generate a vast number of task instances. Additionally, it algorithmically addresses user queries regarding MLM performance efficiently within a computational budget. It contains 113K images, 10K videos, 2K 3D object assets, over 365 object categories, 655 attributes, and 335 relationships. It can generate 750M image/video question-answering pairs, which focus on evaluating MLM perceptual capabilities. Task-Me-Anything reveals critical insights: open-source MLMs excel in object and attribute recognition but lack spatial and temporal understanding; each model exhibits unique strengths and weaknesses; larger models generally perform better, though exceptions exist; and GPT4o demonstrates challenges in recognizing rotating/moving objects and distinguishing colors.
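
To make the idea of programmatic task generation concrete, here is a minimal sketch, not the Task-Me-Anything API. The names `TaskInstance`, `generate_color_tasks`, and `estimate_accuracy`, and the toy `OBJECTS`/`COLORS` taxonomy, are hypothetical and introduced only for illustration. The budget handling uses plain random subsampling as a stand-in; the abstract does not say which algorithms the engine actually uses to answer queries within a budget.

```python
import itertools
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInstance:
    """One multiple-choice question about a (hypothetical) synthetic scene."""
    scene: tuple      # ((object, color),) describing what the scene contains
    question: str
    answer: str
    options: tuple


# Toy taxonomy (assumption): the real engine covers hundreds of
# object categories, attributes, and relationships.
OBJECTS = ["mug", "chair", "lamp"]
COLORS = ["red", "blue", "green"]


def generate_color_tasks(objects, colors, n_options=3, seed=0):
    """Enumerate every object/color pair and emit one question per pair."""
    rng = random.Random(seed)
    tasks = []
    for obj, color in itertools.product(objects, colors):
        # Distractors are the other colors in the taxonomy.
        distractors = [c for c in colors if c != color]
        options = rng.sample(distractors, n_options - 1) + [color]
        rng.shuffle(options)
        tasks.append(TaskInstance(
            scene=((obj, color),),
            question=f"What color is the {obj}?",
            answer=color,
            options=tuple(options),
        ))
    return tasks


def estimate_accuracy(model, tasks, budget, seed=0):
    """Estimate accuracy from at most `budget` randomly sampled tasks.

    Random subsampling is a simple stand-in for budget-aware evaluation,
    not the method described in the paper.
    """
    rng = random.Random(seed)
    sample = rng.sample(tasks, min(budget, len(tasks)))
    if not sample:
        raise ValueError("budget must allow at least one task")
    correct = sum(model(task) == task.answer for task in sample)
    return correct / len(sample)
```

Here `model` is any callable that maps a `TaskInstance` to an answer string. The number of generated tasks grows with the product of the taxonomy sizes, which is why a small asset pool can yield a very large task space and why evaluating on a sampled subset becomes necessary.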

