Perception Test 2025: Challenge Summary and a Unified VQA Extension

The Third Perception Test challenge was organised as a full-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2025. Its primary goal is to benchmark state-of-the-art video models and measure progress in multimodal perception. This year, the workshop also featured two guest tracks: KiVA (an image understanding challenge) and Physic-IQ (a video generation challenge). In this report, we summarise the results of the main Perception Test challenge, detailing both the existing tasks and the novel additions to the benchmark. In this iteration, we emphasised task unification, as it poses a more challenging test for current state-of-the-art multimodal models. The challenge included five consolidated tracks: unified video QA, unified object and point tracking, unified action and sound localisation, grounded video QA, and hour-long video QA, alongside an analysis and interpretability track that is still open for submissions. Notably, the unified video QA track introduced a novel subset that reformulates traditional perception tasks (such as point tracking and temporal action localisation) as multiple-choice video QA questions that video-language models can tackle natively. The unified object and point tracking track merged the original object tracking and point tracking tasks, whereas the unified action and sound localisation track merged the original temporal action localisation and temporal sound localisation tracks. Accordingly, we required competitors to use unified approaches rather than engineered pipelines built from task-specific models. By proposing such a unified challenge, Perception Test 2025 highlights the significant difficulties that existing models face when tackling diverse perception tasks through unified interfaces.
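To make the reformulation idea concrete, the following is a minimal sketch of how a temporal action localisation annotation could be turned into a multiple-choice video QA item. It is an illustration under stated assumptions, not the procedure used by the challenge organisers: the `ActionSegment` type, the `segment_to_mcq` function, the question wording, and the distractor strategy (same-length intervals sampled elsewhere in the video, with a hypothetical IoU cap) are all invented for this example.

```python
import random
from dataclasses import dataclass


@dataclass
class ActionSegment:
    """Hypothetical temporal action localisation annotation (times in seconds)."""
    label: str
    start_s: float
    end_s: float


def interval_iou(a, b):
    """Intersection-over-union of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def segment_to_mcq(segment, video_duration_s, num_options=4, max_iou=0.3, seed=0):
    """Build one multiple-choice question from one annotated segment.

    Assumption: distractors are intervals of the same length as the ground
    truth, sampled uniformly in the video and rejected if they overlap any
    already chosen option by more than max_iou.
    """
    length = segment.end_s - segment.start_s
    if length <= 0 or length > video_duration_s:
        raise ValueError("segment must have positive length and fit in the video")

    rng = random.Random(seed)
    correct = (segment.start_s, segment.end_s)
    options = [correct]

    # Rejection sampling with an attempt cap so the loop always terminates.
    for _ in range(1000):
        if len(options) == num_options:
            break
        start = round(rng.uniform(0.0, video_duration_s - length), 1)
        candidate = (start, round(start + length, 1))
        if all(interval_iou(candidate, o) <= max_iou for o in options):
            options.append(candidate)
    if len(options) < num_options:
        raise ValueError("could not sample enough distinct distractors")

    rng.shuffle(options)
    letters = "ABCDEFGH"[:len(options)]
    return {
        "question": f"When does the action '{segment.label}' occur in the video?",
        "choices": {k: f"{s:.1f}s to {e:.1f}s" for k, (s, e) in zip(letters, options)},
        "answer": letters[options.index(correct)],
    }


if __name__ == "__main__":
    item = segment_to_mcq(ActionSegment("pouring water", 12.0, 15.5), 30.0)
    print(item["question"])
    for key, text in item["choices"].items():
        print(f"  {key}. {text}")
    print("answer:", item["answer"])
```

Casting the task this way lets a video-language model answer with a single option letter instead of regressing timestamps, which is the property the abstract attributes to the new subset; how the actual benchmark constructs its questions and distractors is not specified here.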