The Third Perception Test challenge was organised as a full-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2025. Its primary goal is to benchmark state-of-the-art (SOTA) video models and measure progress in multimodal perception. This year, the workshop also featured two guest tracks: KiVA (an image understanding challenge) and Physic-IQ (a video generation challenge). In this report, we summarise the results of the main Perception Test challenge, detailing both the existing tasks and the novel additions to the benchmark. In this iteration, we emphasised task unification, as it poses a more challenging test for current SOTA multimodal models. The challenge included five consolidated tracks: unified video QA, unified object and point tracking, unified action and sound localisation, grounded video QA, and hour-long video QA, alongside an analysis and interpretability track that is still open for submissions. Notably, the unified video QA track introduced a novel subset that reformulates traditional perception tasks (such as point tracking and temporal action localisation) as multiple-choice video QA questions that video-language models can tackle natively. The unified object and point tracking track merged the original object tracking and point tracking tasks, whereas the unified action and sound localisation track merged the original temporal action localisation and temporal sound localisation tracks. Accordingly, we required competitors to use unified approaches rather than engineered pipelines built from task-specific models. By proposing such a unified challenge, Perception Test 2025 highlights the significant difficulties existing models face when tackling diverse perception tasks through unified interfaces.