We introduce PerceptionComp, a manually annotated benchmark for complex, long-horizon, perception-centric video reasoning. PerceptionComp is designed so that no single moment is sufficient: answering each question requires combining multiple temporally separated pieces of visual evidence under compositional constraints expressed through conjunctive and sequential logic. The questions span perceptual subtasks covering objects, attributes, relations, locations, actions, and events, and require skills including semantic recognition, visual correspondence, temporal reasoning, and spatial reasoning. The benchmark contains 1,114 highly complex questions on 279 videos from diverse domains including city walk tours, indoor villa tours, video games, and extreme outdoor sports, with 100% manual annotation. Human studies show that PerceptionComp requires substantial test-time thinking and repeated perception steps: participants take much longer than on prior benchmarks, and their accuracy drops to near chance (18.97%) when rewatching is disallowed. State-of-the-art MLLMs also perform substantially worse on PerceptionComp than on existing benchmarks: the best model in our evaluation, Gemini-3-Flash, reaches only 45.96% accuracy in the five-choice setting, while open-source models remain below 40%. These results suggest that perception-centric long-horizon video reasoning remains a major bottleneck, and we hope PerceptionComp will help drive progress in perceptual reasoning.