The VRR-QA challenge evaluates visual relational reasoning in videos, where answers often depend on implicit spatial relations, event boundaries, target identity, and dialogue context rather than a single salient frame. We present a test-time reasoning pipeline built around a strong GPT-5.5 video QA solver and a set of question-aware evidence ledgers. The initial solver answers each question from a uniform video representation, while routed ledgers are prompted to make the required targets, count units, reference frames, and temporal or spatial scope explicit for counting, spatial, endpoint, viewpoint, and dialogue reasoning. External tools such as open-vocabulary detection, depth cues, pair crops, ASR, and scene-graph ledgers are used only as evidence sources. A conservative gate keeps the current answer unless independent evidence uniquely supports a different option. The final evidence-gated pipeline achieves 92.95% overall accuracy and 93.79% macro accuracy on the challenge test split.
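The conservative gate described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `Evidence` record, the `gate` function, and the `min_sources` threshold are hypothetical names introduced here, and "independent evidence" is assumed to mean support from distinct tool sources. The sketch keeps the solver's current answer unless exactly one option is backed by evidence and that option differs from the current one.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Evidence:
    """Hypothetical record: one tool's vote for one answer option."""
    source: str  # assumed label, e.g. "asr", "detector", "depth"
    option: str  # the answer option this piece of evidence supports


def gate(current: str, evidence: Iterable[Evidence], min_sources: int = 1) -> str:
    """Return `current` unless evidence uniquely supports a different option.

    `min_sources` (an assumption, not from the paper) is the number of
    distinct sources an option needs before it counts as supported.
    """
    sources_by_option: dict[str, set[str]] = {}
    for item in evidence:
        sources_by_option.setdefault(item.option, set()).add(item.source)

    # Options backed by enough distinct sources.
    backed = {
        option
        for option, sources in sources_by_option.items()
        if len(sources) >= min_sources
    }

    # Switch only when exactly one option is backed and it is not the current one.
    if len(backed) == 1:
        (winner,) = backed
        if winner != current:
            return winner
    return current
```

With this sketch, conflicting evidence (two different options both supported) or evidence that also supports the current answer leaves the answer unchanged, which is the sense in which the gate is conservative.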