We introduce ReXTime, a benchmark designed to rigorously test AI models' ability to perform temporal reasoning within video events. Specifically, ReXTime focuses on reasoning across time, i.e., human-like understanding of scenarios where the question and its corresponding answer occur in different video segments. This form of reasoning, which requires an advanced understanding of cause-and-effect relationships across video segments, poses a significant challenge even to frontier multimodal large language models. To facilitate this evaluation, we develop an automated pipeline for generating temporal reasoning question-answer pairs, significantly reducing the need for labor-intensive manual annotation. Our benchmark includes 921 carefully vetted validation samples and 2,143 test samples, each manually curated for accuracy and relevance. Evaluation results show that while frontier large language models outperform academic models, they still lag behind human performance by a significant 14.3% accuracy gap. In addition, our pipeline produces a training dataset of 9,695 machine-generated samples without manual effort, and empirical studies suggest that fine-tuning on it can enhance across-time reasoning.