We introduce ReXTime, a benchmark designed to rigorously test AI models' ability to perform temporal reasoning over video events. Specifically, ReXTime focuses on reasoning across time, i.e., human-like understanding when the question and its corresponding answer occur in different video segments. This form of reasoning, which requires an advanced understanding of cause-and-effect relationships across video segments, poses significant challenges even to frontier multimodal large language models. To facilitate this evaluation, we develop an automated pipeline for generating temporal reasoning question-answer pairs, significantly reducing the need for labor-intensive manual annotation. Our benchmark includes 921 carefully vetted validation samples and 2,143 test samples, each manually curated for accuracy and relevance. Evaluation results show that while frontier large language models outperform academic models, they still lag behind human performance by a significant 14.3% accuracy gap. Additionally, our pipeline creates a training dataset of 9,695 machine-generated samples without manual effort, which empirical studies suggest can enhance across-time reasoning via fine-tuning.