This paper reports on the LoViF 2026 PhyScore challenge, a competition on holistic quality assessment of world-model-generated videos across both 2D and 4D generation settings. The challenge is motivated by a central gap in current evaluation practice: perceptual quality alone is insufficient to judge whether generated dynamics are physically plausible, temporally coherent, and consistent with input conditions. Participants are required to build a metric that jointly predicts four dimensions: Video Quality, Physical Realism, Condition-Video Alignment, and Temporal Consistency. In addition, participants must localize the timestamps of physical anomalies for fine-grained diagnosis. The benchmark dataset contains 1,554 videos generated by seven representative world generative models, organized into three tracks (text-to-2D, image-to-4D, and video-to-4D) and spanning 26 categories. These categories explicitly cover physics-relevant scenarios, including dynamics, optics, and thermodynamics, together with diverse real-world and creative content. To ensure label reliability, scores and anomaly timestamps are produced by trained human annotators and then checked in an additional automated quality-control pass. Evaluation covers both score prediction and anomaly localization, using a composite protocol that combines TimeStamp_IOU for localization with SRCC/PLCC for score prediction. This report summarizes the challenge design and provides method-level insights from the submitted solutions.
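As a rough illustration of the ingredients named in the evaluation protocol, the following sketch computes a temporal IoU between two time intervals and the SRCC/PLCC correlations between predicted and human scores. It rests on several assumptions that the abstract does not confirm: anomaly timestamps are treated as single `(start, end)` intervals in seconds, the function names are hypothetical, and the way the challenge weights the three quantities into its composite score is not specified here, so no combination is attempted.

```python
from math import sqrt


def interval_iou(pred, gt):
    """Temporal IoU of two (start, end) intervals (assumed representation)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def pearson(x, y):
    """PLCC: Pearson linear correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """SRCC: Pearson correlation computed on the ranks of x and y."""
    return pearson(average_ranks(x), average_ranks(y))
```

For example, `interval_iou((1.0, 3.0), (2.0, 4.0))` overlaps for one second out of a three-second union, giving an IoU of one third. SRCC only rewards getting the ordering of videos right, whereas PLCC also rewards a linear relationship between predicted and human scores, which is presumably why quality-assessment protocols tend to report both.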