With the rapid growth of video-centered social media, the ability to anticipate risky events from visual data is a promising direction for ensuring public safety and preventing real-world accidents. Prior work has extensively studied supervised video risk assessment across domains such as driving, protests, and natural disasters. However, many existing datasets give models access to the full video sequence, including the accident itself, which substantially reduces the difficulty of the task. To better reflect real-world conditions, we introduce a new video understanding benchmark, RiskCueBench, in which videos are carefully annotated to identify a risk signal clip, defined as the earliest moment that indicates a potential safety concern. Experimental results reveal a significant gap in current systems' ability to interpret evolving situations and anticipate future risky events from early visual signals, highlighting important challenges for deploying video risk prediction models in practice.