Time-to-Collision (TTC) forecasting is a critical task in collision prevention. It requires precise temporal prediction and an understanding of both local and global patterns in a video, across space and time. To address the multi-scale nature of video, we introduce CollideNet, a novel spatiotemporal hierarchical transformer-based architecture tailored for TTC forecasting. In the spatial stream, CollideNet aggregates information from each video frame at multiple resolutions simultaneously. In the temporal stream, in addition to multi-scale feature encoding, CollideNet disentangles the non-stationarity, trend, and seasonality components. Our method sets a new state of the art on three commonly used public datasets, outperforming prior works by a considerable margin. We conduct cross-dataset evaluations to analyze the generalization capabilities of our method, and we visualize the effects of disentangling the trend and seasonality components of the video data. We release our code at https://github.com/DeSinister/CollideNet/.
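To make the idea of separating trend from seasonality concrete, the following is a minimal sketch of one common approach: a centered moving average estimates the trend of a 1D feature sequence, and the residual is treated as the seasonal part. The abstract does not specify how CollideNet performs this disentanglement, so the moving-average scheme, the function names `moving_average_trend` and `decompose`, and the `window` parameter are all illustrative assumptions and not the released implementation.

```python
# Illustrative sketch only: a moving-average trend/seasonal split.
# This is an assumed technique, not CollideNet's actual method.

def moving_average_trend(series, window=5):
    """Estimate the trend of a 1D sequence with a centered moving average."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    if not series:
        return []
    half = window // 2
    # Replicate the end values so the output has the same length as the input.
    padded = [series[0]] * half + list(series) + [series[-1]] * half
    return [sum(padded[i:i + window]) / window for i in range(len(series))]


def decompose(series, window=5):
    """Split a sequence into (trend, seasonal), where seasonal = series - trend."""
    trend = moving_average_trend(series, window)
    seasonal = [x - t for x, t in zip(series, trend)]
    return trend, seasonal
```

By construction, adding the two returned components element-wise recovers the input sequence. A larger `window` yields a smoother trend and pushes more of the variation into the seasonal part.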