Unsupervised video Object-Centric Learning (OCL) is promising as it enables object-level scene representation and understanding, much as humans do. Mainstream video OCL methods adopt a recurrent architecture: an aggregator aggregates the current video frame into object features, termed slots, under some queries; a transitioner transits the current slots into queries for the next frame. This architecture is effective, but all existing implementations (\textit{i1}) neglect to incorporate next-frame features, the most informative source for query prediction, and (\textit{i2}) fail to learn transition dynamics, the knowledge essential for query prediction. To address these issues, we propose Random Slot-Feature pair for learning Query prediction (RandSF.Q): (\textit{t1}) we design a new transitioner that incorporates both slots and features, providing more information for query prediction; (\textit{t2}) we train the transitioner to predict queries from slot-feature pairs randomly sampled from the available recurrences, which drives it to learn transition dynamics. Experiments on scene representation demonstrate that our method surpasses existing video OCL methods significantly, e.g., by up to 10 points on object discovery, setting a new state-of-the-art. This superiority also benefits downstream tasks like scene understanding. Source code, model checkpoints, and training logs: https://github.com/Genera1Z/RandSF.Q
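To make the recurrent architecture and the two contributions concrete, the following is a minimal sketch, assuming a PyTorch-style setup: a transitioner that cross-attends slots to next-frame features (\textit{t1}), and a training loss computed on a slot-feature pair drawn from a random recurrence span (\textit{t2}). All module names, the cross-attention design, and the use of future slots as query-prediction targets are illustrative assumptions, not the authors' exact implementation.
\begin{verbatim}
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class Transitioner(nn.Module):
    """Predict next-frame queries from (current slots, next-frame features)."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Assumption: slots attend to next-frame features via cross-attention.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, slots, feats):
        # slots: (B, K, D) object slots; feats: (B, N, D) frame features
        out, _ = self.attn(query=slots, key=feats, value=feats)
        return self.proj(out + slots)  # queries for the next recurrence

def rollout(aggregator, transitioner, feats_seq, init_queries):
    """Recurrent aggregation over a clip; returns slots for each frame."""
    queries, slots_seq = init_queries, []
    for t, feats in enumerate(feats_seq):   # feats_seq: list of (B, N, D)
        slots = aggregator(feats, queries)  # e.g. a Slot Attention-style module
        slots_seq.append(slots)
        if t + 1 < len(feats_seq):
            # (t1) queries conditioned on both slots and next-frame features
            queries = transitioner(slots, feats_seq[t + 1])
    return slots_seq

def query_prediction_loss(transitioner, slots_seq, feats_seq):
    """(t2) Sample a random recurrence pair (t -> t+dt) and regress queries."""
    T = len(slots_seq)
    t = random.randint(0, T - 2)
    dt = random.randint(1, T - 1 - t)  # random transition span
    pred = transitioner(slots_seq[t].detach(), feats_seq[t + dt])
    target = slots_seq[t + dt].detach()  # assumed target: slots at t+dt
    return F.mse_loss(pred, target)
\end{verbatim}
Sampling the source slot and the target feature frame from random recurrences, rather than always using adjacent frames, is what forces the transitioner to model transition dynamics over varying time spans instead of memorizing a one-step mapping.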