Video Scene Graph Generation (VidSGG) aims to represent dynamic visual content by detecting objects and modeling their temporal interactions as structured graphs. Prior studies typically target either coarse-grained box-level or fine-grained panoptic pixel-level VidSGG, often requiring task-specific architectures and multi-stage training pipelines. In this paper, we present UNO (UNified Object-centric VidSGG), a single-stage, unified framework that jointly addresses both tasks within an end-to-end architecture. UNO is designed to minimize task-specific modifications and maximize parameter sharing, enabling generalization across different levels of visual granularity. The core of UNO is an extended slot attention mechanism that decomposes visual features into object and relation slots. To ensure robust temporal modeling, we introduce object temporal consistency learning, which enforces consistent object representations across frames without relying on explicit tracking modules. Additionally, a dynamic triplet prediction module links relation slots to corresponding object pairs, capturing evolving interactions over time. We evaluate UNO on standard box-level and pixel-level VidSGG benchmarks. Results demonstrate that UNO not only achieves competitive performance across both tasks but also offers improved efficiency through a unified, object-centric design. Code is available at: https://github.com/Fsoft-AIC/UNO
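The slot-attention mechanism that UNO extends decomposes a set of visual features into a small number of competing slot vectors. The sketch below is a minimal, simplified illustration of that general idea (in the style of Locatello et al.), not the authors' implementation: the learned query/key/value projections and the GRU/MLP slot update of the full method are replaced here by identity maps and a direct weighted-mean update.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, num_slots=4, iters=3, seed=0):
    """Minimal slot-attention sketch.

    inputs: (n, d) array of per-location visual features.
    Returns a (num_slots, d) array of slot vectors. Simplified for
    illustration: no learned projections, no GRU/MLP update.
    """
    rng = np.random.default_rng(seed)
    n, d = inputs.shape
    slots = rng.normal(size=(num_slots, d))
    for _ in range(iters):
        # Attention logits: slots act as queries, inputs as keys.
        logits = slots @ inputs.T / np.sqrt(d)          # (num_slots, n)
        # Softmax over slots, so slots compete for each input location.
        attn = softmax(logits, axis=0)
        # Normalize over inputs to get a weighted mean per slot.
        attn = attn / attn.sum(axis=1, keepdims=True)
        slots = attn @ inputs                           # update slots
    return slots
```

In the full framework, separate object and relation slots would be produced this way; the competition induced by the softmax over slots is what drives the object-centric decomposition.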