Video temporal character grouping locates the moments at which major characters appear in a video, grouped by identity. Recent work has moved from unsupervised clustering to graph-based supervised clustering. However, graph methods rely on fixed affinity graphs, which introduce many inexact connections. They also extract multi-modal features with several separate models, which complicates deployment. In this paper, we present a unified and dynamic graph (UniDG) framework for temporal character grouping. First, a unified representation network learns representations of multiple modalities in a shared space while preserving each modality's uniqueness. Second, we present a dynamic graph clustering method in which a varying number of neighbors is dynamically constructed for each node via a cyclic matching strategy, yielding a more reliable affinity graph. Third, a progressive association method exploits spatial and temporal contexts across modalities so that the multi-modal clustering results can be fused well. Because current datasets provide only pre-extracted features, we evaluate UniDG on a collected dataset named MTCG, which contains, for each character, the clips in which the face and body appear and the speaking voice tracks. We also evaluate our key components on existing clustering and retrieval datasets to verify their generalization ability. Experimental results show that our method achieves promising results and outperforms several state-of-the-art approaches.
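The abstract does not specify how the cyclic matching strategy works, so the following is only an illustrative sketch of the general idea of a dynamic affinity graph in which nodes end up with different numbers of neighbors. It assumes a simple reciprocal check (node j is kept as a neighbor of i only if i is also among j's top-k candidates) as a stand-in; this is not UniDG's actual procedure. The function names `reciprocal_neighbors` and `connected_components`, the parameter `k`, and the toy features are all hypothetical.

```python
import numpy as np


def reciprocal_neighbors(features, k=3):
    """Return a variable-length neighbor list for each node.

    Assumption: a neighbor j of node i is kept only when the match is
    reciprocal (i -> j and j -> i). This is a stand-in for the paper's
    cyclic matching strategy, whose details are not given in the abstract.
    """
    x = np.asarray(features, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # unit-normalize rows
    sim = x @ x.T                                     # cosine similarity
    np.fill_diagonal(sim, -np.inf)                    # exclude self-matches
    k = min(k, len(x) - 1)
    topk = np.argsort(-sim, axis=1)[:, :k]            # top-k candidates per node

    candidate = np.zeros(sim.shape, dtype=bool)
    rows = np.arange(len(x))[:, None]
    candidate[rows, topk] = True
    mutual = candidate & candidate.T                  # keep reciprocal edges only
    return [np.flatnonzero(row).tolist() for row in mutual]


def connected_components(neighbors):
    """Group nodes by connected components of the pruned graph."""
    labels = [-1] * len(neighbors)
    current = 0
    for start in range(len(neighbors)):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            node = stack.pop()
            for nb in neighbors[node]:
                if labels[nb] == -1:
                    labels[nb] = current
                    stack.append(nb)
        current += 1
    return labels


# Toy 2-D features: three samples of one identity, two of another (made up).
demo_features = [[1.0, 0.0], [0.99, 0.1], [0.98, -0.1], [0.0, 1.0], [0.1, 0.99]]
neighbors = reciprocal_neighbors(demo_features, k=2)
labels = connected_components(neighbors)
print(neighbors)
print(labels)
```

With a fixed top-k graph every node would keep exactly `k` edges, including weak cross-identity ones. The reciprocal check drops edges that only one endpoint proposes, so the neighbor count differs from node to node.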