With the increasing demand for intelligent services on online video platforms, the video character search task has attracted wide attention because it supports downstream applications such as fine-grained retrieval and summarization. However, traditional solutions focus only on visual or coarse-grained social information, and thus perform poorly in complex scenes, such as those with changing camera views or character postures. To address this, we leverage social information and scene context as prior knowledge for character search in complex scenes. Specifically, we propose a scene-prior-enhanced framework named SoCoSearch. We first integrate multimodal clues to model the scene context and estimate the prior probability of social relationships, and then capture characters' co-occurrence to generate an enhanced social context graph. Afterwards, we design a social context-aware GCN framework that passes features between characters to obtain robust representations for the character search task. Extensive experiments validate the effectiveness of SoCoSearch across various metrics.
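The graph construction and feature-passing steps named in the abstract can be illustrated with a minimal sketch. It is not the SoCoSearch implementation: the function names, the per-scene `scene_prior` weights (standing in for the estimated prior probability of a social relationship), and the toy data are all assumptions made for illustration, and the propagation step is the standard symmetric-normalized GCN rule rather than the paper's social context-aware variant.

```python
import numpy as np


def cooccurrence_graph(scenes, num_chars, scene_prior):
    """Build a weighted character graph from per-scene co-occurrence.

    scenes: list of lists of character indices appearing in the same scene.
    scene_prior: one weight in [0, 1] per scene (hypothetical stand-in for
        the estimated prior probability of a social relationship).
    """
    adj = np.zeros((num_chars, num_chars))
    for chars, prior in zip(scenes, scene_prior):
        for i in chars:
            for j in chars:
                if i != j:
                    # Each shared scene adds its prior weight to the edge.
                    adj[i, j] += prior
    return adj


def gcn_layer(adj, feats, weight):
    """One standard GCN step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0])      # add self-loops
    deg = a_hat.sum(axis=1)                 # > 0 thanks to self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm @ feats @ weight, 0.0)


# Toy data (assumed): three characters, three scenes.
scenes = [[0, 1], [1, 2], [0, 1]]
scene_prior = [0.9, 0.4, 0.7]
adj = cooccurrence_graph(scenes, num_chars=3, scene_prior=scene_prior)

rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4))    # per-character visual features (random here)
weight = rng.normal(size=(4, 2))   # layer weights (random, untrained)
out = gcn_layer(adj, feats, weight)
```

In this sketch, characters that co-occur in scenes with a high prior share more of each other's features after propagation, which is the intuition behind using social context to stabilize a character's representation when appearance alone is unreliable.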