Make-up temporal video grounding (MTVG) aims to localize, within a long video, the target segment that is semantically related to a sentence describing a make-up activity. Compared with general video grounding, MTVG focuses on meticulous actions and changes on the face. A make-up instruction step usually involves detailed differences in products and facial areas, so it is more fine-grained than general activities (e.g., cooking and furniture assembly). As a result, existing general approaches cannot locate the target activity effectively. More specifically, existing proposal generation modules do not yet provide sufficient semantic cues for fine-grained make-up semantic comprehension. To tackle this issue, we propose an effective proposal-based framework named Dual-Path Temporal Map Optimization Network (DPTMO) to capture fine-grained multimodal semantic details of make-up activities. DPTMO extracts both query-agnostic and query-guided features to construct two proposal sets and evaluates each set with its own method. Unlike the single-path structure commonly used in previous methods, our dual-path structure mines more semantic information from make-up videos and better distinguishes fine-grained actions. The two candidate sets represent the cross-modal make-up video-text similarity and the multimodal fusion relationship, respectively, and complement each other. Each set has its own optimization perspective, and their joint prediction improves the accuracy of video timestamp prediction. Comprehensive experiments on the YouMakeup dataset demonstrate that the proposed dual-path structure excels in fine-grained semantic comprehension.
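The dual-path idea described in the abstract can be illustrated with a toy sketch. The abstract does not specify how proposals are built, scored, or combined, so everything below is an assumption made for illustration: the function names (`mean_pool`, `similarity_map`, `fusion_map`, `joint_predict`), the mean-pooled span proposals, the cosine-similarity score for the query-agnostic path, the element-wise product plus sigmoid for the query-guided path, and the weighted average with a hypothetical `alpha` are stand-ins, not the DPTMO modules themselves.

```python
import math


def mean_pool(clips, start, end):
    """Average the clip feature vectors over the inclusive span [start, end]."""
    span = clips[start:end + 1]
    dim = len(span[0])
    return [sum(vec[d] for vec in span) / len(span) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def similarity_map(clips, query):
    """Path 1 (assumed): pool query-agnostic clip features per proposal,
    then score each proposal by its similarity to the query embedding."""
    n = len(clips)
    return {(s, e): cosine(mean_pool(clips, s, e), query)
            for s in range(n) for e in range(s, n)}


def fusion_map(clips, query):
    """Path 2 (assumed): fuse each clip with the query first (element-wise
    product), pool the query-guided features, and squash the sum to (0, 1)."""
    guided = [[c * q for c, q in zip(vec, query)] for vec in clips]
    n = len(clips)
    scores = {}
    for s in range(n):
        for e in range(s, n):
            pooled = mean_pool(guided, s, e)
            scores[(s, e)] = 1.0 / (1.0 + math.exp(-sum(pooled)))
    return scores


def joint_predict(clips, query, alpha=0.5):
    """Combine both maps with a weighted average (hypothetical fusion rule)
    and return the (start, end) clip indices of the best proposal."""
    sim = similarity_map(clips, query)
    fus = fusion_map(clips, query)
    joint = {p: alpha * sim[p] + (1.0 - alpha) * fus[p] for p in sim}
    return max(joint, key=joint.get)


# Toy input: four clips with 2-D features and a 2-D query embedding.
clips = [[1.0, 0.0], [0.2, 1.0], [0.0, 1.0], [1.0, 0.0]]
query = [0.0, 1.0]
best_span = joint_predict(clips, query)
```

Enumerating every `(start, end)` pair with `start <= end` corresponds to the upper triangle of a 2D temporal map; a real system would replace the hand-written scores with learned modules and map clip indices back to timestamps.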