Most methods for spotting micro- and macro-expressions in untrimmed videos are burdened by video-wise collection and frame-wise annotation. Weakly-supervised expression spotting (WES), which relies only on video-level labels, can reduce the cost of frame-level annotation while still achieving fine-grained frame-level spotting. However, we argue that existing weakly-supervised methods, which are based on multiple instance learning (MIL), suffer from inter-modality, inter-sample, and inter-task gaps. The inter-sample gap arises primarily from differences in sample distribution and duration. We therefore propose MC-WES, a novel and simple WES framework that alleviates these gaps and incorporates prior knowledge through four collaborative consistency strategies: modal-level saliency, video-level distribution, label-level duration, and segment-level feature consistency. Together, these strategies enable fine-grained frame-level spotting with only video-level labels. The modal-level saliency consistency strategy captures key correlations between raw images and optical flow. The video-level distribution consistency strategy exploits differences in the sparsity of the temporal distribution. The label-level duration consistency strategy exploits differences in the duration of facial muscle movements. The segment-level feature consistency strategy encourages features that share the same label to remain similar. Experimental results on three challenging datasets -- CAS(ME)$^2$, CAS(ME)$^3$, and SAMM-LV -- demonstrate that MC-WES is comparable to state-of-the-art fully-supervised methods.
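To make the MIL setting mentioned above concrete, the following is a minimal sketch of one common way to turn per-snippet scores into a video-level prediction that can be supervised with video-level labels alone: top-k temporal pooling followed by a binary cross-entropy loss. It is an illustration of generic MIL aggregation, not the MC-WES implementation; the function names, the choice of `k`, and the two-class layout (macro, micro) are assumptions made for this example.

```python
import numpy as np


def topk_mil_video_scores(snippet_logits: np.ndarray, k: int) -> np.ndarray:
    """Aggregate per-snippet class logits of shape (T, C) into
    video-level probabilities of shape (C,) by top-k mean pooling.

    Hypothetical helper for illustration only.
    """
    if snippet_logits.ndim != 2:
        raise ValueError("expected an array of shape (T, C)")
    num_snippets = snippet_logits.shape[0]
    k = max(1, min(k, num_snippets))  # keep k within [1, T]
    # Sort each class column in ascending order and keep the k largest logits.
    topk = np.sort(snippet_logits, axis=0)[-k:, :]
    video_logits = topk.mean(axis=0)
    # Sigmoid, because a video may contain both expression types.
    return 1.0 / (1.0 + np.exp(-video_logits))


def video_level_bce(video_probs, video_labels, eps: float = 1e-7) -> float:
    """Binary cross-entropy between video-level probabilities and labels."""
    p = np.clip(np.asarray(video_probs, dtype=float), eps, 1.0 - eps)
    y = np.asarray(video_labels, dtype=float)
    return float(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)).mean())


# Toy video: 8 snippets, 2 classes (assumed order: macro, micro).
logits = np.full((8, 2), -2.0)
logits[2:4, 0] = 3.0  # a short burst of evidence for class 0 only

probs = topk_mil_video_scores(logits, k=2)
loss = video_level_bce(probs, [1, 0])
```

Because only the k highest-scoring snippets contribute to each class score, the video-level label supervises a sparse subset of the timeline, and the per-snippet logits can afterwards be thresholded to obtain frame-level proposals. The same sparsity is also why a single fixed `k` is a crude fit when expression classes differ in frequency and duration.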