Traditional approaches to analyzing RGB frames can provide a fine-grained understanding of a face from different angles by inferring emotions, poses, shapes, and landmarks. However, when it comes to subtle movements, standard RGB cameras may fall short due to their latency, making it hard to detect the micro-movements that carry highly informative cues about a subject's true emotions. To address this issue, the use of event cameras for face analysis is gaining increasing interest. Nonetheless, the expertise developed for RGB processing is not directly transferable to neuromorphic data because of a strong domain shift and intrinsic differences in how the data is represented. The lack of labeled data can be considered one of the main causes of this gap, and gathering data is harder in the event domain: it cannot be crawled from the web, and labeling frames must take into account event aggregation rates and the fact that static parts of the face may not be visible in certain frames. In this paper, we first present FACEMORPHIC, a multimodal, temporally synchronized face dataset comprising both RGB videos and event streams. The data is labeled at the video level with facial Action Units and also contains streams collected with a variety of applications in mind, ranging from 3D shape estimation to lip reading. We then show how temporal synchronization allows effective neuromorphic face analysis without the need to manually annotate videos: instead, we leverage cross-modal supervision that bridges the domain gap by representing face shapes in a 3D space.
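To make the cross-modal supervision concrete, below is a minimal sketch (not the authors' code) of the general idea: 3D face coefficients regressed from the temporally synchronized RGB stream serve as regression targets for an event-stream encoder, removing the need for manual annotation. The `EventEncoder` architecture, the 10-bin voxel-grid event representation, and the 52-dimensional coefficient vector are all illustrative assumptions.

```python
# Hedged sketch of cross-modal supervision from synchronized RGB to events.
# Assumptions (not from the paper): events are voxelized into 10 temporal bins,
# and the RGB branch produces 52 fixed 3DMM-style expression coefficients.
import torch
import torch.nn as nn


class EventEncoder(nn.Module):
    """Hypothetical encoder mapping a voxelized event clip to 3D face coefficients."""

    def __init__(self, in_channels: int = 10, out_dim: int = 52):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, out_dim)

    def forward(self, events: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(events))


def cross_modal_loss(event_clip: torch.Tensor,
                     rgb_coeffs: torch.Tensor,
                     encoder: EventEncoder) -> torch.Tensor:
    # rgb_coeffs: coefficients from a pretrained RGB face regressor on the
    # synchronized RGB video; treated as fixed targets (hence detach()).
    pred = encoder(event_clip)
    return nn.functional.mse_loss(pred, rgb_coeffs.detach())


# Usage: one training step on a dummy batch (batch of 4 voxelized event clips).
encoder = EventEncoder()
events = torch.randn(4, 10, 128, 128)   # stand-in for voxelized events
targets = torch.randn(4, 52)            # stand-in for RGB-derived 3D coefficients
loss = cross_modal_loss(events, targets, encoder)
loss.backward()
```

The key design point is that supervision lives in a shared 3D shape space rather than in pixel space, so the event encoder never needs per-frame labels; any sufficiently accurate RGB-based 3D face regressor could supply the targets.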