Our understanding of human visual perception has historically inspired the design of computer vision architectures. For example, perception occurs at different scales both spatially and temporally, which suggests that salient visual information can be extracted more effectively by attending to specific features at varying scales. Visual changes in the body caused by physiological processes also occur at different scales and with modality-specific characteristic properties. Inspired by this, we present BigSmall, an efficient architecture for physiological and behavioral measurement, and with it the first joint camera-based facial action, cardiac, and pulmonary measurement model. We propose a multi-branch network with wrapping temporal shift modules that yields gains in both accuracy and efficiency. We observe that fusing low-level features leads to suboptimal performance, but that fusing high-level features enables efficiency gains with negligible loss in accuracy. Experimental results demonstrate that BigSmall significantly reduces computational cost. Furthermore, compared to existing task-specific models, BigSmall achieves comparable or better results on multiple physiological measurement tasks simultaneously with a single unified model.
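
The abstract names wrapping temporal shift modules without defining them. As a rough illustration only, the sketch below assumes that "wrapping" means a circular shift: one group of channels is taken from the previous frame, a second group from the next frame, and the frames at the ends of the clip wrap around instead of being zero-padded. The function name `wrapping_temporal_shift`, the `fold_div` parameter, and the list-of-lists layout are hypothetical choices made for this example and are not taken from the BigSmall implementation.

```python
def wrapping_temporal_shift(frames, fold_div=3):
    """Circularly shift channel groups along the time axis.

    frames: list of T frames, each a list of C per-channel values.
    fold_div: assumed hyperparameter; C // fold_div channels are shifted
              in each temporal direction, the rest stay in place.
    """
    t = len(frames)
    c = len(frames[0])
    fold = c // fold_div
    shifted = []
    for i in range(t):
        prev_frame = frames[(i - 1) % t]  # wraps to the last frame when i == 0
        next_frame = frames[(i + 1) % t]  # wraps to the first frame when i == t - 1
        shifted.append(
            prev_frame[:fold]                # first group: from the previous frame
            + next_frame[fold:2 * fold]      # second group: from the next frame
            + frames[i][2 * fold:]           # remaining channels: unchanged
        )
    return shifted


# Toy clip: 3 frames, 3 channels, where the value at (t, c) is 3 * t + c.
clip = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
result = wrapping_temporal_shift(clip)
```

Under this assumption every output frame still holds real feature values from neighbouring frames, whereas a zero-padded shift would leave the boundary frames with empty channel groups.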