This paper analyzes design choices for face detection architectures that improve the trade-off between computational cost and accuracy. Specifically, we re-examine the effectiveness of the standard convolutional block as a lightweight backbone for face detection. In contrast to the current trend in lightweight architecture design, which relies heavily on depthwise separable convolution layers, we show that heavily channel-pruned standard convolution layers achieve better accuracy and faster inference at a similar parameter count. This observation is supported by analyses of the characteristics of the target data domain, faces. Based on this observation, we propose employing a ResNet with a highly reduced number of channels, which proves surprisingly efficient compared with other mobile-friendly networks (e.g., MobileNetV1, V2, V3). Through extensive experiments, we show that the proposed backbone can replace that of the state-of-the-art face detector while providing faster inference. We further propose a new feature aggregation method to maximize detection performance. Our proposed detector, EResFD, obtains 80.4% mAP on the WIDER FACE Hard subset while taking only 37.7 ms for VGA image inference on a CPU. Code is available at https://github.com/clovaai/EResFD.
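To make the "similar parameter size" comparison concrete, the following minimal sketch counts the weights of a single 3x3 layer in both styles and finds how far a standard convolution must be channel-pruned to fit the budget of a depthwise separable one. The function names and the example width of 128 are illustrative assumptions, not the EResFD configuration, and biases and normalization parameters are ignored.

```python
import math


def standard_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights in a k x k standard convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


def matched_standard_width(c: int, k: int = 3) -> int:
    """Largest width w such that a standard k x k conv (w -> w) uses no more
    weights than a depthwise separable k x k layer (c -> c)."""
    budget = depthwise_separable_params(c, c, k)
    return math.isqrt(budget // (k * k))


if __name__ == "__main__":
    # Hypothetical width, chosen only for illustration.
    c = 128
    w = matched_standard_width(c)
    print("depthwise separable, width", c, ":", depthwise_separable_params(c, c))
    print("standard conv, width", w, ":", standard_conv_params(w, w))
```

The count only shows which channel widths put the two layer types on an equal parameter budget. It says nothing about accuracy or latency, which the paper evaluates experimentally.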