Human pose estimation (HPE) with convolutional neural networks (CNNs) for indoor monitoring is one of the major challenges in computer vision. In contrast to HPE in perspective views, an indoor monitoring system can rely on a single omnidirectional camera with a 180° field of view, so that the pose of a person can be detected with only one sensor per room. Keypoint detection is an essential upstream step for recognizing human pose. In this work, we propose a new dataset for training and evaluating CNNs for keypoint detection in omnidirectional images. The training dataset, THEODORE+, consists of 50,000 images created with a 3D rendering engine: in a dynamically created 3D indoor scene, persons walk randomly while the omnidirectional camera moves simultaneously, which yields synthetic RGB images together with 2D and 3D ground truth. For evaluation, we captured and annotated the real-world PoseFES dataset, which comprises two scenarios and 701 frames with up to eight persons per scene. We propose four training paradigms to fine-tune or re-train two top-down models in MMPose and two bottom-up models in CenterNet on THEODORE+. In addition to a qualitative evaluation, we report quantitative results. Compared to a COCO-pretrained baseline, we achieve significant improvements on the PoseFES dataset, especially for top-view scenes. Our datasets can be found at https://www.tu-chemnitz.de/etit/dst/forschung/comp_vision/datasets/index.php.en.