EKF-Based Depth Camera and Deep Learning Fusion for UAV-Person Distance Estimation and Following in SAR Operations

Search and rescue (SAR) operations require rapid responses to save lives or property. Unmanned Aerial Vehicles (UAVs) equipped with vision-based systems support these missions through prior terrain investigation or real-time assistance during the mission itself. Vision-based UAV frameworks aid human search tasks by detecting and recognizing specific individuals, then tracking and following them while maintaining a safe distance. A key safety requirement for UAV following is accurate estimation of the distance between the camera and the target under real-world conditions, which is achieved by fusing multiple image modalities. UAVs with deep learning-based vision systems offer a new approach to the planning and execution of SAR operations. As part of a deep learning-based system for automatic people detection and face recognition, this paper presents the fusion of depth camera measurements and monocular camera-to-body distance estimation for robust tracking and following. YOLO-pose is used both to filter the depth camera data and to estimate the camera-to-body distance from a monocular camera, and the two distance sources are fused in real time with the Extended Kalman Filter (EKF) algorithm. The proposed subsystem, designed for use in drones, estimates and measures the distance between the depth camera and the human body keypoints in order to maintain a safe distance between the drone and the human target. The estimated distance has been validated against motion capture ground truth data. The system has been tested in real time indoors, where it reduces the mean error, root mean square error (RMSE), and standard deviation of the distance estimates by up to 15.3% across three tested scenarios.
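To illustrate the kind of fusion described above, the following is a minimal sketch of a scalar EKF that combines a direct depth reading with a monocular cue. It is not the authors' implementation: the abstract does not specify the state vector, motion model, or measurement models, so everything here is an assumption. In particular, the random-walk motion model, the pinhole relation between distance and the pixel length of a body segment, the class name `ScalarDistanceEKF`, and the parameters `focal_px`, `body_m`, `q`, and `r` are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScalarDistanceEKF:
    """Hypothetical 1-D EKF over camera-to-person distance d (metres)."""
    d: float          # state estimate: distance in metres
    p: float          # variance of the state estimate
    q: float = 0.01   # assumed process noise variance added per step

    def predict(self) -> None:
        # Assumed random-walk model: distance unchanged, uncertainty grows.
        self.p += self.q

    def _update(self, residual: float, h_jac: float, r: float) -> None:
        s = h_jac * self.p * h_jac + r   # innovation variance
        k = self.p * h_jac / s           # Kalman gain
        self.d += k * residual
        self.p = (1.0 - k * h_jac) * self.p

    def update_depth(self, z_depth: float, r: float = 0.05 ** 2) -> None:
        # Depth camera observes distance directly: h(d) = d, Jacobian = 1.
        self._update(z_depth - self.d, 1.0, r)

    def update_pixel_length(self, z_px: float, focal_px: float,
                            body_m: float, r: float = 4.0 ** 2) -> None:
        # Assumed pinhole model: h(d) = focal_px * body_m / d (nonlinear in d),
        # linearized around the current estimate as in a standard EKF.
        predicted = focal_px * body_m / self.d
        h_jac = -focal_px * body_m / (self.d ** 2)
        self._update(z_px - predicted, h_jac, r)


# Usage: a person standing 3 m away, torso segment assumed 0.5 m long.
ekf = ScalarDistanceEKF(d=2.0, p=1.0)
for _ in range(20):
    ekf.predict()
    ekf.update_depth(3.0)
    ekf.update_pixel_length(600.0 * 0.5 / 3.0, focal_px=600.0, body_m=0.5)
```

In this sketch the pixel length would come from two pose keypoints (for example shoulder and hip), and the noise variances would need to be tuned against ground truth such as motion capture data. A real follower would likely also track the rate of change of distance in the state.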
