Dynamic Open Vocabulary Enhanced Safe-landing with Intelligence (DOVESEI)

This work targets what we consider the foundational step for urban airborne robots: a safe landing. We focus on what we deem the most crucial part of the safe-landing perception stack: segmentation. We present a streamlined reactive UAV system that performs visual servoing using open vocabulary image segmentation. Because it is open vocabulary, the approach adapts to various scenarios with minimal adjustments and bypasses the need to collect extensive data for refining internal models. Given the limitations imposed by local authorities, we focus on operations starting from an altitude of 100 meters. This choice is deliberate, as numerous preceding works have dealt with altitudes up to 30 meters, aligning with the capabilities of small stereo cameras. Consequently, we leave the remaining 20 m to be navigated using conventional 3D path planning methods. Using monocular cameras and image segmentation, our findings demonstrate that the system can successfully execute landing maneuvers at altitudes as low as 20 meters. However, this approach is vulnerable to intermittent and occasionally abrupt fluctuations in the segmentation between frames of a video stream. To address this challenge, we enhance the image segmentation output with what we call a dynamic focus: a masking mechanism that self-adjusts according to the current landing stage. This dynamic focus guides the control system to avoid regions beyond the drone's safety radius projected onto the ground, thus mitigating the problems caused by these fluctuations. With this supplementary layer, our experiments improved the landing success rate almost tenfold compared with global segmentation. All the source code is open source and available online (github.com/MISTLab/DOVESEI).
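The dynamic focus described above can be illustrated with a minimal sketch. This is not the code from the DOVESEI repository. It assumes a pinhole camera pointing straight down over flat ground and a binary "safe to land" segmentation map. The function names, the `focus_scale` parameter, and the numeric values are hypothetical and chosen only for illustration.

```python
import math
import numpy as np


def safety_radius_px(altitude_m, safety_radius_m, hfov_deg, image_width_px):
    """Project a ground-plane safety radius into pixels.

    Assumes a nadir-pointing pinhole camera over flat ground (an
    assumption of this sketch, not a detail taken from the paper).
    """
    # Width of the ground strip visible across the image at this altitude.
    ground_width_m = 2.0 * altitude_m * math.tan(math.radians(hfov_deg) / 2.0)
    return safety_radius_m * image_width_px / ground_width_m


def dynamic_focus_mask(shape, radius_px, focus_scale=1.0):
    """Boolean mask that is True inside a circle centred on the image.

    focus_scale is a hypothetical knob: a landing-stage controller could
    raise it to widen the search and lower it to commit to a spot.
    """
    height, width = shape
    rows, cols = np.ogrid[:height, :width]
    centre_row, centre_col = (height - 1) / 2.0, (width - 1) / 2.0
    dist_sq = (rows - centre_row) ** 2 + (cols - centre_col) ** 2
    return dist_sq <= (focus_scale * radius_px) ** 2


def apply_focus(safe_map, mask):
    """Keep segmentation output only inside the focus region."""
    return np.where(mask, safe_map, False)


if __name__ == "__main__":
    # Hypothetical numbers: 5 m safety radius, 90 degree horizontal FOV.
    for altitude in (100.0, 20.0):
        r = safety_radius_px(altitude, 5.0, 90.0, 640)
        mask = dynamic_focus_mask((480, 640), r)
        print(f"altitude {altitude:5.1f} m: radius {r:5.1f} px, "
              f"{int(mask.sum())} px in focus")
```

Because the projected radius grows as the altitude drops, the same physical safety radius covers more of the image near the ground. Segmentation flicker outside the focus region is discarded before it can reach the controller.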
