As aerial platforms evolve from passive observers into active manipulators, the challenge shifts toward designing intuitive interfaces that allow non-expert users to command these systems naturally. This work introduces a novel concept for an autonomous aerial manipulation system that interprets high-level natural language commands to retrieve objects and deliver them to a human user. The system is intended to integrate MediaPipe, Grounding DINO, and a Vision-Language-Action (VLA) model with a custom-built drone equipped with a 1-DOF gripper and an Intel RealSense RGB-D camera. The VLA model performs semantic reasoning to interpret the intent of a user prompt and generates a prioritized task queue for grasping the relevant objects in the scene. Grounding DINO and a dynamic A* planning algorithm are used to navigate to and safely relocate each object. To ensure safe and natural interaction during the handover phase, the system employs a human-centric controller driven by MediaPipe. This module provides real-time human pose estimation, allowing the drone to use visual servoing to hold a stable position directly in front of the user and facilitate a comfortable handover. We demonstrate the system's efficacy through real-world localization and navigation experiments, which yielded maximum, mean, and root-mean-square Euclidean errors of 0.164 m, 0.070 m, and 0.084 m, respectively, highlighting the feasibility of VLA models for aerial manipulation.
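The handover behavior described above can be illustrated with a minimal sketch of a proportional visual-servoing step. This is not the authors' implementation: the function name, gains, and the assumption that the user's 3D position and facing direction arrive from an upstream pose-estimation module (e.g. MediaPipe landmarks fused with RealSense depth) are all hypothetical. The sketch only shows the core idea of steering the drone toward a fixed standoff point in front of the user with a saturated proportional velocity command.

```python
import math

def handover_velocity_command(user_pos, user_yaw, drone_pos,
                              standoff=1.0, kp=0.8, v_max=0.5):
    """One proportional visual-servoing step (illustrative sketch).

    user_pos  : (x, y, z) of the user in metres (assumed given by an
                upstream pose estimator, e.g. MediaPipe + RGB-D depth).
    user_yaw  : direction the user is facing, in radians.
    drone_pos : current (x, y, z) of the drone.
    Returns a velocity command (vx, vy, vz) in m/s, saturated at v_max.
    """
    # Target point `standoff` metres directly in front of the user,
    # at the same height as the user's torso.
    tx = user_pos[0] + standoff * math.cos(user_yaw)
    ty = user_pos[1] + standoff * math.sin(user_yaw)
    tz = user_pos[2]

    # Proportional control on the position error.
    ex, ey, ez = tx - drone_pos[0], ty - drone_pos[1], tz - drone_pos[2]
    norm = math.sqrt(ex * ex + ey * ey + ez * ez)

    # Saturate the commanded speed so the approach stays gentle near
    # the user (norm > 0 whenever saturation is active, so no div-by-zero).
    scale = kp if kp * norm <= v_max else v_max / norm
    return (scale * ex, scale * ey, scale * ez)
```

Iterating this command at a fixed rate drives the drone to the standoff point: far away it approaches at the capped speed, and close in the proportional term decays the error smoothly, which matches the "stable position directly in front of the user" behavior the abstract describes.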