Drones operating in human-occupied spaces lack adequate mechanisms for communicating their intentions, leaving nearby people uncertain about what they will do. We present HoverAI, an embodied aerial agent that integrates drone mobility, infrastructure-independent visual projection, and real-time conversational AI into a unified platform. Equipped with a MEMS laser projector, an onboard semi-rigid screen, and an RGB camera, HoverAI perceives users through vision and voice and responds via lip-synced avatars whose appearance adapts to user demographics. The system employs a multimodal pipeline combining voice activity detection (VAD), ASR (Whisper), LLM-based intent classification, RAG for dialogue, face analysis for personalization, and voice synthesis (XTTS v2). Evaluation demonstrates high accuracy in command recognition (F1: 0.90), demographic estimation (gender F1: 0.89, age MAE: 5.14 years), and speech transcription (WER: 0.181). By uniting aerial robotics with adaptive conversational AI and self-contained visual output, HoverAI introduces a new class of spatially aware, socially responsive embodied agents for applications in guidance, assistance, and human-centered interaction.
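The multimodal pipeline named in the abstract (VAD → ASR → intent classification → RAG dialogue → face analysis → voice synthesis) can be sketched as a chain of stages. All function names and stage bodies below are hypothetical stubs for illustration, not the authors' implementation; a real system would wire in Whisper for `transcribe`, an LLM endpoint for `classify_intent` and `retrieve_and_answer`, a face analysis model, and XTTS v2 for `synthesize_speech`.

```python
# Minimal sketch of a HoverAI-style multimodal pipeline.
# Every stage here is a hypothetical stub standing in for the
# component the paper names (Whisper, LLM + RAG, face analysis,
# XTTS v2); only the control flow is the point.

from dataclasses import dataclass
from typing import Optional


@dataclass
class UserProfile:
    gender: str
    age: int


def detect_voice_activity(audio: bytes) -> bool:
    # VAD stub: treat any non-empty audio buffer as speech.
    return len(audio) > 0


def transcribe(audio: bytes) -> str:
    # ASR stub standing in for Whisper.
    return "where is the exit"


def classify_intent(text: str) -> str:
    # Stub for LLM-based intent classification.
    return "navigation_query" if "where" in text else "chitchat"


def retrieve_and_answer(text: str, intent: str) -> str:
    # RAG stub: retrieval and generation collapsed into a canned reply.
    return "The exit is down the hall to your left."


def analyze_face(frame: bytes) -> UserProfile:
    # Face analysis stub used for demographic-adaptive avatars.
    return UserProfile(gender="female", age=30)


def synthesize_speech(text: str, profile: UserProfile) -> bytes:
    # TTS stub standing in for XTTS v2; returns raw "audio" bytes.
    return text.encode("utf-8")


def run_pipeline(audio: bytes, frame: bytes) -> Optional[bytes]:
    """Run one interaction turn; returns None if no speech detected."""
    if not detect_voice_activity(audio):
        return None
    text = transcribe(audio)
    intent = classify_intent(text)
    reply = retrieve_and_answer(text, intent)
    profile = analyze_face(frame)
    return synthesize_speech(reply, profile)
```

The gating on VAD reflects the design implied by the abstract: downstream stages (ASR, LLM, TTS) run only once speech is detected, which keeps the onboard compute budget bounded between interactions.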