This paper presents UNCOM, a novel hybrid framework for interpreting natural human commands in tabletop scenarios. The system integrates multiple sources of information -- speech, gestures, and scene context -- to extract structured, actionable instructions for robots. Addressing the need for general-purpose human-robot interaction in domestic environments, UNCOM is designed for zero-shot operation, without relying on predefined object models or task-specific training data. Using foundation models and task-specific deep learning models, it provides out-of-the-box speech recognition, natural language understanding, gesture detection, and object segmentation. The modular architecture enhances transparency and explainability by explicitly parsing commands into object-action-target representations, which enables integration with symbolic robotic frameworks. We demonstrate the system on a TIAGo++ robot and evaluate it on a real-world data set of human-robot interaction scenarios, where it achieves an 82.39\% success rate, highlighting its robustness to diversity, noise, and communication ambiguity. The data set, evaluation scenarios, and code are publicly available to support future research.
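To make the object-action-target representation concrete, the following is a minimal sketch of how such a structured command might be stored and how a pointing gesture could resolve a deictic word such as "that". It is not taken from the UNCOM code base: the names `Command`, `resolve`, and `build_command`, the `DEICTIC` word list, and the simple substitution rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Command:
    """Hypothetical structured instruction: what to act on, how, and where."""
    obj: Optional[str]
    action: str
    target: Optional[str] = None


# Assumed list of words that need a gesture to be grounded (illustrative only).
DEICTIC = {"this", "that", "it", "here", "there"}


def resolve(word: Optional[str], pointed_at: Optional[str]) -> Optional[str]:
    """Replace a deictic word with the gesture referent, if one is available."""
    if word is not None and word.lower() in DEICTIC and pointed_at is not None:
        return pointed_at
    return word


def build_command(
    action: str,
    obj: Optional[str],
    target: Optional[str] = None,
    pointed_object: Optional[str] = None,
    pointed_location: Optional[str] = None,
) -> Command:
    """Fuse parsed speech slots with gesture referents into one Command."""
    return Command(
        obj=resolve(obj, pointed_object),
        action=action,
        target=resolve(target, pointed_location),
    )


# "Place that on the plate" while pointing at a red cup.
cmd = build_command("place", "that", "plate", pointed_object="red cup")
print(cmd)
```

Keeping the representation as a small immutable record is one way such a structure could be handed to a symbolic planner, since each field maps directly onto a slot in a planning operator.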