Body movement communicates intent at distances and in conditions where neither the face nor speech can be captured. We study the recognition of communicative intent from 2D body pose alone. We argue that body motion is a reliable signal, especially in scenarios that require real-time, low-cost, on-device person-to-robot communication over long distances, such as rescue missions. However, existing resources do not isolate this signal: affective corpora combine body, face, voice, and text, while skeleton action-recognition benchmarks label the action performed rather than the message conveyed. We release a dataset of real frames of full-body pose covering ten communicative intents, and we compare it against other real (IPC) and synthetic (MotionLCM, VEO3.1, Kimodo) datasets that span a range of difficulty. We target systems that can run on a robot's limited onboard hardware. We benchmark multiple models, from skeleton graph classifiers to joint motion-forecasting networks, and report performance metrics together with frame rate on an embedded GPU (NVIDIA Orin Nano), since speed matters as much as accuracy in our scenario. Finally, we show that a model's own autoregressive self-consistency works as an unsupervised reliability signal. We give a short proof that bounds the probability that a self-consistent prediction is correct, show that this probability grows with the number of consistent steps, and identify the conditions under which a confident prediction can still be false. We benchmark this reliability signal against industry-standard metrics.