Offline Risk-sensitive RL with Partial Observability to Enhance Performance in Human-Robot Teaming

Integrating physiological computing into mixed-initiative human-robot interaction systems benefits autonomous task allocation by feeding real-time physiological features into the decision-making system as observations of the human's state. This approach may alleviate the cognitive load on human operators by intelligently allocating mission tasks between agents. Nevertheless, accommodating a diverse pool of human participants with varying physiological and behavioral measurements presents a substantial challenge. Addressing it requires a probabilistic framework, given the inherent uncertainty and partial observability of the human's state. Recent research suggests learning a Partially Observable Markov Decision Process (POMDP) model from a data set of previously collected experiences, which can then be solved using Offline Reinforcement Learning (ORL) methods. In the present work, we not only highlight the potential of partially observable representations and physiological measurements to improve human operator state estimation and performance, but also enhance the overall mission effectiveness of a human-robot team. Importantly, because the fixed data set may not contain enough information to fully represent complex stochastic processes, we propose a method that incorporates model uncertainty, thus enabling risk-sensitive sequential decision-making. Experiments were conducted with twenty-six human participants in a simulated robot teleoperation environment, yielding empirical evidence of the method's efficacy. The resulting adaptive task allocation policy achieved statistically significantly higher scores than the policy used to collect the data set, and it generalized across diverse participants while also accounting for risk-sensitive metrics.
