Bootstrapping Adaptive Human-Machine Interfaces with Offline Reinforcement Learning

Adaptive interfaces can help users perform sequential decision-making tasks like robotic teleoperation given noisy, high-dimensional command signals (e.g., from a brain-computer interface). Recent advances in human-in-the-loop machine learning enable such systems to improve by interacting with users, but in practice they tend to be limited by the amount of data they can collect from individual users. In this paper, we propose a reinforcement learning algorithm that addresses this limitation by training an interface to map raw command signals to actions using a combination of offline pre-training and online fine-tuning. To address the challenges posed by noisy command signals and sparse rewards, we develop a novel method for representing and inferring the user's long-term intent for a given trajectory. We primarily evaluate our method's ability to assist users who can only communicate through noisy, high-dimensional input channels in a user study where 12 participants performed a simulated navigation task by using their eye gaze to modulate a 128-dimensional command signal from their webcam. The results show that our method enables successful goal navigation more often than a baseline directional interface, by learning to denoise user command signals and provide shared autonomy assistance. We further evaluate on a simulated Sawyer pushing task with eye gaze control and on the Lunar Lander game with simulated user commands, and find that our method improves over baseline interfaces in these domains as well. Extensive ablation experiments with simulated user commands empirically motivate each component of our method.