Designing speech policies that are both efficient and acceptable for conversational service robots in open, diverse environments is non-trivial. Unlike fixed, hand-tuned parameters, online learning can adapt to non-stationary conditions. In this paper, we study how to adapt a social robot's speech policy in the wild. During a 12-day in-situ deployment with over 1,400 public encounters, we cast online policy optimization as a multi-armed bandit problem and use Thompson sampling to select among six actions defined by speech rate (slow/normal/fast) and verbosity (concise/detailed). We compare three complementary binary rewards, Ru (user rating), Rc (conversation closure), and Rt (at least two turns), and show that each induces distinct arm distributions and interaction behaviors. We complement the online results with offline evaluations that use video-annotated data to analyze contextual factors such as crowd level and group size. Taken together, we distill ready-to-use design lessons for deploying online optimization of speech policies in real public HRI settings.
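To make the bandit formulation concrete, the sketch below shows one standard way to run Beta-Bernoulli Thompson sampling over the six speech-policy arms described above. This is an illustrative sketch only, not the authors' implementation: the Beta(1,1) priors, the arm encoding, and the `run_encounter` helper are assumptions added for illustration, and the binary reward could be any of Ru, Rc, or Rt.

```python
# Illustrative sketch (not the paper's code): Beta-Bernoulli Thompson sampling
# over six arms = speech rate {slow, normal, fast} x verbosity {concise, detailed}.
import random
from itertools import product

ARMS = list(product(["slow", "normal", "fast"], ["concise", "detailed"]))  # 6 arms

# Beta(1, 1) prior per arm (an assumption); alpha counts successes, beta counts failures.
alpha = {arm: 1.0 for arm in ARMS}
beta = {arm: 1.0 for arm in ARMS}

def select_arm():
    """Sample a success probability for each arm and play the argmax (Thompson sampling)."""
    samples = {arm: random.betavariate(alpha[arm], beta[arm]) for arm in ARMS}
    return max(samples, key=samples.get)

def update(arm, reward):
    """Update the chosen arm with a binary reward, e.g. Ru, Rc, or Rt as 0/1."""
    if reward:
        alpha[arm] += 1.0
    else:
        beta[arm] += 1.0

# Hypothetical interaction loop for the deployment; run_encounter() is a placeholder
# that would execute one public encounter with the chosen policy and return 0 or 1.
# for _ in range(num_encounters):
#     rate, verbosity = select_arm()
#     r = run_encounter(rate, verbosity)
#     update((rate, verbosity), r)
```

Under this kind of scheme, each reward definition shapes the posterior counts differently over the deployment, which is consistent with the abstract's observation that the three rewards induce distinct arm distributions.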