Existing human-robot interaction systems often lack mechanisms for sustained personalization and dynamic adaptation in multi-user environments, limiting their effectiveness in real-world deployments. We present HARMONI, a multimodal personalization framework that leverages large language models to enable socially assistive robots to manage long-term, multi-user interactions. The framework integrates four key modules: (i) a perception module that identifies active speakers and extracts multimodal input; (ii) a world modeling module that maintains representations of the environment and short-term conversational context; (iii) a user modeling module that updates long-term, speaker-specific profiles; and (iv) a generation module that produces contextually grounded and ethically informed responses. Through extensive evaluations and ablation studies on four datasets, as well as a scenario-driven user study in a real-world nursing home environment, we demonstrate that HARMONI supports robust speaker identification, online memory updating, and ethically aligned personalization, outperforming baseline LLM-driven approaches in user modeling accuracy, personalization quality, and user satisfaction.