Real-time conversational assistants for procedural tasks often depend on video input, which is computationally expensive and compromises user privacy. We propose, for the first time, a real-time conversational assistant that provides comprehensive guidance for a procedural task using only lightweight, privacy-preserving modalities, namely audio and IMU signals from a user's wearable device, to understand the context. The assistant proactively communicates step-by-step instructions to a user performing a furniture-assembly task and answers the user's questions. We construct a dataset of conversations in which the assistant guides the user through the task. Observing that an off-the-shelf language model makes for an overly talkative assistant, we design a novel User Whim Agnostic (UWA) LoRA finetuning method that improves the model's ability to suppress uninformative utterances while preserving its tendency to communicate important instructions, yielding a >30% improvement in F-score. Finetuning the model also provides a 16x speedup by eliminating the need for in-context examples in the prompt. We further describe how such an assistant can be deployed on edge devices with no dependence on the cloud.
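The UWA method above builds on LoRA, which finetunes a model by training only a low-rank correction to each frozen weight matrix. As a minimal sketch of that mechanism (the dimensions, rank, and data here are illustrative, not the paper's actual configuration):

```python
import numpy as np

# Minimal LoRA sketch: a frozen weight W plus a trainable low-rank update
# (alpha / r) * B @ A. Shapes, rank, and scaling follow the standard LoRA
# formulation; all concrete values are illustrative assumptions.
d_out, d_in, r, alpha = 64, 64, 4, 8

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-init

def lora_forward(x):
    # Effective weight is W + (alpha / r) * B @ A; only A and B receive gradients.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B initialized to zero, the adapter starts as an exact no-op,
# so finetuning begins from the pretrained model's behavior.
assert np.allclose(lora_forward(x), W @ x)

# Trainable parameters shrink from d_out * d_in to r * (d_in + d_out).
full_params, lora_params = d_out * d_in, r * (d_in + d_out)
print(full_params, lora_params)  # 4096 512
```

The zero-initialized `B` is why LoRA finetuning can sharpen a specific behavior, here suppressing uninformative turns, without discarding the base model's instruction-giving ability.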