Smart glasses enhance interactions with the environment by using head-mounted cameras to observe the user's viewpoint, but they lack the visual feedback that supports common interactions. We introduce Gazeify then Voiceify, a multimodal approach that enables object selection via gaze and voice on displayless smart glasses. Users select a physical object with their gaze, and the system generates a digital mask of the object and a spoken description of its semantics. Users can then correct selection errors through free-form conversation. To demonstrate the approach, we developed an interactive system that integrates advanced object segmentation and detection with a vision-language model. User studies show that participants achieved correct gaze selection in 53% of task trials and used voice disambiguation to correct 58% of the remaining errors. Participants also rated the system as likable, useful, and easy to use.