This paper presents PISHYAR, a socially intelligent smart cane that combines socially aware navigation with multimodal human-AI interaction to support both physical mobility and interactive assistance. The system consists of two components: (1) a social navigation framework implemented on a Raspberry Pi 5 that integrates real-time RGB-D perception using an OAK-D Lite camera, YOLOv8-based object detection, COMPOSER-based collective activity recognition, D* Lite dynamic path planning, and haptic feedback via vibration motors for tasks such as locating a vacant seat; and (2) an agentic multimodal LLM-VLM interaction framework that integrates speech recognition, vision language models, large language models, and text-to-speech, with dynamic routing between voice-only and vision-only modes to enable natural voice-based communication, scene description, and object localization from visual input. The system is evaluated through a combination of simulation-based tests, real-world field experiments, and user-centered studies. Results from simulated and real indoor environments demonstrate reliable obstacle avoidance and socially compliant navigation, with an overall system accuracy of approximately 80% across different social conditions. Group activity recognition further shows robust performance across diverse crowd scenarios. In addition, a preliminary exploratory user study with eight blind and low-vision participants evaluates the agentic interaction framework through structured tasks and a UTAUT-based questionnaire, revealing high acceptance and positive perceptions of usability, trust, and perceived sociability. These results highlight the potential of PISHYAR as a multimodal assistive mobility aid that extends beyond navigation to provide socially interactive support for blind and low-vision users.