Shopping is a routine activity for sighted individuals, yet for people who are blind or have low vision (pBLV), locating and retrieving products in physical environments remains a challenge. This paper presents a multimodal wearable assistive system that integrates object detection with vision-language models to support independent product or item retrieval, with the goal of enhancing users' autonomy and sense of agency. The system operates through three phases: product search, which identifies target products using YOLO-World detection combined with embedding similarity and color histogram matching; product navigation, which provides spatialized sonification and VLM-generated verbal descriptions to guide users toward the target; and product correction, which verifies whether the user has reached the correct product and provides corrective feedback when necessary. Technical evaluation demonstrated promising performance across all modules, with product detection achieving near-perfect accuracy at close range and high accuracy when facing shelves within 1.5 m. VLM-based navigation achieved up to 94.4% accuracy, and correction accuracy exceeded 86% under optimal model configurations. These results demonstrate the system's potential to address the last-meter problem in assistive shopping. Future work will focus on user studies with pBLV participants and integration with multi-scale navigation ecosystems.
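To make the product-search phase concrete, the following is a minimal sketch of how a detected candidate crop could be scored against a reference image of the target product by fusing embedding cosine similarity with HSV color-histogram correlation. The function names, fusion weights, and the assumption of precomputed embeddings are illustrative and not taken from the paper.

```python
# Hypothetical sketch of the product-search matching step (not the paper's code):
# each YOLO-World detection crop is scored against a reference image of the
# target product by combining embedding similarity with color-histogram matching.
import cv2
import numpy as np

def color_hist(img_bgr, bins=(8, 8, 8)):
    """Normalized HSV color histogram, flattened to a 1-D vector."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, bins, [0, 180, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def match_score(crop_bgr, crop_emb, ref_bgr, ref_emb, w_emb=0.7, w_hist=0.3):
    """Weighted fusion of semantic (embedding) and appearance (histogram) cues."""
    emb_sim = cosine_sim(crop_emb, ref_emb)               # semantic similarity
    hist_sim = cv2.compareHist(color_hist(crop_bgr),
                               color_hist(ref_bgr),
                               cv2.HISTCMP_CORREL)        # color appearance match
    return w_emb * emb_sim + w_hist * hist_sim

# Usage (assuming detections with .crop and precomputed .emb embeddings):
# best = max(detections, key=lambda d: match_score(d.crop, d.emb, ref_img, ref_emb))
```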