People with blindness and low vision (pBLV) face substantial challenges in recognizing scenes comprehensively and identifying objects precisely in unfamiliar environments. In addition, because of their vision loss, pBLV have difficulty detecting and identifying potential tripping hazards on their own. In this paper, we present an approach that leverages a large vision-language model to enhance visual perception for pBLV, offering detailed and comprehensive descriptions of the surrounding environment and warning about potential risks. Our method first uses a large image tagging model, Recognize Anything (RAM), to identify the common objects present in a captured image. The recognition results and the user query are then integrated into a prompt tailored specifically for pBLV through prompt engineering. Given the prompt and the input image, a large vision-language model, InstructBLIP, generates a detailed and comprehensive description of the environment and identifies potential risks by analyzing the objects and scenes relevant to the prompt. We evaluate our approach through experiments on both indoor and outdoor datasets. The results show that our method recognizes objects accurately and provides insightful descriptions and analysis of the environment for pBLV.
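The prompt-construction step can be illustrated with a minimal sketch. It assumes the tagging model's output is already available as a list of strings and that the prompt is plain text; the function name `build_prompt` and the template wording are hypothetical and are not taken from the paper, and no RAM or InstructBLIP API is called here.

```python
from typing import Sequence


def build_prompt(tags: Sequence[str], query: str) -> str:
    """Combine recognized object tags and a user query into one text prompt.

    Hypothetical helper: the template below is an illustrative assumption,
    not the prompt used by the authors.
    """
    # Normalize the tags and drop duplicates while preserving their order.
    seen = set()
    unique = []
    for tag in tags:
        key = tag.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(key)

    objects = ", ".join(unique) if unique else "no objects recognized"

    # The resulting text would be passed, together with the image,
    # to a vision-language model.
    return (
        "You are assisting a person with blindness or low vision.\n"
        f"Objects recognized in the image: {objects}.\n"
        f"User question: {query.strip()}\n"
        "Describe the surroundings in detail and warn about potential "
        "risks such as tripping hazards."
    )


if __name__ == "__main__":
    print(build_prompt(["Chair", "stairs", "chair "], "What is in front of me?"))
```

Listing the recognized objects in the prompt is one plausible way to ground the language model's description in what the tagger actually found; the exact template would need to be tuned for the intended users.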