People with blindness and low vision (pBLV) encounter substantial challenges in comprehensive scene recognition and precise object identification in unfamiliar environments. Additionally, due to vision loss, pBLV have difficulty detecting and identifying potential tripping hazards on their own. In this paper, we present a pioneering approach that leverages a large vision-language model to enhance visual perception for pBLV, offering detailed and comprehensive descriptions of the surrounding environment and providing warnings about potential risks. Our method begins by leveraging a large image tagging model (i.e., Recognize Anything (RAM)) to identify all common objects present in the captured images. The recognition results and user query are then integrated into a prompt tailored specifically for pBLV using prompt engineering. By combining the prompt and input image, a large vision-language model (i.e., InstructBLIP) generates detailed and comprehensive descriptions of the environment and identifies potential risks by analyzing the objects and scenes relevant to the prompt. We evaluate our approach through experiments conducted on both indoor and outdoor datasets. Our results demonstrate that our method recognizes objects accurately and provides insightful descriptions and analysis of the environment for pBLV.
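The prompt-engineering step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, prompt template, and example tags are all hypothetical, standing in for the actual prompt tailored for pBLV. In the full pipeline, the tag list would come from RAM and the resulting prompt would be passed, together with the image, to InstructBLIP.

```python
def build_prompt(tags: list[str], user_query: str) -> str:
    """Combine recognized object tags and the user's query into a single
    prompt tailored for blind and low-vision (pBLV) users.

    Illustrative sketch: in the described system, `tags` would be produced
    by an image tagging model (e.g., RAM), and the returned prompt would be
    fed with the input image to a vision-language model (e.g., InstructBLIP).
    """
    tag_str = ", ".join(tags)
    return (
        "You are assisting a blind or low-vision user. "
        f"The image contains: {tag_str}. "
        f"Answer the user's question: {user_query} "
        "Describe the surrounding environment in detail and warn about "
        "any tripping hazards or other risks relevant to safe navigation."
    )

# Example usage with tags an image tagging model might return:
prompt = build_prompt(["sidewalk", "bicycle", "curb"], "Is it safe to walk ahead?")
```

The key design point is that the prompt fuses three sources of context (recognized objects, the user's query, and pBLV-specific safety instructions), so the vision-language model's answer is grounded in both the scene content and the user's needs.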