Audo-Sight: AI-driven Ambient Perception Across Edge-Cloud for Blind and Low Vision Users

Despite advances in assistive technologies, Blind and Low-Vision (BLV) individuals still face challenges in understanding their surroundings. Delivering concise, useful, and timely scene descriptions for ambient perception remains a long-standing accessibility problem. To address it, we introduce Audo-Sight, an AI-driven assistive system spanning edge and cloud that enables BLV individuals to perceive their surroundings through voice-based conversational interaction. Audo-Sight employs a set of expert and generic AI agents, each supported by dedicated processing pipelines distributed across edge and cloud. It analyzes each user query, weighing urgency and contextual information to infer the user's intent, and dynamically routes the query, along with a scene frame, to the most suitable pipeline. When users need a fast response, the system runs edge and cloud pipelines simultaneously: the edge quickly generates an initial response, while the cloud provides more detailed and accurate information. To combine these outputs seamlessly, we introduce the Response Fusion Engine, which fuses the fast edge response with the more accurate cloud output so that BLV users receive a timely, high-accuracy response. Systematic evaluation shows that, compared to a commercial cloud-based solution, Audo-Sight delivers speech output around 80% faster for urgent tasks and generates complete responses approximately 50% faster across all tasks, highlighting the effectiveness of distributing the system across edge and cloud. In a human evaluation, 62% of BLV participants preferred Audo-Sight over GPT-5, and another 23% rated the two as comparable.
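The routing and fusion flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the keyword-based urgency check, the `edge_pipeline` and `cloud_pipeline` stubs, the canned responses, and the concatenation-style `fuse` function are all hypothetical placeholders chosen only to show the control flow (route by urgency, run edge and cloud concurrently for urgent queries, speak the edge answer first, then follow up with the cloud detail).

```python
import asyncio
from dataclasses import dataclass
from typing import Callable

# Hypothetical keyword list standing in for the system's urgency/intent analysis.
URGENT_WORDS = {"car", "stairs", "traffic", "obstacle", "danger", "cross"}


@dataclass
class Query:
    text: str
    frame: bytes  # scene frame captured alongside the spoken query


def is_urgent(query: Query) -> bool:
    """Toy urgency check: true if any urgent keyword appears in the query."""
    words = query.text.lower().replace("?", " ").split()
    return any(word in URGENT_WORDS for word in words)


async def edge_pipeline(query: Query) -> str:
    """Placeholder for a small, fast on-device model."""
    await asyncio.sleep(0.01)
    return "Obstacle ahead."


async def cloud_pipeline(query: Query) -> str:
    """Placeholder for a larger, slower, more accurate cloud model."""
    await asyncio.sleep(0.05)
    return "A parked bicycle is blocking the sidewalk ahead."


def fuse(edge_text: str, cloud_text: str) -> str:
    """Placeholder fusion: phrase the cloud detail as a follow-up to the
    edge answer that has already been spoken."""
    return "More detail: " + cloud_text


async def answer(query: Query, speak: Callable[[str], None]) -> None:
    if not is_urgent(query):
        # Non-urgent queries go to a single pipeline (cloud, in this sketch).
        speak(await cloud_pipeline(query))
        return
    # Urgent queries: start both pipelines at once.
    edge_task = asyncio.create_task(edge_pipeline(query))
    cloud_task = asyncio.create_task(cloud_pipeline(query))
    edge_text = await edge_task
    speak(edge_text)  # fast initial response
    speak(fuse(edge_text, await cloud_task))  # refined follow-up


if __name__ == "__main__":
    asyncio.run(answer(Query("Is there a car coming?", b""), print))
```

In this sketch, time to first speech for an urgent query is bounded by the edge stub rather than the cloud stub, which mirrors the latency argument made above. A real fusion step would also need to reconcile cases where the cloud output contradicts the edge output, which plain concatenation does not handle.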
