Visually impaired individuals face significant challenges in environmental perception. Traditional assistive technologies often lack adaptive intelligence, focusing on individual components rather than integrated systems. While Vision-Language Models (VLMs) offer a promising path to richer, integrated understanding, their deployment is severely limited by substantial computational requirements, often demanding dozens of gigabytes of memory. To address these gaps in computational efficiency and integrated design, this study proposes a dual technological innovation: a cross-modal differentiated quantization framework for VLMs and a scene-aware multi-agent system with vectorized memory. The quantization framework applies differentiated quantization strategies across modalities, reducing memory usage from 38 GB to 11.3 GB. The multi-agent system combines vectorized memory with a perception-memory-reasoning workflow to provide environmental information beyond the current view, achieving a latency of 2.83-3.52 s to the first speech output. Experiments show that the quantized 19B-parameter model suffers only a 2.05% performance drop on MMBench and retains an accuracy of 63.7 on OCR-VQA (original: 64.9), outperforming smaller models with an equivalent memory footprint. This research advances both computational efficiency and assistive technology, offering comprehensive assistance in scene perception, text recognition, and navigation.
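The cross-modal differentiated quantization idea can be illustrated with a minimal sketch: the language blocks of a VLM are loaded in 4-bit precision while the vision encoder is excluded from quantization and kept at 16-bit. The checkpoint path, the `vision_model` module name, and the use of bitsandbytes through Hugging Face transformers are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of cross-modal differentiated quantization with transformers +
# bitsandbytes (illustrative, not the paper's pipeline). Assumption: the VLM
# exposes its vision encoder under a module name such as "vision_model";
# adjust llm_int8_skip_modules to the actual module names of the checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "path/to/19b-vlm"  # placeholder checkpoint id

# Language blocks are quantized to 4-bit NF4; the smaller, more sensitive
# vision encoder is skipped and stays in 16-bit precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    llm_int8_skip_modules=["vision_model"],  # assumed module name
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Rough check of the resulting memory footprint.
print(f"~{model.get_memory_footprint() / 1e9:.1f} GB")
```

Keeping the vision tower at higher precision while compressing the much larger language stack is one way to obtain the kind of memory reduction reported above without degrading visual grounding.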
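The scene-aware vectorized memory can likewise be sketched: perceived scene descriptions are embedded and stored, and the reasoning step retrieves the most similar past observations to answer questions beyond the current view. The embedding model, class names, and cosine-similarity store below are illustrative assumptions rather than the paper's implementation.

```python
# Sketch of a vectorized scene memory: scene descriptions are embedded and
# retrieved by cosine similarity so a reasoning agent can reference places
# outside the current camera view. Components here are illustrative choices.
import numpy as np
from sentence_transformers import SentenceTransformer

class SceneMemory:
    def __init__(self, embedder: SentenceTransformer):
        self.embedder = embedder
        self.texts: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, description: str) -> None:
        """Store one perceived scene description with its embedding."""
        vec = self.embedder.encode(description, normalize_embeddings=True)
        self.texts.append(description)
        self.vectors.append(vec)

    def recall(self, query: str, k: int = 3) -> list[str]:
        """Return the k stored descriptions most similar to the query."""
        if not self.texts:
            return []
        q = self.embedder.encode(query, normalize_embeddings=True)
        sims = np.stack(self.vectors) @ q  # cosine similarity (unit vectors)
        top = np.argsort(-sims)[:k]
        return [self.texts[i] for i in top]

# Example: the perception agent writes, the reasoning agent recalls.
memory = SceneMemory(SentenceTransformer("all-MiniLM-L6-v2"))
memory.add("Entrance hall: glass doors ahead, elevator bank on the left.")
memory.add("Corridor: restroom sign on the right wall, stairs at the end.")
print(memory.recall("Where was the elevator?", k=1))
```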