We address vision-language-driven indoor navigation to assist visually impaired individuals in reaching a target location using images and natural language guidance. Traditional navigation systems are ineffective indoors because precise positioning data is unavailable. Our approach integrates vision and language models to generate step-by-step navigational instructions, enhancing accessibility and independence. We fine-tune the BLIP-2 model with Low-Rank Adaptation (LoRA) on a manually annotated indoor navigation dataset. We also propose an evaluation metric that refines the BERT F1 score by emphasizing directional and sequential cues, providing a more comprehensive measure of navigational performance. After LoRA fine-tuning, the model's ability to generate directional instructions improved significantly, overcoming limitations of the original BLIP-2 model.
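For context, the sketch below shows how LoRA adapters can be attached to a BLIP-2 checkpoint with the Hugging Face `peft` library. The checkpoint name, rank, scaling factor, and target attention projections are illustrative assumptions, not the exact configuration used in this work.

```python
# Minimal LoRA setup for BLIP-2 (a sketch; hyperparameters are assumptions).
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from peft import LoraConfig, get_peft_model

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
)

# Low-rank adapters are injected into the attention projections of the
# language model; the base weights stay frozen, so only a small fraction
# of parameters is trained on the annotated navigation data.
lora_config = LoraConfig(
    r=16,                               # assumed rank
    lora_alpha=32,                      # assumed scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed target layers
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports the trainable-parameter ratio
```

The model can then be fine-tuned with a standard captioning-style objective on image-instruction pairs from the annotated dataset.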
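To illustrate the evaluation idea, the following sketch blends the standard BERT F1 score with a token-level F1 computed only over directional and sequential cue words, so that errors such as "left" versus "right" or a swapped step order are penalized more heavily than generic wording differences. The cue-word list, the blending weight, and the helper functions are hypothetical; the paper's exact formulation may differ.

```python
# A hedged sketch of a direction-aware refinement of BERT F1 (assumed design).
from bert_score import score

# Hypothetical vocabulary of directional and sequential cue words.
DIRECTION_TOKENS = {"left", "right", "forward", "straight", "back",
                    "first", "then", "next", "after", "finally"}

def directional_f1(candidate: str, reference: str) -> float:
    """Token-level F1 restricted to directional/sequential cue words."""
    cand = {t for t in candidate.lower().split() if t in DIRECTION_TOKENS}
    ref = {t for t in reference.lower().split() if t in DIRECTION_TOKENS}
    if not cand or not ref:
        return 0.0
    overlap = len(cand & ref)
    precision, recall = overlap / len(cand), overlap / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def navigation_score(candidates, references, alpha=0.5):
    """Blend standard BERT F1 with the directional F1 (alpha is an assumed weight)."""
    _, _, bert_f1 = score(candidates, references, lang="en")
    dir_f1 = [directional_f1(c, r) for c, r in zip(candidates, references)]
    return [alpha * b.item() + (1 - alpha) * d for b, d in zip(bert_f1, dir_f1)]
```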