The rapid advancement of large multimodal models (LMMs) has significantly propelled the integration of artificial intelligence into practical applications. Visual Question Answering (VQA) systems, which can process multimodal data including vision, text, and audio, hold great potential for assisting the visually impaired (VI) community in navigating complex and dynamic real-world environments. However, existing VI-assistive LMMs overlook the emotional needs of VI individuals, and current benchmarks lack emotional evaluation of these models. To address these gaps, this paper introduces the EmoAssist Benchmark, a comprehensive benchmark designed to evaluate the assistive performance of LMMs for the VI community; to the best of our knowledge, it is the first such benchmark to treat emotional intelligence as a key consideration. Furthermore, we propose the EmoAssist Model, an emotion-assistive LMM designed specifically for the VI community, which uses Direct Preference Optimization (DPO) to align its outputs with human emotional preferences. Experimental results demonstrate that the EmoAssist Model significantly improves the recognition of VI users' implicit emotions and intentions, delivers empathetic responses, and provides actionable guidance. In particular, it improves the Empathy and Suggestion metrics on the EmoAssist Benchmark by 147.8% and 89.7%, respectively, over the pre-tuning LMM, and even outperforms state-of-the-art LMMs such as GPT-4o.
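For readers unfamiliar with Direct Preference Optimization, the sketch below illustrates the standard DPO objective in PyTorch. It is a minimal, illustrative example, not the authors' training code: the function name `dpo_loss`, the default `beta`, and the toy batch are assumptions. It presumes per-response log-probabilities have already been computed from the policy being tuned and a frozen reference model, and it scores a human-preferred (e.g., more empathetic) response against a rejected one.

```python
# Minimal sketch of the DPO loss (Rafailov et al., 2023), assuming
# summed per-response log-probabilities are precomputed for both the
# policy being tuned and a frozen reference model. Illustrative only.
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Push the policy to prefer the chosen (e.g., empathetic) response
    over the rejected one, relative to the frozen reference model."""
    # Log-ratios of policy vs. reference for each response in the pair.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Bradley-Terry-style preference loss, scaled by the temperature beta.
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()


if __name__ == "__main__":
    # Toy usage: random log-probabilities for a batch of 4 preference pairs.
    b = 4
    loss = dpo_loss(torch.randn(b), torch.randn(b),
                    torch.randn(b), torch.randn(b))
    print(loss.item())
```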