The rapid advancement of large multimodal models (LMMs) has significantly propelled the integration of artificial intelligence into practical applications. Visual Question Answering (VQA) systems, which can process multimodal data including vision, text, and audio, hold great potential for assisting the visually impaired (VI) community in navigating complex and dynamic real-world environments. However, existing VI-assistive LMMs overlook the emotional needs of VI individuals, and current benchmarks lack emotional evaluation of these models. To address these gaps, this paper introduces the EmoAssist Benchmark, a comprehensive benchmark designed to evaluate the assistive performance of LMMs for the VI community; to the best of our knowledge, it is the first such benchmark to treat emotional intelligence as a key consideration. Furthermore, we propose the EmoAssist Model, an emotion-assistive LMM designed specifically for the VI community, which uses Direct Preference Optimization (DPO) to align its outputs with human emotional preferences. Experimental results demonstrate that the EmoAssist Model significantly improves the recognition of VI users' implicit emotions and intentions, delivers empathetic responses, and provides actionable guidance. In particular, it improves the Empathy and Suggestion metrics on the EmoAssist Benchmark by 147.8% and 89.7%, respectively, over the pre-tuning LMM, and even outperforms state-of-the-art LMMs such as GPT-4o.
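For readers unfamiliar with Direct Preference Optimization, the sketch below illustrates the standard DPO objective in PyTorch. It is a minimal, illustrative example, not the authors' training code: the function name `dpo_loss`, the default `beta`, and the toy batch are assumptions. It presumes per-response log-probabilities have already been computed from the policy being tuned and a frozen reference model, and it scores a human-preferred (e.g., more empathetic) response against a rejected one.

```python
# Minimal sketch of the DPO loss (Rafailov et al., 2023), assuming
# summed per-response log-probabilities are precomputed for both the
# policy being tuned and a frozen reference model. Illustrative only.
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Push the policy to prefer the chosen (e.g., empathetic) response
    over the rejected one, relative to the frozen reference model."""
    # Log-ratios of policy vs. reference for each response in the pair.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Bradley-Terry-style preference loss, scaled by the temperature beta.
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()


if __name__ == "__main__":
    # Toy usage: random log-probabilities for a batch of 4 preference pairs.
    b = 4
    loss = dpo_loss(torch.randn(b), torch.randn(b),
                    torch.randn(b), torch.randn(b))
    print(loss.item())
```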