Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms

Modern vision models are trained on very large, noisy datasets. While these models acquire strong capabilities, they may not follow the user's intent in certain respects, e.g., visual aesthetics, preferred style, and responsibility. In this paper, we target visual aesthetics and aim to align vision models with human aesthetic standards in a retrieval system. Advanced retrieval systems usually adopt a cascade of aesthetic models as re-rankers or filters, which are limited to low-level features such as saturation and perform poorly when stylistic, cultural, or knowledge contexts are involved. We find that using the reasoning ability of large language models (LLMs) to rephrase the search query and extend the aesthetic expectations can make up for this shortcoming. Based on these findings, we propose a preference-based reinforcement learning method that fine-tunes the vision models to distill knowledge from both LLM reasoning and the aesthetic models, so that the vision models align better with human aesthetics. Meanwhile, because few benchmarks are designed for evaluating retrieval systems, we leverage large multi-modality models (LMMs), relying on their strong abilities, to evaluate aesthetic performance. As aesthetic assessment is one of the most subjective tasks, to validate the robustness of LMMs, we further propose a novel dataset named HPIR to benchmark alignment with human aesthetics. Experiments demonstrate that our method significantly enhances the aesthetic behavior of the vision models under several metrics. We believe the proposed algorithm can be a general practice for aligning vision models with human values.
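The abstract does not spell out the training objective, so the following is only a minimal sketch of what a preference-based signal for retrieval could look like, not the paper's actual method. It assumes a generic Bradley-Terry style pairwise loss over cosine-similarity scores; the function names, the `beta` scale, and the toy embeddings are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_preference_loss(score_preferred, score_rejected, beta=1.0):
    """Negative log-likelihood that the preferred item outranks the rejected one.

    Assumed Bradley-Terry form: -log(sigmoid(beta * (s_pref - s_rej))).
    `beta` is a hypothetical scaling factor, not a value from the paper.
    """
    margin = beta * (score_preferred - score_rejected)
    # Numerically stable softplus(-margin).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy embeddings (hypothetical): one query and two candidate images.
query = [1.0, 0.0]
preferred_image = [0.9, 0.1]   # judged more aesthetic for this query
rejected_image = [0.1, 0.9]

s_pref = cosine_similarity(query, preferred_image)
s_rej = cosine_similarity(query, rejected_image)
loss = pairwise_preference_loss(s_pref, s_rej)
```

In this sketch the loss shrinks as the model scores the preferred image above the rejected one, which is the general behavior a preference-based fine-tuning objective would encourage.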

