Modern vision models are trained on very large, noisy datasets. Although these models acquire strong capabilities, they may not follow the user's intent and can fail to produce the desired results in aspects such as visual aesthetics, preferred style, and responsibility. In this paper, we focus on visual aesthetics and aim to align vision models with human aesthetic standards in a retrieval system. Advanced retrieval systems usually adopt a cascade of aesthetic models as re-rankers or filters, but these models are limited to low-level features such as saturation and perform poorly when stylistic, cultural, or knowledge contexts are involved. We find that using the reasoning ability of large language models (LLMs) to rephrase the search query and extend its aesthetic expectations can make up for this shortcoming. Based on this finding, we propose a preference-based reinforcement learning method that fine-tunes vision models to distill knowledge from both LLM reasoning and the aesthetic models, so that the vision models align better with human aesthetics. Because few benchmarks are designed for evaluating retrieval systems, we leverage large multimodal models (LMMs), with their strong abilities, to evaluate aesthetic performance. As aesthetic assessment is one of the most subjective tasks, we further propose a novel dataset named HPIR to benchmark alignment with human aesthetics and to validate the robustness of the LMM-based evaluation. Experiments demonstrate that our method significantly enhances the aesthetic behavior of vision models under several metrics. We believe the proposed algorithm can be a general practice for aligning vision models with human values.