Representing artistic style is challenging because style is deeply entangled with semantic content. We propose StyleDecoupler, an information-theoretic framework built on a key insight: multi-modal vision models encode both style and content, whereas uni-modal models suppress style in favor of content-invariant features. Using uni-modal representations as content-only references, we isolate pure style features from multi-modal embeddings through mutual information minimization. StyleDecoupler operates as a plug-and-play module on top of frozen Vision-Language Models, requiring no fine-tuning. We also introduce WeART, a large-scale benchmark of 280K artworks spanning 152 styles and 1,556 artists. Experiments show state-of-the-art style-retrieval performance on both WeART and WikiART, and the learned features enable applications such as style relationship mapping and generative model evaluation. We release our method and dataset at this URL.
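A minimal sketch of the core idea, under assumptions not fixed by the abstract: the batch tensors stand in for outputs of a frozen multi-modal encoder (e.g., a CLIP image tower) and a frozen uni-modal content encoder (e.g., DINOv2), the `StyleHead` module is a hypothetical plug-and-play projection, and mutual information is approximated by a cross-covariance (linear-HSIC-style) penalty rather than the paper's actual estimator.

```python
# Hypothetical sketch of style/content decoupling on frozen encoders.
# Assumptions (not from the abstract): dummy tensors replace real encoder
# outputs, and MI minimization is approximated by a cross-covariance penalty.
import torch
import torch.nn as nn


class StyleHead(nn.Module):
    """Small trainable projection on top of frozen multi-modal embeddings."""

    def __init__(self, dim_in: int = 768, dim_style: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(dim_in, dim_in), nn.GELU(), nn.Linear(dim_in, dim_style)
        )

    def forward(self, z_multimodal: torch.Tensor) -> torch.Tensor:
        return self.proj(z_multimodal)


def cross_cov_penalty(s: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    """Squared Frobenius norm of the cross-covariance between style features
    s and content features c; zero iff they are linearly uncorrelated. A crude
    differentiable surrogate for mutual information minimization."""
    s = s - s.mean(0, keepdim=True)
    c = c - c.mean(0, keepdim=True)
    cov = s.T @ c / (s.shape[0] - 1)
    return (cov ** 2).sum()


# Dummy batch standing in for frozen-encoder outputs.
B = 64
z_mm = torch.randn(B, 768)   # multi-modal embedding: style + content
z_uni = torch.randn(B, 768)  # uni-modal embedding: content-only reference

head = StyleHead()
opt = torch.optim.AdamW(head.parameters(), lr=1e-4)

style = head(z_mm)
# Encourage non-degenerate style features (variance term avoids collapse)
# while decorrelating them from the content-only reference.
info_term = -style.var(dim=0).mean()
decouple_term = cross_cov_penalty(style, z_uni)
loss = info_term + decouple_term
loss.backward()
opt.step()
```

Because only `StyleHead` receives gradients, both encoders stay frozen, matching the plug-and-play, no-fine-tuning setting the abstract describes; the specific anti-collapse and MI terms here are illustrative placeholders.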