Text-to-image diffusion models have achieved high visual fidelity, yet precise control over scene semantics and fine-grained affective tone remains challenging. Human visual affect arises from the rapid integration of contextual meaning, including valence, arousal, and dominance, with perceptual cues such as color harmony, luminance contrast, texture variation, curvature, and spatial layout. However, current text-to-image models rarely encode affective and perceptual factors in a unified representation, which limits their ability to synthesize scenes with coherent and nuanced emotional intent. To address this gap, we construct EmoScene, a large-scale dual-space emotion dataset that jointly encodes affective dimensions and perceptual attributes, with contextual semantics provided as supporting annotations. EmoScene contains 1.2M images across more than three hundred real-world scene categories, each annotated with discrete emotion labels, continuous valence-arousal-dominance (VAD) values, perceptual descriptors, and textual captions. Multi-space analyses reveal how discrete emotions are distributed in the VAD space and how affect systematically correlates with scene-level perceptual factors. To benchmark EmoScene, we provide a lightweight reference baseline that injects dual-space controls into a frozen diffusion backbone via shallow cross-attention modulation, serving as a reproducible probe of the affect controllability enabled by dual-space supervision.
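To make the dual-space annotation concrete, the following minimal Python sketch shows one way a per-image record could be represented and flattened into a single control vector. It is an illustration only: the class name `EmoSceneRecord`, the field names, the key names in `PERCEPTUAL_KEYS`, the scalar encoding of each perceptual cue, and the value ranges ([-1, 1] for VAD, [0, 1] for perceptual attributes) are all assumptions, not the released EmoScene schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Perceptual cues named in the abstract. The key names and the scalar
# [0, 1] encoding are assumptions made for this sketch.
PERCEPTUAL_KEYS = (
    "color_harmony",
    "luminance_contrast",
    "texture_variation",
    "curvature",
    "spatial_layout",
)


@dataclass(frozen=True)
class EmoSceneRecord:
    """Hypothetical per-image annotation record (not the official schema)."""

    image_id: str
    scene_category: str
    emotion_label: str                 # discrete emotion label
    vad: Tuple[float, float, float]    # (valence, arousal, dominance), assumed in [-1, 1]
    perceptual: Dict[str, float]       # one value per PERCEPTUAL_KEYS entry, assumed in [0, 1]
    caption: str

    def __post_init__(self) -> None:
        # Validate the affective space.
        if len(self.vad) != 3 or any(not -1.0 <= v <= 1.0 for v in self.vad):
            raise ValueError("vad must hold three values in [-1, 1]")
        # Validate the perceptual space.
        missing = [k for k in PERCEPTUAL_KEYS if k not in self.perceptual]
        if missing:
            raise ValueError(f"missing perceptual attributes: {missing}")
        if any(not 0.0 <= self.perceptual[k] <= 1.0 for k in PERCEPTUAL_KEYS):
            raise ValueError("perceptual attributes must lie in [0, 1]")


def control_vector(record: EmoSceneRecord) -> List[float]:
    """Concatenate VAD and perceptual values into one flat dual-space vector."""
    return list(record.vad) + [record.perceptual[k] for k in PERCEPTUAL_KEYS]


# Usage with made-up values (not taken from the dataset).
example = EmoSceneRecord(
    image_id="img_000001",
    scene_category="beach",
    emotion_label="contentment",
    vad=(0.5, -0.25, 0.0),
    perceptual={
        "color_harmony": 0.75,
        "luminance_contrast": 0.5,
        "texture_variation": 0.25,
        "curvature": 0.5,
        "spatial_layout": 1.0,
    },
    caption="A calm beach at sunset.",
)
dual_space = control_vector(example)
```

A flat vector of this kind is one plausible input to a conditioning module such as the cross-attention modulation mentioned above. How the reference baseline actually encodes its controls is not specified in the abstract.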