Understanding how people perceive and evaluate interior spaces is essential for designing environments that promote well-being. However, predicting aesthetic experiences remains difficult due to the subjective nature of perception and the complexity of visual responses. This study introduces a dual-branch CNN-LSTM framework that fuses visual features with eye-tracking signals to predict aesthetic evaluations of residential interiors. We collected a dataset of 224 interior design videos paired with synchronized gaze data from 28 participants who rated 15 aesthetic dimensions. The proposed model attains 72.2% accuracy on objective dimensions (e.g., light) and 66.8% on subjective dimensions (e.g., relaxation), outperforming state-of-the-art video baselines and showing clear gains on subjective evaluation tasks. Notably, models trained with eye-tracking data retain comparable performance when deployed with visual input alone. Ablation experiments further reveal that pupil responses contribute most to objective assessments, while the combination of gaze and visual cues enhances subjective evaluations. These findings highlight the value of incorporating eye-tracking data as privileged information during training, enabling more practical tools for aesthetic assessment in interior design.