Large language model (LLM) agents are increasingly deployed as scalable user simulators for recommender system evaluation. Yet existing simulators perceive recommendations through text or structured metadata rather than through the visual interfaces that real users browse. This is a critical gap, since attention over recommendation layouts is both visually driven and highly personalized. We investigate whether aligning a vision-language model's (VLM's) visual attention with user-specific gaze patterns can improve simulation fidelity. Analysis of a real-world eye-tracking dataset collected in a carousel-based recommendation setting reveals that users exhibit stable individual gaze patterns that are strongly predictive of click behavior. Building on this finding, we propose Fixation-Aligned Tuning for user Emulation (FixATE). Our approach first probes the VLM's internal visual attention via interpretability operators to obtain a slot-level relevance distribution comparable with human fixation, and then learns personalized soft prompts to steer the model's attention toward each user's characteristic fixation pattern. Experiments across three interpretability-based probing operators and two architecturally distinct VLM backbones demonstrate consistent improvements in both attention alignment and click prediction accuracy. These results suggest that making the model "see like the user" is a viable path toward simulators that more faithfully reproduce how users perceive and act in recommendation interfaces.
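To make the notion of a slot-level relevance distribution concrete, the following is a minimal sketch, not the FixATE implementation. It assumes that a probing operator yields one relevance score per image patch, that each carousel slot covers a rectangular block of patches, and that alignment is measured with a KL divergence against the user's fixation distribution. The abstract specifies neither the aggregation rule nor the alignment objective, so `slot_distribution`, `kl_divergence`, and the toy numbers are all hypothetical.

```python
import math


def slot_distribution(patch_scores, slot_boxes):
    """Aggregate per-patch relevance into a distribution over slots.

    patch_scores: 2D list (rows x cols) of per-patch relevance scores.
    slot_boxes: list of (row_start, row_end, col_start, col_end) tuples,
        end-exclusive, one per recommendation slot (assumed layout).
    """
    totals = []
    for r0, r1, c0, c1 in slot_boxes:
        # Sum the non-negative relevance of every patch inside the slot.
        total = sum(
            max(patch_scores[r][c], 0.0)
            for r in range(r0, r1)
            for c in range(c0, c1)
        )
        totals.append(total)
    mass = sum(totals)
    if mass == 0:
        # No relevance anywhere: fall back to a uniform distribution.
        return [1.0 / len(totals)] * len(totals)
    return [t / mass for t in totals]


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) with a small epsilon to avoid division by zero and log(0)."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Toy example: a 2x4 patch grid split into two 2x2 slots.
patch_scores = [
    [3.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
]
slot_boxes = [(0, 2, 0, 2), (0, 2, 2, 4)]

model_attention = slot_distribution(patch_scores, slot_boxes)
human_fixation = [0.5, 0.5]  # hypothetical share of fixation time per slot
alignment_gap = kl_divergence(human_fixation, model_attention)
```

In this toy case the model places more relevance on the first slot than the hypothetical user does, so `alignment_gap` is positive. Under these assumptions, a personalized soft prompt would be trained to reduce that gap for each user.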