User simulators are essential for evaluating search systems, but they primarily reproduce user actions without modeling the underlying thought process. Large-scale interaction logs record what users do, but not what they might be thinking or feeling, such as confusion or satisfaction. We present a framework for inferring cognitive traces from behavioral logs. Our method uses a multi-agent LLM system grounded in Information Foraging Theory (IFT) and validated by human experts. We annotate three public datasets (AOL, Stack Overflow, and MovieLens), producing over 530,000 cognitive labels across 50,000 sessions. A cross-dataset evaluation with a shuffled-label control reveals that cognitive labels provide the strongest signal where behavioral features are weakest: on MovieLens, the cognitive model improves F1 by up to 6.6% over the behavioral baseline and 1.8% above the shuffled control, while on AOL, where click patterns are highly predictive, improvements are near zero. We release the annotation collection on HuggingFace, an open-source annotation tool, and all experimental code to support future work on cognitively aware user simulation.