Large language model (LLM)-based user simulation is increasingly adopted for evaluating search engines, recommender systems, and retrieval-augmented generation pipelines, yet most simulators remain opaque: it is difficult to determine why a simulated user made a particular choice or whether that choice is consistent with the intended user profile. Compounding this, recent research shows that LLMs can produce biased or discriminatory responses depending on user background characteristics such as language, education level, and cultural context, raising concerns about the equitable treatment of minority and disadvantaged groups. This half-day, in-person tutorial introduces a proposed design-and-audit framework that treats a user simulator as a verifiable engineering artefact composed of seven auditable components: a structured Persona, a task-aware Contract, matched human-vs-agent Execution, an auditable Trace, persona-aligned Verification, structured Feedback, and a Refinement loop that updates personas and contracts. Through two hands-on mini-labs, one on recommendation-list evaluation and one on search-query formulation, participants will inspect simulator behaviour end to end, distinguish diagnostic discrepancy analysis from statistical validation, and apply checks for fidelity, credibility, and demographic bias. The tutorial targets information retrieval and recommender systems researchers and practitioners interested in user behaviour simulation and responsible AI.