We study a web-deployed, tool-augmented LLM health coach with real users. In a pilot with seven users (280 rated turns), offline policy evaluation (OPE) over factorized decision heads (Tool/Style) shows that a uniform heavy-tool policy raises average estimated value on the logs but harms specific subgroups, most notably users with low health literacy and high self-efficacy. A lightweight simulator with hidden archetypes further shows that adding a small early information-gain bonus reliably shortens trait identification and improves goal success and pass@3. Together, these early findings suggest an evaluation-first path to personalization: freeze the generator, learn subgroup-aware decision heads on typed rewards (objective tool outcomes and satisfaction), and always report per-archetype metrics to surface subgroup harms that averages obscure.
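As a minimal sketch of how per-archetype OPE can surface subgroup harms that an average obscures, the snippet below computes an inverse-propensity-scored (IPS) value estimate overall and per archetype. The log schema (archetype label, behavior-policy probability, target-policy probability, reward) and the example numbers are illustrative assumptions, not the paper's actual data or estimator.

```python
# Hypothetical per-archetype offline policy evaluation via inverse
# propensity scoring (IPS). Field names and numbers are illustrative.
from collections import defaultdict

def ips_value_by_archetype(logs):
    """Average IPS-weighted reward overall and per archetype.

    Each log entry: (archetype, p_logged, p_target, reward), where
    p_logged / p_target are the behavior / target policy's probability
    of the logged action. Returns (overall_value, per_archetype_values).
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for archetype, p_log, p_tgt, reward in logs:
        w = p_tgt / p_log                  # importance weight
        totals[archetype] += w * reward
        counts[archetype] += 1
    per_arch = {a: totals[a] / counts[a] for a in totals}
    overall = sum(totals.values()) / sum(counts.values())
    return overall, per_arch

# Toy logs: a heavy-tool target policy (p_tgt > p_log) looks good on
# average, while the low-literacy / high-self-efficacy archetype is harmed.
logs = [
    ("low_lit_high_eff", 0.5, 0.9, 0.2),
    ("low_lit_high_eff", 0.5, 0.9, 0.1),
    ("high_lit", 0.5, 0.9, 0.9),
    ("high_lit", 0.5, 0.9, 0.8),
]
overall, per_arch = ips_value_by_archetype(logs)
```

Reporting `per_arch` alongside `overall` is the point: in the toy logs the overall estimate is high even though one archetype's value is far lower, mirroring the subgroup harm described above.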