Learning Next Action Predictors from Human-Computer Interaction

Omar Shaikh, Valentin Teutschbein, Kanishk Gandhi, Yikun Chi, Nick Haber, Thomas Robinson, Nilam Ram, Byron Reeves, Sherry Yang, Michael S. Bernstein, Diyi Yang

From arXiv: 32 pages, 10 figures. See https://generalusermodels.github.io/nap

Truly proactive AI systems must anticipate what we will do next. This foresight demands far richer information than the sparse signals we type into our prompts: it demands reasoning over the entire context of what we see and do. We formalize this as next action prediction (NAP): given a sequence of a user's multimodal interactions with a computer (screenshots, clicks, sensor data), predict that user's next action. Progress on this task requires both new data and new modeling approaches. To scale data, we annotate longitudinal, naturalistic computer use with vision-language models. We release an open-source pipeline for performing this labeling on private infrastructure, and label over 360K actions across one month of continuous phone usage from 20 users, amounting to 1,800 hours of screen time. We then introduce LongNAP, a user model that combines parametric and in-context learning to reason over long interaction histories. LongNAP is trained via policy gradient methods to generate user-specific reasoning traces given some context, retrieve relevant traces from a library of past traces, and then apply the retrieved traces in-context to predict future actions. Using an LLM-as-judge evaluation metric (0-1 similarity to ground truth), LongNAP significantly outperforms supervised finetuning and prompted baselines on held-out data (by 79% and 39% respectively). Additionally, LongNAP generalizes to held-out users when trained across individuals. The space of next actions a user might take at any moment is unbounded, spanning thousands of possible outcomes. Despite this, 17.1% of LongNAP's predicted trajectories are well-aligned with what a user does next (LLM-judge score $\geq$ 0.5). This rises to 26% when we filter to highly confident predictions. In sum, we argue that learning from the full context of user behavior to anticipate user needs is now a viable task with substantial opportunity.
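
The headline numbers in the abstract are threshold rates: the share of predictions whose LLM-judge score is at least 0.5, computed either over all predictions or only over high-confidence ones. A minimal sketch of that computation follows. It is an illustration under stated assumptions, not the authors' evaluation code: the `Prediction` record, the `aligned_rate` helper, the `confidence` field, and the 0.7 confidence cutoff are all hypothetical, and the scores are toy values rather than results from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Prediction:
    """Hypothetical record for one predicted next action."""
    judge_score: float  # LLM-judge similarity to ground truth, in [0, 1]
    confidence: float   # assumed model confidence, in [0, 1]


def aligned_rate(
    preds: Sequence[Prediction],
    score_threshold: float = 0.5,
    min_confidence: float = 0.0,
) -> Optional[float]:
    """Share of predictions with judge_score >= score_threshold,
    among those with confidence >= min_confidence.

    Returns None when no prediction passes the confidence filter.
    """
    kept = [p for p in preds if p.confidence >= min_confidence]
    if not kept:
        return None
    hits = sum(1 for p in kept if p.judge_score >= score_threshold)
    return hits / len(kept)


# Toy values for illustration only; these are not results from the paper.
toy = [
    Prediction(judge_score=0.9, confidence=0.8),
    Prediction(judge_score=0.2, confidence=0.3),
    Prediction(judge_score=0.6, confidence=0.9),
    Prediction(judge_score=0.1, confidence=0.7),
    Prediction(judge_score=0.4, confidence=0.2),
]

overall = aligned_rate(toy)                       # 2 of 5 predictions score >= 0.5
confident = aligned_rate(toy, min_confidence=0.7)  # 2 of the 3 kept predictions
print(overall, confident)
```

Filtering shrinks the denominator as well as the numerator, so a higher rate on the confident subset means the confidence signal is informative, at the cost of making predictions for fewer moments.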
