A central role of personal-agent memory is to turn stored information and prior interactions into future-oriented assistance. In daily use, useful cues come from what the agent observes and how the user interacts with the agent, and the agent must carry them forward from the current request to similar future tasks. Existing memory benchmarks usually test dialogue recall or task improvement in isolation, leaving the trajectory from streaming observations to later assistance largely untested. We introduce StreamMemBench, a streaming benchmark that constructs a two-step task sequence around each evidence anchor from EgoLife egocentric streams. The initial task tests evidence use, while the follow-up task tests whether feedback and interaction experience are reused. Four metrics diagnose evidence recall, initial evidence use, feedback incorporation, and follow-up reuse. Experiments with eight memory systems across two backbones show that current systems often fail to use observed evidence or turn feedback into reliable follow-up behavior, even when evidence is stored or feedback is incorporated locally. StreamMemBench is publicly available at https://github.com/landian60/StreamMemBench.