Modeling long-history data suffers from attention dilution over long context windows, poor system efficiency, and catastrophic forgetting, and naive linear scaling approaches such as LastN fail under these conditions. We introduce Memento, a personalized retrieval-augmented framework that treats historical user engagements as a document corpus and ad requests as queries, retrieving relevant interactions via Maximal Marginal Relevance (MMR) to balance similarity with diversity. We identify two complementary applications: Representation Memento, which retrieves historical embeddings for feature augmentation, and Data Memento, which retrieves past training examples for multipass training. Through infrastructure co-design (temporal chunking, INT8 quantization, and asynchronous serving), Memento achieves 5-10$\times$ better resource efficiency than linear scaling. Memento serves daily request traffic with sub-10ms latency and yields a 0.25-0.3% Normalized Entropy gain on both click-through and conversion prediction. In production, Memento delivers a 1% CTR lift on Facebook Feed and Reels and a 1.2% CVR lift, scaling personalization to 365+ days of history.
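The MMR retrieval step described above can be illustrated with a minimal sketch. It is not the production implementation: the function name `mmr_select`, the trade-off weight `lam`, the use of cosine similarity, and the toy vectors are all assumptions made for illustration, since the abstract specifies neither the similarity function nor the weighting. The sketch greedily picks the history item that is most similar to the query while penalizing similarity to items already selected.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query, candidates, k, lam=0.5):
    """Greedy Maximal Marginal Relevance selection (illustrative, hypothetical helper).

    query      -- embedding of the request (assumed representation)
    candidates -- embeddings of historical engagements
    k          -- number of items to retrieve
    lam        -- assumed trade-off: 1.0 is pure relevance, lower values favor diversity
    Returns the indices of the selected candidates in selection order.
    """
    relevance = [cosine(query, c) for c in candidates]
    selected = []
    remaining = list(range(len(candidates)))

    while remaining and len(selected) < k:
        def mmr_score(i):
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1.0 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)

    return selected


# Toy data: items 0 and 1 are near-duplicates, item 2 covers a different direction.
query = [1.0, 1.0, 0.0]
history = [
    [1.0, 0.05, 0.0],
    [1.0, 0.10, 0.0],
    [0.0, 1.00, 0.0],
]

print(mmr_select(query, history, k=2, lam=0.5))
print(mmr_select(query, history, k=2, lam=1.0))
```

In this toy setup, `lam=1.0` reduces to plain top-k by similarity and returns the two near-duplicate items, whereas `lam=0.5` keeps the most relevant item and then prefers the dissimilar one. That is the similarity-versus-diversity balance the abstract attributes to MMR.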