Recommendation systems must continuously adapt to evolving user behavior, yet the volume of data generated in large-scale streaming environments makes frequent full retraining impractical. This work investigates how targeted data selection can mitigate the performance degradation caused by temporal distribution drift while remaining scalable. We evaluate a range of representation choices and sampling strategies for curating small but informative subsets of user interaction data. Our results show that gradient-based representations, coupled with distribution matching, improve downstream model performance, yielding training efficiency gains while preserving robustness to drift. These findings highlight data curation as a practical mechanism for scalable monitoring and adaptive model updates in production-scale recommendation systems.
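To make the idea of pairing gradient-based representations with distribution matching concrete, the following is a minimal sketch under stated assumptions, not the method evaluated in this work. It assumes a logistic click model, so that each example's representation is its log-loss gradient with respect to the weights, and it swaps in a simple greedy mean-matching rule as the distribution-matching step. The function names `per_example_gradients` and `select_by_mean_matching`, the synthetic data, and the subset size are all hypothetical.

```python
import numpy as np


def per_example_gradients(X, y, w):
    """Log-loss gradient of a logistic model w.r.t. w, one row per example."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted interaction probability
    return (p - y)[:, None] * X          # shape (n_examples, n_features)


def select_by_mean_matching(G, k):
    """Greedily pick k rows of G whose mean tracks the mean of all rows."""
    target = G.mean(axis=0)
    chosen = []
    running_sum = np.zeros(G.shape[1])
    available = np.ones(len(G), dtype=bool)
    for step in range(1, k + 1):
        # Subset mean that would result from adding each candidate row.
        candidate_means = (running_sum + G) / step
        dist = np.linalg.norm(candidate_means - target, axis=1)
        dist[~available] = np.inf        # select without replacement
        best = int(np.argmin(dist))
        chosen.append(best)
        running_sum += G[best]
        available[best] = False
    return np.array(chosen)


# Synthetic stand-in for a window of user interaction data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
true_w = rng.normal(size=8)
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-(X @ true_w)))).astype(float)
w = np.zeros(8)                          # stand-in for current model weights

G = per_example_gradients(X, y, w)
subset = select_by_mean_matching(G, k=50)
random_subset = rng.choice(len(G), size=50, replace=False)

# How far each subset's mean gradient is from the full-data mean gradient.
gap_selected = np.linalg.norm(G[subset].mean(axis=0) - G.mean(axis=0))
gap_random = np.linalg.norm(G[random_subset].mean(axis=0) - G.mean(axis=0))
```

Comparing `gap_selected` with `gap_random` gives a rough sense of how closely a curated 10% subset tracks the full-data gradient statistics relative to uniform sampling. Matching only the first moment is a deliberate simplification; a kernel-based criterion would match more of the distribution at higher cost.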