The typical offline protocol for evaluating recommendation algorithms is to collect a dataset of user-item interactions, use part of it to train a model, and use the remaining data to measure how closely the model's recommendations match the observed user interactions. This protocol is straightforward, useful and practical, but it only provides a snapshot of performance. Online systems, however, evolve over time, and it is generally good practice to retrain models frequently with recent data. If models are retrained, to what extent can we trust previous evaluations? How will a model perform when a different pattern (re)emerges? In this paper we propose a methodology to study how recommendation models behave when they are retrained. The idea is to profile algorithms according to their ability, on the one hand, to retain past patterns (stability) and, on the other hand, to (quickly) adapt to changes (plasticity). We devise an offline evaluation protocol that details the long-term behavior of models and that is agnostic to datasets, algorithms and metrics. To illustrate the potential of this framework, we present preliminary results for three different types of algorithms on the GoodReads dataset. These results suggest that stability and plasticity profiles differ depending on the algorithmic technique, and that there may be a trade-off between stability and plasticity. We further discuss the potential and limitations of the proposal and outline some possible improvements.
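To make the notions of stability and plasticity concrete, the following is a minimal sketch of one way such a profile could be computed offline. It is an illustration under stated assumptions, not the protocol proposed in the paper: the data is assumed to be split into time-ordered chunks, each with its own holdout; the model is assumed to be retrained on the most recent chunk only; plasticity is taken to be the score on the current holdout and stability the mean score on all earlier holdouts. The toy popularity model, the hit-rate metric, and the names `train_popularity`, `hit_rate_at_k` and `profile` are all hypothetical.

```python
from collections import Counter


def train_popularity(interactions):
    """Toy stand-in model: rank items by interaction count (ties by item id)."""
    counts = Counter(item for _, item in interactions)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked]


def hit_rate_at_k(ranking, holdout, k):
    """Fraction of held-out (user, item) pairs whose item is in the top-k."""
    if not holdout:
        return 0.0
    top_k = set(ranking[:k])
    return sum(1 for _, item in holdout if item in top_k) / len(holdout)


def profile(chunks, k=1):
    """chunks: time-ordered list of (train, holdout) pairs.

    Assumption: at each step the model is retrained on the newest chunk only.
    Plasticity = score on the current holdout.
    Stability  = mean score on all earlier holdouts (None at the first step).
    """
    rows = []
    for t, (train, holdout) in enumerate(chunks):
        model = train_popularity(train)
        plasticity = hit_rate_at_k(model, holdout, k)
        past = [hit_rate_at_k(model, h, k) for _, h in chunks[:t]]
        stability = sum(past) / len(past) if past else None
        rows.append({"step": t, "plasticity": plasticity, "stability": stability})
    return rows


# Two chunks in which the dominant item switches from "a" to "b".
chunks = [
    ([("u1", "a"), ("u2", "a"), ("u3", "b")], [("u4", "a")]),
    ([("u1", "b"), ("u2", "b"), ("u3", "a")], [("u4", "b")]),
]
rows = profile(chunks, k=1)
```

In this toy setting the model retrained on the second chunk adapts fully to the new pattern but no longer recovers the old one, which is the kind of trade-off the profile is meant to expose. Any model, metric or retraining scheme could be substituted for the stand-ins used here.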