The typical offline protocol for evaluating recommendation algorithms is to collect a dataset of user-item interactions, use part of it to train a model, and use the remaining data to measure how closely the model's recommendations match the observed user interactions. This protocol is straightforward, useful and practical, but it only provides a snapshot of performance. Online systems, however, evolve over time, and it is generally good practice to retrain models frequently with recent data. If so, to what extent can we trust previous evaluations? How will a model perform when a different pattern (re)emerges? In this paper we propose a methodology to study how recommendation models behave when they are retrained. The idea is to profile algorithms according to their ability to retain past patterns (stability) on the one hand and to (quickly) adapt to changes (plasticity) on the other. We devise an offline evaluation protocol that provides detail on the long-term behavior of models and that is agnostic to datasets, algorithms and metrics. To illustrate the potential of this framework, we present preliminary results for three different types of algorithms on the GoodReads dataset. These results suggest that stability and plasticity profiles differ depending on the algorithmic technique, and that there may be a trade-off between stability and plasticity. We further discuss the potential and limitations of the proposal and outline some possible improvements.
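To make the notions of stability and plasticity concrete, the following is a minimal sketch and not the protocol proposed in the paper. It assumes the interaction log is split into chronological periods, that the model is retrained after each period, and that each retrained model is scored both on the holdout of the first period (a stability proxy) and on the holdout of the current period (a plasticity proxy). The popularity model, the hit-rate metric, and all names (`train_popularity`, `hit_rate_at_k`, `profile`, `window`) are hypothetical choices made only for illustration.

```python
from collections import Counter


def train_popularity(interactions, window=None):
    """Toy model (an assumption): rank items by interaction count.

    If `window` is set, only the most recent `window` events are used,
    which mimics a model that forgets older data when retrained.
    """
    recent = interactions if window is None else interactions[-window:]
    counts = Counter(item for _, item in recent)
    return [item for item, _ in counts.most_common()]


def hit_rate_at_k(ranking, holdout, k):
    """Fraction of held-out (user, item) pairs whose item is in the top k."""
    if not holdout:
        return 0.0
    top_k = set(ranking[:k])
    return sum(item in top_k for _, item in holdout) / len(holdout)


def profile(periods, k=1, window=None):
    """Retrain after every period and record two scores per step.

    periods: chronological list of (train, holdout) pairs, each a list
    of (user, item) tuples.
    stability:  score on the FIRST period's holdout (old pattern).
    plasticity: score on the CURRENT period's holdout (new pattern).
    """
    seen, rows = [], []
    for step, (train, holdout) in enumerate(periods):
        seen.extend(train)
        ranking = train_popularity(seen, window)
        rows.append({
            "step": step,
            "stability": hit_rate_at_k(ranking, periods[0][1], k),
            "plasticity": hit_rate_at_k(ranking, holdout, k),
        })
    return rows


if __name__ == "__main__":
    # Synthetic data: users prefer item "a" in period 0 and "b" in period 1.
    periods = [
        ([("u1", "a"), ("u2", "a"), ("u3", "a")], [("u4", "a")]),
        ([("u1", "b"), ("u2", "b")], [("u4", "b")]),
    ]
    print(profile(periods, k=1))            # full history
    print(profile(periods, k=1, window=2))  # short memory
```

On this synthetic data, the full-history variant keeps recommending the old favorite after the shift, while the windowed variant follows the new pattern and loses the old one. This is the kind of trade-off the abstract refers to, shown here only on toy data and under the assumptions stated above.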