The typical offline protocol for evaluating recommendation algorithms is to collect a dataset of user-item interactions, use part of it to train a model, and use the remaining data to measure how closely the model's recommendations match the observed user interactions. This protocol is straightforward, useful and practical, but it only captures the performance of a particular model trained at some point in the past. Online systems, however, evolve over time. Models should generally reflect such changes, so they are frequently retrained with recent data. But if this is the case, to what extent can we trust previous evaluations? How will a model perform when a different pattern (re)emerges? In this paper we propose a methodology to study how recommendation models behave when they are retrained. The idea is to profile algorithms according to their ability to retain past patterns (stability) on the one hand and to (quickly) adapt to changes (plasticity) on the other. We devise an offline evaluation protocol that provides detail on the long-term behavior of models and that is agnostic to datasets, algorithms and metrics. To illustrate the potential of this framework, we present preliminary results for three different types of algorithms on the GoodReads dataset. These results suggest that stability and plasticity profiles differ depending on the algorithmic technique, and that there may be a trade-off between stability and plasticity. Although additional experiments will be necessary to confirm these observations, they already illustrate the usefulness of the proposed framework for gaining insight into the long-term dynamics of recommendation models.
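One way to picture a retraining-aware evaluation of this kind is sketched below. The abstract does not give the exact definitions, so everything here is an assumption made for illustration: the function names, the chronological split into periods, the toy popularity recommender, the hit-rate metric, and in particular the reading of stability as the score of a retrained model on held-out data from earlier periods and of plasticity as its score on held-out data from the period it was just retrained on. It is not the authors' protocol.

```python
from collections import Counter
from statistics import mean


def split_periods(interactions, n_periods):
    """Sort (user, item, timestamp) events by time and cut them into n_periods chunks."""
    ordered = sorted(interactions, key=lambda e: e[2])
    n = len(ordered)
    return [ordered[i * n // n_periods:(i + 1) * n // n_periods]
            for i in range(n_periods)]


def train_popularity(events, k):
    """Toy stand-in for a recommender: the k most frequent items in the training events."""
    counts = Counter(item for _, item, _ in events)
    return [item for item, _ in counts.most_common(k)]


def hit_rate(top_k, test_events):
    """Fraction of held-out interactions whose item appears in the recommended list."""
    if not test_events:
        return 0.0
    return sum(item in top_k for _, item, _ in test_events) / len(test_events)


def profile(periods, history=None, k=1, holdout=0.5):
    """Retrain after every period; scores[t][s] is model t evaluated on the holdout of period s <= t.

    history=None trains on all periods seen so far; history=h keeps only the last h periods.
    """
    trains, tests = [], []
    for p in periods:
        cut = int(len(p) * (1 - holdout))  # earlier part trains, later part is held out
        trains.append(p[:cut])
        tests.append(p[cut:])
    scores = []
    for t in range(len(periods)):
        start = 0 if history is None else max(0, t - history + 1)
        events = [e for chunk in trains[start:t + 1] for e in chunk]
        model = train_popularity(events, k)
        scores.append([hit_rate(model, tests[s]) for s in range(t + 1)])
    return scores


def summarize(scores):
    """Assumed summary: stability = mean score on past periods, plasticity = mean score on the current one."""
    plasticity = mean(row[-1] for row in scores)
    past = [v for row in scores for v in row[:-1]]
    stability = mean(past) if past else float("nan")
    return stability, plasticity


def toy_log():
    """Synthetic log: item A dominates the first period, item B the next two."""
    items = ["A", "A", "A", "A", "B"] * 2 + ["B", "B", "B", "A", "A"] * 4
    return [(f"u{i % 3}", item, i) for i, item in enumerate(items)]


if __name__ == "__main__":
    periods = split_periods(toy_log(), 3)
    for name, history in [("cumulative", None), ("recent-only", 1)]:
        stability, plasticity = summarize(profile(periods, history=history))
        print(f"{name}: stability={stability:.3f} plasticity={plasticity:.3f}")
```

On this synthetic log, the variant trained on all past data scores higher on the assumed stability measure and lower on the assumed plasticity measure than the variant trained only on the latest period. That is the kind of trade-off such a profile is meant to surface, but it is a property of the toy data and says nothing about the GoodReads results.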