Sapling Similarity: a performing and interpretable memory-based tool for recommendation

Many bipartite networks describe systems in which an edge represents a relation between a user and an item. Measuring the similarity between users or between items is the basis of memory-based collaborative filtering, a widely used method for building recommender systems that propose items to users. When the edges of the network are unweighted, the popular common-neighbors-based approaches allow only positive similarity values and therefore neglect the possibility, and the effect, of two users (or two items) being very dissimilar. Moreover, they underperform model-based (machine learning) approaches, although they provide higher interpretability. Inspired by how Decision Trees work, we propose the Sapling Similarity, a method to compute similarity that also allows negative values. The key idea is to look at how the information that a user is connected to an item influences our prior estimate of the probability that another user is connected to the same item: if the probability is reduced, the similarity between the two users is negative; otherwise, it is positive. We show that, when used to build memory-based collaborative filtering, Sapling Similarity provides better recommendations than existing similarity metrics. We then compare Sapling Similarity Collaborative Filtering (SSCF, a hybrid of the item-based and user-based approaches) with state-of-the-art models on standard datasets. Although SSCF depends on only one straightforward hyperparameter, it achieves comparable or higher recommendation accuracy and outperforms all other models on the Amazon-Book dataset, while retaining the high explainability of memory-based approaches.
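To make the key idea concrete, the following is a minimal sketch of a signed similarity on an unweighted user-item network. It is not the Sapling Similarity formula itself, which the abstract does not state; it only illustrates the sign rule described above by comparing a prior estimate with an estimate conditioned on another user's connections. The function names `signed_similarity` and `score_item`, the toy users, and the plain difference used as the similarity value are all assumptions made for illustration.

```python
def signed_similarity(items_a, items_b, n_items):
    """Toy signed similarity of user b with respect to user a.

    Illustrative assumption, not the paper's formula: the value is the
    change in the estimated probability that b is connected to an item
    once we learn that a is connected to it.
    """
    if not items_a or n_items <= 0:
        raise ValueError("user a needs at least one item and n_items must be positive")
    # Prior: fraction of all items that b is connected to.
    prior = len(items_b) / n_items
    # Conditional estimate: fraction of a's items that b is also connected to.
    conditional = len(items_a & items_b) / len(items_a)
    # Negative when knowing a's connection lowers the estimate for b.
    return conditional - prior


def score_item(target, item, users, n_items):
    """Similarity-weighted vote of the other users for one item (user-based)."""
    return sum(
        signed_similarity(users[target], items, n_items)
        for name, items in users.items()
        if name != target and item in items
    )


# Hypothetical toy network with 10 items, labelled 0..9.
N_ITEMS = 10
users = {
    "alice": {0, 1, 2, 3},
    "bob": {0, 1, 2, 4},
    "carol": {4, 5, 6, 7, 8, 9},
}

sim_alice_bob = signed_similarity(users["alice"], users["bob"], N_ITEMS)
sim_alice_carol = signed_similarity(users["alice"], users["carol"], N_ITEMS)
score_item_4 = score_item("alice", 4, users, N_ITEMS)

print(sim_alice_bob, sim_alice_carol, score_item_4)
```

In this toy network, bob shares most of alice's items and gets a positive value, while carol shares none and gets a negative one, so carol's connection to item 4 counts against recommending it to alice. A similarity based only on common neighbors would assign carol a value of zero at best and could not express this. Note that the plain difference used here is not symmetric in the two users.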
