Revealing the Hidden Impact of Top-N Metrics on Optimization in Recommender Systems

The hyperparameters of recommender systems for top-n predictions are typically optimized to enhance the predictive performance of algorithms. To this end, the optimization algorithm, e.g., grid search or random search, searches for the best hyperparameter configuration according to an optimization-target metric, like nDCG or Precision. In contrast, the optimized algorithm internally optimizes a different loss function during training, like squared error or cross-entropy. To tackle this discrepancy, recent work has focused on generating loss functions better suited for recommender systems. Yet, when evaluating an algorithm using a top-n metric during optimization, another discrepancy between the optimization-target metric and the training loss has so far been ignored. During optimization, the top-n items are selected for computing a top-n metric, ignoring that they are selected from the recommendations of a model trained with an entirely different loss function. Item recommendations suitable for the optimization-target metric could lie outside the top-n recommended items, which has a hidden impact on optimization performance. Therefore, we analyze whether the top-n items are optimal for optimization-target top-n metrics. In pursuit of an answer, we exhaustively evaluate the predictive performance of 250 selection strategies besides selecting the top-n. We evaluate each selection strategy over twelve implicit feedback and eight explicit feedback data sets with eleven recommender systems algorithms. Our results show that there exist selection strategies other than top-n that increase predictive performance for various algorithms and recommendation domains. However, the performance of the top ~43% of selection strategies is not significantly different. We discuss the impact of our findings on optimization and re-ranking in recommender systems and feasible solutions.
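To make the notion of a selection strategy concrete, the following is a minimal sketch, not the method evaluated in this work. It assumes binary relevance and a toy ranked list, and the offset-based strategy `select_with_offset` is a hypothetical example; the 250 strategies themselves are not specified in this abstract. It only illustrates that the same model ranking can yield different nDCG values depending on which items are selected for the metric.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of a list of gains in ranked order."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_n(selected_items, relevant_items, n):
    """nDCG@n with binary relevance for an already selected item list."""
    gains = [1.0 if item in relevant_items else 0.0 for item in selected_items[:n]]
    ideal = dcg([1.0] * min(n, len(relevant_items)))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def select_top_n(ranked_items, n):
    """Standard strategy: take the n highest-ranked items."""
    return ranked_items[:n]


def select_with_offset(ranked_items, n, offset):
    """Hypothetical alternative: skip the first `offset` items, then take n."""
    return ranked_items[offset:offset + n]


# Toy data (assumed): items ordered by model score, two of them relevant.
ranked = ["a", "b", "c", "d", "e", "f"]
relevant = {"c", "d"}
n = 2

top_n_score = ndcg_at_n(select_top_n(ranked, n), relevant, n)
offset_score = ndcg_at_n(select_with_offset(ranked, n, 2), relevant, n)
print(top_n_score, offset_score)
```

In this toy case the relevant items sit just below the top-n cutoff, so the alternative selection scores higher than top-n. Whether and how often this happens on real data is the empirical question the evaluation addresses.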
