Ranking recommendation algorithms is a challenging problem because model performance is sensitive to dataset characteristics such as sparsity, sequential structure, and scale. This sensitivity calls for a principled methodology for comparing algorithms fairly. Naive aggregation of performance metrics (e.g., averaging NDCG over benchmarks) can yield misleading rankings and thereby undermine practical model selection. To address this problem, we introduce a novel, data-driven ranking methodology based on the Bradley-Terry (BT) model. We demonstrate that the resulting ranking depends on key dataset statistics. Additionally, we propose a novel metric for evaluating ranking consistency and demonstrate the robustness of our ranking to incomplete data. Finally, we introduce a dataset-specific methodology for ranking algorithms on unseen datasets without running the models, relying on extensions of the Bradley-Terry framework, including BT trees and BT models with covariates.
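To make the core idea concrete, the following is a minimal sketch of how a basic Bradley-Terry ranking could be obtained from per-dataset metric values. It is an illustration under stated assumptions, not the authors' implementation: the function names `pairwise_wins` and `fit_bradley_terry`, the rule that an algorithm "wins" on a dataset when its metric is strictly higher, the use of the standard minorization-maximization (MM) updates for fitting, and the toy scores are all hypothetical. The BT model assumes that algorithm i beats algorithm j with probability p_i / (p_i + p_j), and the fitted strengths p induce the ranking.

```python
import numpy as np


def pairwise_wins(scores):
    """Count pairwise wins from a (n_datasets, n_algorithms) metric matrix.

    wins[i, j] is the number of datasets on which algorithm i scores
    strictly higher than algorithm j. Missing results encoded as NaN
    never count as a win or a loss, because comparisons with NaN are False.
    """
    scores = np.asarray(scores, dtype=float)
    n_alg = scores.shape[1]
    wins = np.zeros((n_alg, n_alg))
    for row in scores:
        # Boolean matrix: True where algorithm i outperforms algorithm j.
        wins += row[:, None] > row[None, :]
    return wins


def fit_bradley_terry(wins, n_iter=1000, tol=1e-10):
    """Fit BT strengths with MM updates; returns strengths summing to 1.

    Assumes every algorithm has at least one win and one loss and that
    the comparison graph is connected, so the maximum-likelihood
    estimate exists and is finite.
    """
    wins = np.asarray(wins, dtype=float)
    n = wins.shape[0]
    games = wins + wins.T           # number of comparisons per pair
    total_wins = wins.sum(axis=1)   # total wins of each algorithm
    p = np.full(n, 1.0 / n)         # uniform starting point
    for _ in range(n_iter):
        denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
        new_p = total_wins / denom
        new_p /= new_p.sum()        # fix the scale of the strengths
        converged = np.max(np.abs(new_p - p)) < tol
        p = new_p
        if converged:
            break
    return p


# Hypothetical NDCG values: rows are datasets, columns are algorithms A, B, C.
toy_scores = [
    [0.30, 0.25, 0.10],
    [0.28, 0.31, 0.12],
    [0.40, 0.35, 0.37],
    [0.22, 0.20, 0.21],
]
wins = pairwise_wins(toy_scores)
strengths = fit_bradley_terry(wins)
ranking = np.argsort(-strengths)  # algorithm indices, strongest first
```

In this sketch the ranking depends only on who beats whom on each dataset, not on the size of the metric gaps, which is one way a pairwise model can differ from averaging the metric across benchmarks. The dataset-aware extensions mentioned in the abstract (BT trees, BT models with covariates) are not covered here.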