On tabular data, a significant body of literature has shown that current deep learning (DL) models perform at best comparably to Gradient Boosted Decision Trees (GBDTs), while significantly underperforming them on outlier data. However, these works often study idealized problem settings that may fail to capture the complexities of real-world scenarios. We identify a natural tabular data setting where DL models can outperform GBDTs: tabular Learning-to-Rank (LTR) under label scarcity. Tabular LTR applications, including search and recommendation, often have an abundance of unlabeled data and scarce labeled data. We show that DL rankers can utilize unsupervised pretraining to exploit this unlabeled data. In extensive experiments over both public and proprietary datasets, we show that pretrained DL rankers consistently outperform GBDT rankers on ranking metrics -- sometimes by as much as 38% -- both overall and on outliers.
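
The abstract does not specify the pretraining objective or ranking loss, so the following is a minimal sketch of the pretrain-then-fine-tune pipeline it describes, assuming a masked-feature reconstruction pretext task (one common self-supervised objective for tabular data) and a RankNet-style pairwise loss for fine-tuning. The encoder architecture, masking rate, dimensions, and random stand-in tensors are illustrative assumptions, not the paper's method.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: each item is a row of num_features tabular features.
num_features, hidden = 16, 64

# Simple MLP encoder shared between pretraining and ranking fine-tuning.
encoder = nn.Sequential(
    nn.Linear(num_features, hidden), nn.ReLU(),
    nn.Linear(hidden, hidden), nn.ReLU(),
)

# --- Stage 1: unsupervised pretraining on abundant unlabeled rows ---
# Masked-feature reconstruction: zero out a random subset of entries and
# train encoder + decoder to reconstruct the masked values.
decoder = nn.Linear(hidden, num_features)
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3
)

unlabeled = torch.randn(1024, num_features)  # stand-in for unlabeled data
for _ in range(100):
    mask = (torch.rand_like(unlabeled) < 0.3).float()  # mask ~30% of entries
    corrupted = unlabeled * (1 - mask)
    recon = decoder(encoder(corrupted))
    # Reconstruction loss computed only on the masked entries.
    loss = ((recon - unlabeled) ** 2 * mask).sum() / mask.sum()
    opt.zero_grad(); loss.backward(); opt.step()

# --- Stage 2: fine-tune as a ranker on the scarce labeled data ---
# Pairwise logistic (RankNet-style) loss over (preferred, non-preferred) pairs.
score_head = nn.Linear(hidden, 1)
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(score_head.parameters()), lr=1e-4
)

pos = torch.randn(64, num_features)  # stand-in: higher-relevance items
neg = torch.randn(64, num_features)  # stand-in: lower-relevance items
for _ in range(50):
    margin = score_head(encoder(pos)) - score_head(encoder(neg))
    loss = nn.functional.softplus(-margin).mean()  # = -log sigmoid(margin)
    opt.zero_grad(); loss.backward(); opt.step()
```

The key design point the abstract relies on is that Stage 1 needs no relevance labels, so the encoder can absorb the abundant unlabeled data before the label-scarce ranking objective is applied.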