While deep learning (DL) models are state-of-the-art in text and image domains, they have not yet consistently outperformed Gradient Boosted Decision Trees (GBDTs) on tabular Learning-To-Rank (LTR) problems. Most of the recent performance gains attained by DL models in text and image tasks have relied on unsupervised pretraining, which exploits orders of magnitude more unlabeled data than labeled data. To the best of our knowledge, unsupervised pretraining has not been applied to the LTR problem, even though LTR settings often produce vast amounts of unlabeled data. In this work, we study whether unsupervised pretraining can improve LTR performance over GBDTs and other non-pretrained models. Using simple design choices, including SimCLR-Rank, our ranking-specific modification of SimCLR (an unsupervised pretraining method for images), we produce pretrained deep learning models that soundly outperform GBDTs (and other non-pretrained models) when labeled data is vastly outnumbered by unlabeled data. We also show that pretrained models often achieve significantly better robustness than non-pretrained models (GBDTs or DL models) when ranking outlier data.