LLP-Bench: A Large Scale Tabular Benchmark for Learning from Label Proportions

In the task of Learning from Label Proportions (LLP), a model is trained on groups (a.k.a. bags) of instances and their corresponding label proportions to predict labels for individual instances. LLP has been applied predominantly to two types of datasets: image and tabular. In image LLP, bags of fixed size are created by randomly sampling instances from an underlying dataset. Bags created this way are called random bags. Experiments on image LLP have mostly used random bags built from the CIFAR-* and MNIST datasets. Although it is a crucial task in privacy-sensitive applications, tabular LLP does not yet have an open, large-scale benchmark. One of the unique properties of tabular LLP is the ability to create feature bags, where all the instances in a bag have the same value for a given feature. Prior research has shown that feature bags are very common in practical, real-world applications [Chen et al. '23, Saket et al. '22]. In this paper, we address the lack of an open, large-scale tabular benchmark. First, we propose LLP-Bench, a suite of 70 LLP datasets (62 feature bag and 8 random bag datasets) created from the Criteo CTR prediction and the Criteo Sponsored Search Conversion Logs datasets, the former a classification dataset and the latter a regression dataset. These LLP datasets represent diverse ways in which bags can be constructed from underlying tabular data. To the best of our knowledge, LLP-Bench is the first large-scale tabular LLP benchmark with extensive diversity in its constituent datasets. Second, we propose four metrics that characterize and quantify the hardness of an LLP dataset. Using these four metrics, we present a deep analysis of the 62 feature bag datasets in LLP-Bench. Finally, we present the performance of 9 SOTA and popular tabular LLP techniques on all 62 feature bag datasets.
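
To make the distinction between feature bags and random bags concrete, the following is a minimal sketch of how each could be built from tabular rows and how a bag's label proportion is computed. It is an illustration only, not the construction procedure used for LLP-Bench: the toy rows, the column names (`site`, `device`, `label`), and the helper names (`make_feature_bags`, `make_random_bags`, `label_proportion`) are all assumptions introduced here.

```python
import random
from collections import defaultdict


def make_feature_bags(rows, feature):
    """Group rows sharing the same value of `feature` into one bag (hypothetical helper)."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[feature]].append(row)
    return dict(bags)


def make_random_bags(rows, bag_size, seed=0):
    """Shuffle the rows and cut them into consecutive bags of `bag_size` (hypothetical helper)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + bag_size] for i in range(0, len(shuffled), bag_size)]


def label_proportion(bag, label="label"):
    """Fraction of positive labels in a bag; this is the only supervision an LLP learner sees."""
    return sum(row[label] for row in bag) / len(bag)


# Toy tabular data with binary labels (made up for illustration).
rows = [
    {"site": "a", "device": "phone", "label": 1},
    {"site": "a", "device": "tablet", "label": 0},
    {"site": "b", "device": "phone", "label": 1},
    {"site": "b", "device": "phone", "label": 1},
    {"site": "a", "device": "phone", "label": 0},
    {"site": "c", "device": "tablet", "label": 0},
]

# Feature bags: every instance in a bag has the same value of "site".
feature_bags = make_feature_bags(rows, "site")
proportions = {key: label_proportion(bag) for key, bag in feature_bags.items()}

# Random bags: fixed-size bags drawn without regard to any feature.
random_bags = make_random_bags(rows, bag_size=2)
```

Note that feature bags generally have unequal sizes, determined by how often each feature value occurs, whereas random bags have a fixed size by construction.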