LLP-Bench: A Large Scale Tabular Benchmark for Learning from Label Proportions

In the task of Learning from Label Proportions (LLP), a model is trained on groups (a.k.a. bags) of instances and their corresponding label proportions to predict labels for individual instances. LLP has been applied predominantly to two types of datasets: image and tabular. In image LLP, bags of fixed size are created by randomly sampling instances from an underlying dataset. Bags created via this methodology are called random bags. Experimentation on image LLP has mostly used random bags on the CIFAR-* and MNIST datasets. Although tabular LLP is a crucial task in privacy-sensitive applications, it does not yet have an open, large-scale benchmark. One of the unique properties of tabular LLP is the ability to create feature bags, where all the instances in a bag have the same value for a given feature. Prior research has shown that feature bags are very common in practical, real-world applications [Chen et al. '23, Saket et al. '22]. In this paper, we address the lack of an open, large-scale tabular benchmark. First, we propose LLP-Bench, a suite of 56 LLP datasets (52 feature bag and 4 random bag datasets) created from the Criteo CTR prediction dataset, which consists of 45 million instances. The 56 datasets represent diverse ways in which bags can be constructed from underlying tabular data. To the best of our knowledge, LLP-Bench is the first large-scale tabular LLP benchmark with an extensive diversity in constituent datasets. Second, we propose four metrics that characterize and quantify the hardness of an LLP dataset. Using these four metrics, we present a deep analysis of the 56 datasets in LLP-Bench. Finally, we present the performance of 9 state-of-the-art and popular tabular LLP techniques on all 56 datasets. To the best of our knowledge, our study, consisting of more than 2500 experiments, is the most extensive study of popular tabular LLP techniques in the literature.
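To make the two bag types concrete, the following is a minimal sketch of how random bags and feature bags might be built from tabular rows, together with the label proportion of each bag. It is an illustration only and not the construction procedure used for LLP-Bench. The column names `C1` and `I1`, the toy rows, and the helper names `random_bags`, `feature_bags`, and `label_proportion` are all hypothetical. The sketch also assumes that random bags partition a shuffled dataset without replacement, which is one possible reading of "randomly sampling instances".

```python
import random
from collections import defaultdict


def random_bags(rows, bag_size, seed=0):
    """Shuffle the rows and cut them into fixed-size bags.

    Assumption: bags are disjoint and any leftover rows that do not
    fill a complete bag are dropped.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_full = len(shuffled) // bag_size
    return [shuffled[i * bag_size:(i + 1) * bag_size] for i in range(n_full)]


def feature_bags(rows, feature):
    """Group rows so that every row in a bag shares one value of `feature`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row)
    return dict(groups)


def label_proportion(bag, label_key="label"):
    """Fraction of positive (label == 1) instances in a bag."""
    return sum(row[label_key] for row in bag) / len(bag)


# Hypothetical toy data: one categorical column, one numeric column, a binary label.
rows = [
    {"C1": "a", "I1": 3, "label": 1},
    {"C1": "a", "I1": 5, "label": 0},
    {"C1": "b", "I1": 2, "label": 0},
    {"C1": "b", "I1": 7, "label": 0},
    {"C1": "c", "I1": 1, "label": 1},
    {"C1": "c", "I1": 4, "label": 1},
]

# Feature bags: one bag per distinct value of C1.
by_c1 = feature_bags(rows, "C1")
proportions = {value: label_proportion(bag) for value, bag in by_c1.items()}

# Random bags: three disjoint bags of two rows each.
rbags = random_bags(rows, bag_size=2)
random_proportions = [label_proportion(bag) for bag in rbags]
```

In the LLP setting, a learner would see only the bags' feature vectors and the per-bag proportions, never the individual `label` values. Feature bags tend to vary in size because they follow the distribution of the grouping feature, whereas random bags have a fixed size by construction.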
