Evaluating LLP Methods: Challenges and Approaches

Learning from Label Proportions (LLP) is an established machine learning problem with numerous real-world applications. In this setting, data items are grouped into bags, and the goal is to learn the labels of individual items, knowing only the features of the items and the proportions of labels in each bag. Despite its maturity, LLP has several unusual aspects that make benchmarking learning methods difficult. Fundamental complications arise from the existence of different LLP variants, i.e., different dependence structures that can exist between items, labels, and bags. Accordingly, the first challenge is algorithmic: generating variant-specific datasets that capture the diversity of dependence structures and bag characteristics. The second challenge is methodological: model selection, i.e., hyperparameter tuning, cannot easily follow the standard machine learning paradigm due to the nature of LLP. The final challenge is the proper evaluation of LLP solution methods across the various LLP variants. Prior work has given these issues very little consideration, and no general solutions have been proposed to date. To address these challenges, we develop methods capable of generating LLP datasets that meet the requirements of different variants. We use these methods to generate a collection of datasets spanning the spectrum of LLP problem characteristics, which can be used in future evaluation studies. Additionally, we develop guidelines for benchmarking LLP algorithms, covering both the model selection and evaluation steps. Finally, we illustrate the new methods and guidelines with an extensive benchmark of a set of well-known LLP algorithms. We show that the choice of the best algorithm depends critically on the LLP variant and the model selection method, demonstrating the need for our proposed approach.
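To make the LLP setting concrete, the following is a minimal sketch of how item labels can be turned into bag-level label proportions. It is not the dataset generation method developed in this work. It assumes bags are formed uniformly at random, which corresponds to only one possible dependence structure between items, labels, and bags. The helper names `make_random_bags` and `label_proportions` and the toy labels are hypothetical.

```python
import random
from collections import Counter


def make_random_bags(n_items, bag_size, seed=0):
    """Assign item indices to bags uniformly at random (hypothetical helper).

    Assumption: bag membership is independent of features and labels.
    """
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)
    # Consecutive slices of the shuffled indices form the bags.
    return [indices[i:i + bag_size] for i in range(0, n_items, bag_size)]


def label_proportions(bags, labels, classes):
    """Return, for each bag, the fraction of its items in each class."""
    proportions = []
    for bag in bags:
        counts = Counter(labels[i] for i in bag)
        proportions.append([counts[c] / len(bag) for c in classes])
    return proportions


# Toy data: the item labels are used here only to build the proportions.
# An LLP learner would see the item features and the per-bag proportions,
# never the individual labels.
labels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0]
bags = make_random_bags(len(labels), bag_size=4, seed=0)
props = label_proportions(bags, labels, classes=[0, 1])

for bag, prop in zip(bags, props):
    print(bag, prop)
```

Replacing the random assignment with one that depends on features or labels would yield a different LLP variant, which is why a single bag generation scheme is not sufficient for benchmarking.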
