PAC Learning Linear Thresholds from Label Proportions

Learning from label proportions (LLP) is a generalization of supervised learning in which the training data is available as sets, or bags, of feature-vectors (instances) along with the average instance-label of each bag. The goal is to train a good instance classifier. While most previous work on LLP has focused on training models on such data, the computational learnability of LLP was only recently explored by [Saket'21, Saket'22], who showed worst-case intractability of properly learning linear threshold functions (LTFs) from label proportions. However, their work did not rule out efficient algorithms for this problem on natural distributions. In this work we show that it is indeed possible to efficiently learn LTFs using LTFs (that is, properly) when given access to random bags of some label proportion in which feature-vectors are, conditioned on their labels, independently sampled from a Gaussian distribution $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Our work shows that a certain matrix, formed using the covariances of the differences of feature-vectors sampled from the bags with and without replacement, necessarily has its principal component, after a transformation, in the direction of the normal vector of the LTF. Our algorithm estimates the means and covariance matrices from efficiently sampled bags, using subgaussian concentration bounds that we show are applicable, in order to approximate the normal direction. Using this in conjunction with novel generalization error bounds in the bag setting, we show that a low-error hypothesis LTF can be identified. For some special cases of the $N(\mathbf{0}, \mathbf{I})$ distribution we provide a simpler algorithm based on mean estimation. We include an experimental evaluation of our learning algorithms, along with a comparison against those of [Saket'21, Saket'22] and against random LTFs, demonstrating the effectiveness of our techniques.
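The covariance idea can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the algorithm analysed in the paper: the function names (`sample_bags`, `estimate_normal`, `demo`), the bag size `q`, the number `k` of positive instances per bag, and all numeric values are hypothetical. In particular, the "with replacement" pair is approximated here by taking one instance from each of two independently drawn bags, which is a choice made for this sketch. Under that choice, the second-moment matrix of within-bag differences exceeds the cross-bag one by a rank-one term along the difference of the class-conditional means, so the top generalized eigenvector should align with the LTF normal up to sign.

```python
import numpy as np

def sample_bags(rng, mu, Sigma, r, c, q, k, n_bags):
    """Draw bags with k positive and q-k negative instances (labels: r.x + c > 0)."""
    d = len(mu)
    need_pos, need_neg = n_bags * k, n_bags * (q - k)
    pos, neg, n_pos, n_neg = [], [], 0, 0
    # Rejection sampling: draw from N(mu, Sigma) and split by label.
    while n_pos < need_pos or n_neg < need_neg:
        x = rng.multivariate_normal(mu, Sigma, size=4 * n_bags * q)
        y = x @ r + c > 0
        pos.append(x[y]); neg.append(x[~y])
        n_pos += int(y.sum()); n_neg += int((~y).sum())
    pos = np.concatenate(pos)[:need_pos].reshape(n_bags, k, d)
    neg = np.concatenate(neg)[:need_neg].reshape(n_bags, q - k, d)
    return np.concatenate([pos, neg], axis=1)  # shape (n_bags, q, d)

def estimate_normal(bags, rng):
    """Estimate the LTF normal direction (up to sign) from unlabeled bag instances."""
    n, q, _ = bags.shape
    rows = np.arange(n)
    # Pair drawn without replacement: two distinct positions in the same bag.
    i = rng.integers(q, size=n)
    j = (i + rng.integers(1, q, size=n)) % q
    diff_b = bags[rows, i] - bags[rows, j]
    # Sketch assumption for "with replacement": one instance from each of
    # two different, independently drawn bags (bag t and bag t-1).
    other = np.roll(rows, 1)
    diff_d = bags[rows, rng.integers(q, size=n)] - bags[other, rng.integers(q, size=n)]
    # Differences have zero mean by symmetry, so second moments estimate covariances.
    sigma_b = diff_b.T @ diff_b / n
    sigma_d = diff_d.T @ diff_d / n
    # Maximize v' sigma_b v / v' sigma_d v by whitening with a Cholesky factor of sigma_d.
    L = np.linalg.cholesky(sigma_d)
    M = np.linalg.solve(L, np.linalg.solve(L, sigma_b).T)  # L^{-1} sigma_b L^{-T}
    _, vecs = np.linalg.eigh((M + M.T) / 2)
    v = np.linalg.solve(L.T, vecs[:, -1])  # undo the whitening transformation
    return v / np.linalg.norm(v)

def demo(seed=0, n_bags=20000, q=3, k=1):
    """Return |cosine| between the estimated and the true normal (illustrative values)."""
    rng = np.random.default_rng(seed)
    mu = np.array([1.0, -1.0, 0.5, 0.0])
    Sigma = np.array([[2.0, 0.5, 0.0, 0.0],
                      [0.5, 1.0, 0.3, 0.0],
                      [0.0, 0.3, 1.5, 0.2],
                      [0.0, 0.0, 0.2, 0.8]])
    r, c = np.array([1.0, 2.0, -1.0, 0.5]), 1.5
    bags = sample_bags(rng, mu, Sigma, r, c, q, k, n_bags)
    v = estimate_normal(bags, rng)
    return float(abs(v @ r) / np.linalg.norm(r))

if __name__ == "__main__":
    print("abs cosine with true normal:", demo())
```

The sketch recovers only the direction of the normal, up to sign. Choosing the sign and the threshold, and certifying low error through bag-level generalization bounds, are separate steps that it does not attempt.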