This work studies algorithms for learning from aggregate responses. We focus on the construction of aggregation sets (called bags in the literature) for event-level loss functions. We prove for linear regression and generalized linear models (GLMs) that the optimal bagging problem reduces to one-dimensional size-constrained $k$-means clustering. Further, we theoretically quantify the advantage of using curated bags over random bags. We then propose the PriorBoost algorithm, which adaptively forms bags of samples that are increasingly homogeneous with respect to (unobserved) individual responses to improve model quality. We study label differential privacy for aggregate learning, and we also provide extensive experiments showing that PriorBoost regularly achieves optimal model quality for event-level predictions, in stark contrast to non-adaptive algorithms.
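The reduction to one-dimensional size-constrained $k$-means can be illustrated with a minimal sketch. Since optimal one-dimensional $k$-means clusters are contiguous intervals in sorted order, equal-size bags can be formed by sorting samples on a prior score and grouping consecutive runs. The function name `form_bags` and the use of prior model predictions as the one-dimensional values are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def form_bags(prior_scores, bag_size):
    """Sort samples by a one-dimensional prior score and form
    contiguous equal-size bags.

    Illustrative sketch: optimal 1-D k-means clusters are contiguous
    in sorted order, so consecutive runs of `bag_size` sorted samples
    give bags that are homogeneous in the prior score.
    """
    order = np.argsort(prior_scores)          # indices sorted by score
    n = len(order)
    return [order[i:i + bag_size] for i in range(0, n, bag_size)]

# Example: six samples with scalar prior predictions.
scores = np.array([0.9, 0.1, 0.4, 0.8, 0.2, 0.5])
bags = form_bags(scores, bag_size=2)
# The first bag holds the two samples with the smallest scores.
```

An adaptive scheme in the spirit of PriorBoost would recompute the scores from the current model at each round and re-form the bags, so that bags grow increasingly homogeneous in the (unobserved) individual responses.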