For many learning problems, one may not have access to fine-grained label information; e.g., an image can be labeled as husky, dog, or even animal depending on the expertise of the annotator. In this work, we formalize these settings and study the problem of learning from such coarse data. Instead of observing the actual labels from a set $\mathcal{Z}$, we observe coarse labels corresponding to a partition of $\mathcal{Z}$ (or a mixture of partitions). Our main algorithmic result is that essentially any problem learnable from fine-grained labels can also be learned efficiently when the coarse data are sufficiently informative. We obtain our result through a generic reduction for answering Statistical Queries (SQ) over fine-grained labels given only coarse labels. The number of coarse labels required depends polynomially on the information distortion due to coarsening and on the number of fine labels $|\mathcal{Z}|$. We also investigate the case of (infinitely many) real-valued labels, focusing on a central problem in censored and truncated statistics: Gaussian mean estimation from coarse data. We provide an efficient algorithm when the sets in the partition are convex and establish that the problem is NP-hard even for very simple non-convex sets.
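
To make the Gaussian setting concrete, the following is a minimal sketch of mean estimation from coarse data in one dimension. It is not the algorithm of this work: it assumes unit variance, a fixed partition of the real line into intervals (the simplest convex sets), and plain maximum likelihood solved by ternary search. The names `coarsen`, `interval_prob`, `log_likelihood`, and `estimate_mean`, the cut points, and the search range are all illustrative assumptions. The search relies on the log-likelihood being concave in the mean, which holds for intervals because the Gaussian probability of a convex set is log-concave in the shift.

```python
import math
import random
from collections import Counter

SQRT2 = math.sqrt(2.0)

def coarsen(x, cuts):
    """Map a real label x to the interval (lo, hi) of the partition containing it.
    `cuts` is a sorted list of cut points (an illustrative, fixed partition)."""
    lo = -math.inf
    for c in cuts:
        if x < c:
            return (lo, c)
        lo = c
    return (lo, math.inf)

def interval_prob(lo, hi, mu):
    """P(lo <= X < hi) for X ~ N(mu, 1), computed with erfc for stability."""
    a, b = lo - mu, hi - mu
    if a > 0:
        # Interval lies in the upper tail: difference of survival functions.
        return 0.5 * (math.erfc(a / SQRT2) - math.erfc(b / SQRT2))
    # Otherwise: difference of CDFs, written via erfc.
    return 0.5 * (math.erfc(-b / SQRT2) - math.erfc(-a / SQRT2))

def log_likelihood(mu, counts):
    """Log-likelihood of the observed intervals; `counts` maps interval -> frequency."""
    total = 0.0
    for (lo, hi), n in counts.items():
        p = interval_prob(lo, hi, mu)
        total += n * math.log(max(p, 1e-300))  # guard against log(0)
    return total

def estimate_mean(counts, lo=-10.0, hi=10.0, iters=100):
    """Maximize the concave log-likelihood over [lo, hi] by ternary search."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if log_likelihood(m1, counts) < log_likelihood(m2, counts):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

# Usage: the learner only ever sees which interval each sample fell into.
random.seed(0)
true_mu = 0.7                      # hidden parameter (assumption for the demo)
cuts = [-1.0, 0.0, 1.0, 2.0]       # partition of the real line into 5 intervals
coarse = Counter(coarsen(random.gauss(true_mu, 1.0), cuts) for _ in range(5000))
mu_hat = estimate_mean(coarse)
print(f"estimated mean from coarse labels: {mu_hat:.3f}")
```

With a non-convex partition (for example, a set formed by the union of two disjoint intervals), the log-likelihood can have several local maxima, so a search of this kind would no longer be justified; this is consistent with the hardness result stated above.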