Coarse data arise when learners observe only partial information about samples; namely, a set containing the sample rather than its exact value. This occurs naturally through measurement rounding, sensor limitations, and lag in economic systems. We study Gaussian mean estimation from coarse data, where each true sample $x$ is drawn from a $d$-dimensional Gaussian distribution with identity covariance, but is revealed only through the set of a partition containing $x$. When the coarse samples, roughly speaking, have ``low'' information, the mean cannot be uniquely recovered from observed samples (i.e., the problem is not identifiable). Recent work by Fotakis, Kalavasis, Kontonis, and Tzamos [FKKT21] established that sample-efficient mean estimation is possible when the unknown mean is identifiable and the partition consists of only convex sets. Moreover, they showed that without convexity, mean estimation becomes NP-hard. However, two fundamental questions remained open: (1) When is the mean identifiable under convex partitions? (2) Is computationally efficient estimation possible under identifiability and convex partitions? This work resolves both questions. [...]
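To make the observation model concrete, the following is a minimal sketch, not the algorithm of this work or of [FKKT21]. It assumes one particular convex partition, an axis-aligned grid of cubes with side `width` (as produced by measurement rounding), and shows how a coarse sample and its likelihood under $N(\mu, I)$ could be computed. The names `coarsen`, `cell_log_prob`, `coarse_log_likelihood`, and `sample_coarse` are hypothetical and introduced only for illustration.

```python
import math
import random


def normal_cdf(z: float) -> float:
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def coarsen(x, width=1.0):
    """Return the index of the grid cell (a convex set) that contains x.

    Assumption: the partition is a grid of axis-aligned cubes of side `width`.
    """
    return tuple(math.floor(xi / width) for xi in x)


def cell_log_prob(mu, cell, width=1.0):
    """Log-probability that a draw from N(mu, I) lands in the given cell.

    With identity covariance the coordinates are independent, so the
    probability of a box is a product of one-dimensional interval masses.
    """
    total = 0.0
    for m, k in zip(mu, cell):
        lo, hi = k * width, (k + 1) * width
        p = normal_cdf(hi - m) - normal_cdf(lo - m)
        total += math.log(max(p, 1e-300))  # guard against log(0)
    return total


def coarse_log_likelihood(mu, cells, width=1.0):
    """Log-likelihood of a candidate mean given only the observed cells."""
    return sum(cell_log_prob(mu, c, width) for c in cells)


def sample_coarse(mu, n, width=1.0, seed=0):
    """Draw n samples from N(mu, I) and reveal only the cell of each one."""
    rng = random.Random(seed)
    return [coarsen([rng.gauss(m, 1.0) for m in mu], width) for _ in range(n)]


if __name__ == "__main__":
    true_mu = [0.3, -1.2]
    cells = sample_coarse(true_mu, n=2000)
    print(coarse_log_likelihood(true_mu, cells))
    print(coarse_log_likelihood([3.3, 1.8], cells))
```

Under this grid partition, the coarse log-likelihood is typically much higher at the true mean than at a distant candidate. By contrast, if the cells depended only on the first coordinate (parallel slabs), the likelihood would be constant in the second coordinate of $\mu$, which is one simple instance of the non-identifiability described above.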