We study the density estimation problem defined as follows: given $k$ distributions $p_1, \ldots, p_k$ over a discrete domain $[n]$, as well as a collection of samples drawn from a ``query'' distribution $q$ over $[n]$, output a $p_i$ that is ``close'' to $q$. Recently, \cite{aamand2023data} gave the first and only known result that achieves sublinear bounds in {\em both} the sample complexity and the query time while preserving polynomial data structure space. However, their improvement over linear samples and time is only by subpolynomial factors. Our main result is a lower bound showing that, for a broad class of data structures, their bounds cannot be significantly improved. In particular, if an algorithm uses $O(n/\log^c k)$ samples for some constant $c>0$ and polynomial space, then the query time of the data structure must be at least $k^{1-O(1)/\log \log k}$, i.e., close to linear in the number of distributions $k$. This is a novel \emph{statistical-computational} trade-off for density estimation, demonstrating that any data structure must either use close to a linear number of samples or take close to linear query time. The lower bound holds even in the realizable case, where $q=p_i$ for some $i$, and even when the distributions are flat (specifically, when every distribution is uniform over half of the domain $[n]$). We also give a simple data structure for our lower bound instance with asymptotically matching upper bounds. Experiments show that the data structure is quite efficient in practice.
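To make the problem setup concrete, the following is a minimal sketch of the realizable, flat instance described above, together with a naive linear-scan baseline. It is not the data structure proposed in this work, nor that of \cite{aamand2023data}. The function names (`make_flat_instance`, `sample_from`, `linear_scan`) and the choice of uniformly random supports are assumptions made purely for illustration.

```python
import random


def make_flat_instance(n, k, rng):
    """Hypothetical generator: k supports, each a random half of {0, ..., n-1}.

    Each support stands for a flat distribution p_i that is uniform over it.
    """
    domain = list(range(n))
    return [frozenset(rng.sample(domain, n // 2)) for _ in range(k)]


def sample_from(support, s, rng):
    """Draw s i.i.d. samples from the uniform distribution over `support`."""
    elems = sorted(support)
    return [rng.choice(elems) for _ in range(s)]


def linear_scan(supports, samples):
    """Naive baseline: return every index i whose support contains all samples.

    In the realizable case (q = p_i for some i) the true index is always
    returned. The scan touches all k distributions, so it takes O(k * s) time
    for s samples, i.e., linear in k.
    """
    return [
        i for i, supp in enumerate(supports)
        if all(x in supp for x in samples)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    supports = make_flat_instance(n=64, k=16, rng=rng)
    samples = sample_from(supports[3], s=40, rng=rng)
    print(linear_scan(supports, samples))  # candidates consistent with the samples
```

The baseline illustrates the two resources in the trade-off: the number of samples `s` drawn from $q$ and the time spent per query, which here grows linearly with $k$. With too few samples, several candidates may remain consistent, so the function returns all of them rather than a single index.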