Motivated by applications in statistics and machine learning, we consider the problem of unmixing convex combinations of nonparametric densities. Suppose we observe $n$ groups of samples, where the $i$th group consists of $N_i$ independent samples from a $d$-variate density $f_i(x)=\sum_{k=1}^K \pi_i(k)g_k(x)$. Here, each $g_k(x)$ is a nonparametric density, and each $\pi_i$ is a $K$-dimensional mixed membership vector. We aim to estimate $g_1(x), \ldots, g_K(x)$. This problem generalizes topic modeling from discrete to continuous variables and has applications to large language models (LLMs) with word embeddings. We propose an estimator that modifies the classical kernel density estimator by assigning group-specific weights, which are computed by topic modeling on histogram vectors and de-biased by U-statistics. For any $\beta>0$, assuming that each $g_k(x)$ belongs to the Nikol'skii class with smoothness parameter $\beta$, we show that the sum of integrated squared errors of the constructed estimators has a convergence rate that depends on $n$, $K$, $d$, and the per-group sample size $N$. We also provide a matching lower bound, which suggests that our estimator is rate-optimal.
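To make the observation model concrete, the following is a minimal simulation sketch of the grouped sampling scheme $f_i(x)=\sum_{k=1}^K \pi_i(k)g_k(x)$. It is an illustration only and not the estimator proposed in the paper: the function name `sample_groups`, the Gaussian choice for each $g_k$, the Dirichlet draw for each $\pi_i$, and the common group size `N` are all assumptions made here for the example.

```python
import numpy as np


def sample_groups(n, N, K, d, seed=0):
    """Simulate n groups of N samples from f_i = sum_k pi_i(k) g_k.

    Assumptions (not from the paper): each g_k is a unit-variance
    Gaussian with a random center, and each pi_i is Dirichlet(1, ..., 1).
    """
    rng = np.random.default_rng(seed)
    # Hypothetical base densities g_k, represented by their centers.
    centers = rng.normal(scale=3.0, size=(K, d))
    # Mixed membership vectors pi_i: nonnegative rows that sum to one.
    pi = rng.dirichlet(np.ones(K), size=n)  # shape (n, K)
    groups = []
    for i in range(n):
        # Draw a latent component label for every sample in group i.
        z = rng.choice(K, size=N, p=pi[i])
        # Sample from the selected component g_{z}.
        x = centers[z] + rng.normal(size=(N, d))
        groups.append(x)
    return pi, centers, groups


pi, centers, groups = sample_groups(n=5, N=20, K=3, d=2)
```

Only `groups` would be observed in practice; `pi` and `centers` are returned here so that a simulation can compare estimates against the truth.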