Kernel Density Estimation by Spectral Decomposition: Data-Driven Tapering and Superposition

Kernel density estimation depends largely on one choice, the smoothing bandwidth. We treat bandwidth selection and density estimation in the characteristic-function domain, where the cyclic group-averaged covariance of the binned data has the squared modulus of the empirical characteristic function as its spectrum: the squared modulus of the true characteristic function sits on a sampling-noise floor of $1/n$, and the bandwidth is the spectral cutoff where the two meet. Several methods follow. An automatic selector strips the floor and minimizes a frequency-domain error criterion, matching the rule of thumb on smooth densities and approaching the best fixed bandwidth on multimodal ones. An adaptive estimator generalizes the fixed kernel to the per-frequency optimal Wiener taper, matching or surpassing the best fixed bandwidth on most standard densities, including sharply peaked and comb-like cases where fixed bandwidths fail; deconvolution under known measurement error follows in the same domain. Because the Wiener estimator resolves sharp structure but does not fit smooth bases as economically as a mixture, a Gaussian mixture is combined with it in two ways: a piecewise partition and, as the default, a superposition of a smooth base and a band-limited residual. A data-driven floor read from the spectrum replaces the assumed $1/n$ floor and stays robust on heaped and rounded data. On the Marron-Wand benchmark, scored by exact integrated squared error, the advantage emerges with sample size, reflecting a bias-variance tradeoff: the spectral estimators carry low bias but pay in variance, so cross-validation leads at $n=100$ while the Wiener filter and superposition take the top two ranks at $n=5000$. The methods are validated on six real datasets (CRSP returns, NHANES self-reports, CMS dimuon and SDSS spectra, a random-beacon stream, and UNSW-NB15 traffic) and on a synthetic-data quality check. All experiments are reproducible.
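The floor-stripping idea can be illustrated with a minimal sketch, which is not the authors' implementation. It relies on the standard identity $E|\hat\varphi_n(t)|^2 = 1/n + (1 - 1/n)\,|\varphi(t)|^2$ for the empirical characteristic function $\hat\varphi_n$ of $n$ i.i.d. draws. The sketch inverts that identity to estimate $|\varphi(t)|^2$, forms a plug-in per-frequency taper, and cuts the spectrum at the first frequency where the observed power reaches the floor. The function name `wiener_density`, the parameters `t_max` and `num_t`, the first-crossing cutoff rule, and the direct (unbinned, non-FFT) evaluation are all assumptions made for illustration.

```python
import numpy as np

def wiener_density(samples, x, t_max=40.0, num_t=1000):
    """Hypothetical sketch: density estimate from a floor-stripped spectral taper."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    t = np.linspace(0.0, t_max, num_t)          # frequencies t >= 0 only
    dt = t[1] - t[0]

    # Empirical characteristic function phi_n(t) = mean(exp(i t X)).
    phi_n = np.exp(1j * np.outer(t, samples)).mean(axis=1)
    power = np.abs(phi_n) ** 2

    # Invert E|phi_n|^2 = 1/n + (1 - 1/n)|phi|^2 and clip at zero.
    signal = np.clip((n * power - 1.0) / (n - 1.0), 0.0, None)

    # Plug-in taper: estimated signal power over observed power.
    taper = signal / np.maximum(power, 1e-300)

    # Assumed cutoff rule: drop everything past the first frequency
    # where the observed power has fallen to the 1/n floor.
    below = np.flatnonzero(signal <= 0.0)
    if below.size:
        taper[below[0]:] = 0.0

    # Invert using phi(-t) = conj(phi(t)):
    # f(x) = (1/pi) * integral_0^inf Re[exp(-i t x) * taper(t) * phi_n(t)] dt
    integrand = np.real(np.exp(-1j * np.outer(x, t)) * (taper * phi_n))
    weights = np.full(num_t, dt)                # trapezoid weights
    weights[0] = weights[-1] = dt / 2.0
    return integrand @ weights / np.pi

# Usage on synthetic standard-normal data (illustrative only).
rng = np.random.default_rng(0)
samples = rng.standard_normal(1000)
x = np.linspace(-5.0, 5.0, 401)
f = wiener_density(samples, x)
```

The frequency grid here is in absolute units, so `t_max` would need rescaling for data whose spread is far from one. Because the estimate is band-limited, it can dip slightly below zero in the tails.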
