Measurement-constrained problems frequently arise in modern applications such as electronic health record studies. In such problems, even when large datasets are available, collecting labeled data can be costly or time-consuming, so only a small portion of the data can be labeled within a given budget. This raises a critical question: which data points are most beneficial to label under the budget constraint? We study this question in the context of estimating an optimal individualized threshold under a measurement-constrained M-estimation framework. In particular, our goal is to estimate a high-dimensional parameter $\theta$ in a linear threshold $\theta^\top Z$ for a continuous variable $X$ such that the discrepancy between whether $X$ exceeds the threshold $\theta^\top Z$ and a binary outcome $Y$ is minimized. In the measurement-constrained setting, we propose a novel $K$-step active subsampling algorithm to estimate $\theta$, which iteratively samples the most informative observations in the dataset and solves a regularized M-estimator. Our theoretical analysis reveals a sharp phase transition phenomenon with respect to $\beta$, the smoothness of the conditional density of $X$ given $Y$ and $Z$. Please see the paper for the full abstract.
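To make the estimation target concrete, the following is a minimal sketch of the discrepancy between the thresholding rule "does $X$ exceed $\theta^\top Z$" and the binary outcome $Y$, computed as an empirical disagreement rate. It assumes $Y$ is coded in {0, 1} and uses an unweighted 0-1 loss; the abstract does not specify the exact loss, so this is only one plausible reading. The helper `closest_to_threshold` is a hypothetical illustration of what "informative" could mean (points near the current threshold); it is not the selection rule from the paper, which the abstract does not describe.

```python
import numpy as np

def threshold_discrepancy(theta, X, Z, Y):
    """Fraction of observations where 1{X > theta^T Z} disagrees with Y.

    Assumption: Y is coded in {0, 1}; the loss is an unweighted 0-1 loss.
    """
    exceeds = X > Z @ theta
    return float(np.mean(exceeds.astype(int) != Y))

def closest_to_threshold(theta, X, Z, budget):
    """Hypothetical heuristic (not from the paper): indices of the `budget`
    observations whose X lies closest to the current threshold theta^T Z."""
    margin = np.abs(X - Z @ theta)
    return np.argsort(margin, kind="stable")[:budget]

# Toy data: 4 observations, Z has an intercept column and one covariate.
Z = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
X = np.array([0.75, 2.5, 2.9, 10.0])
Y = np.array([0, 1, 1, 1])
theta = np.array([1.0, 1.0])  # thresholds are Z @ theta = [1, 2, 3, 4]

loss = threshold_discrepancy(theta, X, Z, Y)     # one disagreement out of four
picked = closest_to_threshold(theta, X, Z, 2)    # the two smallest |X - theta^T Z|
print(loss, picked)
```

Note that labels $Y$ are needed only to evaluate the discrepancy, while the margin $|X - \theta^\top Z|$ can be computed for every observation without labels, which is what makes a label budget meaningful in this setting.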