Covariate Selection for Joint Latent Space Modeling of Sparse Network Data

Network data are increasingly common in the social sciences and infectious disease epidemiology. Analyses often link network structure to node-level covariates, but existing methods falter with sparse networks and high-dimensional node features. We propose a joint latent space modeling framework for sparse networks with high-dimensional binary node covariates that performs covariate selection while accounting for uncertainty in estimated latent positions. Building on joint latent space models that couple edges and node variables through shared latent positions, we introduce a group lasso screening step and incorporate a measurement-error-aware stabilization term to mitigate bias from using estimated latent positions as predictors. We establish prediction error rates for the covariate component both when latent positions are treated as observed and when they are estimated with bounded error; under uniform control across $q$ covariates and $n$ nodes, the rate is of order $O(\log q / n)$ up to an additional term due to latent position estimation error. Our method addresses three challenges: (1) incorporating information from isolated nodes, which are common in sparse networks but often ignored; (2) selecting relevant covariates from high-dimensional spaces; and (3) accounting for uncertainty in estimated latent positions. Simulations show predictive performance remains stable as covariate sparsity grows, while naive approaches degrade. We illustrate how the method can support efficient study design using household social networks from 75 Indian villages, where an emulated pilot study screens a large covariate battery and substantially reduces required subsequent data collection without sacrificing network predictive accuracy.
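To make the screening idea concrete, the following is a minimal sketch of a group lasso screen that regresses each binary covariate on latent positions and keeps only covariates whose coefficient group is nonzero. It is an illustration under stated assumptions, not the authors' estimator: the latent positions `Z` are treated as observed, the measurement-error-aware stabilization term is omitted, and the function name `group_lasso_screen`, the penalty level `lam`, and the synthetic data are all hypothetical choices made for this example.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def group_lasso_screen(Z, Y, lam, n_iter=500):
    """Proximal-gradient group lasso for q logistic regressions (illustrative).

    Z: (n, d) latent positions, treated here as observed (an assumption).
    Y: (n, q) binary node covariates.
    Each covariate j has an unpenalized intercept b0[j] and a penalized
    coefficient group B[:, j]; covariates with B[:, j] == 0 are screened out.
    """
    n, d = Z.shape
    q = Y.shape[1]
    b0 = np.zeros(q)
    B = np.zeros((d, q))
    # Step size 1/L, with L a Lipschitz bound for the averaged logistic loss.
    X = np.hstack([np.ones((n, 1)), Z])
    step = 4.0 * n / np.linalg.norm(X, 2) ** 2
    for _ in range(n_iter):
        R = (sigmoid(b0 + Z @ B) - Y) / n      # scaled residuals, shape (n, q)
        b0 = b0 - step * R.sum(axis=0)         # gradient step on intercepts
        B = B - step * (Z.T @ R)               # gradient step on coefficients
        # Group soft-thresholding: shrink each column's Euclidean norm.
        norms = np.linalg.norm(B, axis=0)
        B = B * np.maximum(0.0, 1.0 - step * lam / np.maximum(norms, 1e-12))
    selected = np.linalg.norm(B, axis=0) > 0
    return b0, B, selected

# Synthetic check: covariates 0 and 1 depend on Z, the remaining four are noise.
rng = np.random.default_rng(0)
n, d, q = 400, 2, 6
Z = rng.normal(size=(n, d))
Y = rng.binomial(1, 0.5, size=(n, q))
signal = np.array([[2.0, 0.0], [-2.0, 2.5]])   # (d, 2) true coefficients
Y[:, :2] = rng.binomial(1, sigmoid(Z @ signal))
b0, B, selected = group_lasso_screen(Z, Y, lam=0.12)
```

In this sketch, `selected` is a boolean mask over the `q` covariates. In practice the latent positions would be estimated from the network, and the penalty level would be tuned (for example by cross-validation) rather than fixed in advance.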
