In high-dimensional data analysis, such as financial index tracking or biomedical applications, it is crucial to select the few relevant variables while controlling the false discovery rate (FDR). In these applications, the variables (e.g., stock returns) are often strongly dependent, which can undermine the FDR control property of existing methods such as the model-X knockoff method or the T-Rex selector. To address this issue, we extend the T-Rex framework to accommodate overlapping groups of highly correlated variables. This is achieved by integrating into the framework a nearest neighbors penalization mechanism that provably controls the FDR at the user-defined target level. A real-world example of sparse index tracking demonstrates the proposed method's ability to accurately track the S&P 500 index over the past 20 years based on a small number of stocks. An open-source implementation is provided within the R package TRexSelector on CRAN.
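
For reference, the FDR control property mentioned above can be stated in its standard form. The notation is an assumption introduced here for illustration and is not taken from the original text: $\mathcal{A}$ denotes the set of truly relevant variables, $\widehat{\mathcal{A}}$ the set of selected variables, and $\alpha \in [0, 1]$ the user-defined target level.

```latex
% Standard definition of the false discovery rate (notation assumed, see above).
% The expectation is over the randomness in the data and in the selection procedure.
\[
  \mathrm{FDR}
  \;=\;
  \mathbb{E}\!\left[
    \frac{\bigl|\widehat{\mathcal{A}} \setminus \mathcal{A}\bigr|}
         {\max\bigl\{1,\, \bigl|\widehat{\mathcal{A}}\bigr|\bigr\}}
  \right]
  \;\le\; \alpha .
\]
```

The numerator counts the selected variables that are not truly relevant (false discoveries), and the maximum in the denominator sets the ratio to zero when nothing is selected.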