In high-dimensional data analysis, such as financial index tracking or biomedical applications, it is crucial to select the few relevant variables while controlling the false discovery rate (FDR). In these applications, the variables (e.g., stock returns) are often strongly dependent, which can undermine the FDR control property of existing methods such as the model-X knockoff method or the T-Rex selector. To address this issue, we extend the T-Rex framework to accommodate overlapping groups of highly correlated variables. This is achieved by integrating a nearest neighbors penalization mechanism into the framework, and the resulting method provably controls the FDR at the user-defined target level. A real-world example of sparse index tracking demonstrates the proposed method's ability to accurately track the S&P 500 index over the past 20 years using a small number of stocks. An open-source implementation is provided within the R package TRexSelector on CRAN.