We present a new data analysis perspective for determining variable importance independently of the underlying learning task. Traditionally, variable selection is treated as an important step in supervised learning, for both classification and regression problems. Variable selection also becomes critical when data collection and storage are costly, as in remote sensing. We therefore propose a methodology that selects important variables by first building a dependency network among all variables and then ranking the variables (i.e., the network nodes) by graph centrality measures. Selecting the top-$n$ variables according to a preferred centrality measure yields a strong candidate subset of variables for further learning tasks, e.g., clustering. We present our tool as an app built with Shiny, an environment for developing user-friendly interfaces. For comparison, we also extend the user interface to cover two well-known unsupervised variable selection methods from the literature.
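The pipeline described in the abstract (build a dependency network over the variables, rank the nodes by a centrality measure, keep the top-$n$) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the tool itself: absolute Pearson correlation stands in for the dependency measure, an edge is kept when it reaches a hypothetical `threshold` of 0.5, degree centrality stands in for the "preferred centrality measure", and the function names and toy data are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def build_dependency_network(data, threshold=0.5):
    """Connect two variables when |correlation| >= threshold (assumed dependency rule)."""
    names = list(data)
    network = {name: {} for name in names}
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            weight = abs(pearson(data[u], data[v]))
            if weight >= threshold:
                network[u][v] = weight
                network[v][u] = weight
    return network


def degree_centrality(network):
    """Fraction of the other variables each variable is connected to."""
    n = len(network)
    if n < 2:
        return {node: 0.0 for node in network}
    return {node: len(neighbours) / (n - 1) for node, neighbours in network.items()}


def top_n_variables(data, n, threshold=0.5):
    """Rank variables by degree centrality (ties broken by name) and keep the first n."""
    scores = degree_centrality(build_dependency_network(data, threshold))
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:n]


# Toy data: h = x + y, so h depends on both x and y; z is unrelated to the rest.
toy_data = {
    "h": [1, 1, -1, -1],
    "x": [1, 0, -1, 0],
    "y": [0, 1, 0, -1],
    "z": [1, -1, 1, -1],
}

print(top_n_variables(toy_data, 2))  # ['h', 'x']
```

In this toy case `h` is linked to both `x` and `y`, so it receives the highest centrality, while the isolated `z` ranks last. No class labels or target variable enter the computation, which matches the task-independent framing above; other dependency measures or centrality measures could be substituted without changing the overall structure.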