Tabular datasets consist of distinct, discrete features whose individual and relative importance to the target varies. Combinations of features can be more predictive and more meaningful than the contributions of individual features alone. R's linear mixed-effects modeling library lets users specify such feature interactions in the model design. However, with many features and candidate interactions to choose from, the number of possible models grows exponentially, which makes model selection difficult. We aim to automate model selection for prediction on tabular datasets, incorporating feature interactions while keeping computational cost low. The framework includes two distinct approaches to feature selection: a Priority-based Random Grid Search and a Greedy Search method. The Priority-based approach explores feature combinations efficiently, using prior probabilities to guide the search. The Greedy method builds the solution iteratively, adding or removing features based on their impact. Experiments on synthetic data demonstrate that the framework can effectively capture predictive feature combinations.
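To make the Greedy idea concrete, the following is a minimal sketch of an add-or-remove search over main effects and pairwise interactions. It is an illustration under stated assumptions, not the framework's actual implementation: the names `candidate_terms`, `greedy_select`, and `toy_score` are hypothetical, and the scoring function is a stand-in for whatever fit criterion a real run would compute (for example, a negated information criterion of a fitted mixed-effects model).

```python
from itertools import combinations

def candidate_terms(features):
    """Main effects plus all pairwise interactions, each as a tuple of names."""
    mains = [(f,) for f in features]
    pairs = list(combinations(features, 2))
    return mains + pairs

def greedy_select(terms, score, max_steps=50):
    """Greedy add-or-remove search over model terms (hypothetical helper).

    `score` maps a frozenset of terms to a number, higher is better.
    In practice it would wrap a model fit; here it is an assumption.
    """
    selected = frozenset()
    best = score(selected)
    for _ in range(max_steps):
        # Neighbours: add one unused term, or drop one selected term.
        moves = [selected | {t} for t in terms if t not in selected]
        moves += [selected - {t} for t in selected]
        if not moves:
            break
        top = max(moves, key=score)
        top_score = score(top)
        if top_score <= best:
            break  # no single addition or removal improves the score
        selected, best = top, top_score
    return selected, best

# Toy stand-in for a model-fit criterion: reward terms that belong to a
# known "true" model and charge a fixed complexity penalty per term.
TRUE_TERMS = {("x1",), ("x2", "x3")}

def toy_score(model):
    hits = len(model & TRUE_TERMS)
    return 2.0 * hits - 0.5 * len(model)

terms = candidate_terms(["x1", "x2", "x3"])
selected, best = greedy_select(terms, toy_score)
```

With this toy score the search recovers the main effect `x1` and the interaction between `x2` and `x3`, then stops because every further addition or removal lowers the score. Each step evaluates one candidate model per term, so the cost per step is linear in the number of terms rather than exponential, which is the trade-off that motivates a greedy strategy.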