Universal Feature Selection for Simultaneous Interpretability of Multitask Datasets

Extracting meaningful features from complex, high-dimensional datasets across scientific domains remains challenging. Current methods often scale poorly, which limits their applicability to large datasets, or make restrictive assumptions about feature-property relationships, which hinders their ability to capture complex interactions. BoUTS, a general and scalable feature selection algorithm, overcomes these limitations by identifying both universal features relevant to all datasets and task-specific features predictive for specific subsets of datasets. Evaluated on seven diverse chemical regression datasets, BoUTS achieves state-of-the-art feature sparsity while maintaining prediction accuracy comparable to that of specialized methods. Notably, BoUTS's universal features enable domain-specific knowledge transfer between datasets and suggest deep connections among seemingly disparate chemical datasets. We expect these results to have important repercussions for manually guided inverse problems. Beyond its current application, BoUTS holds considerable potential for elucidating data-poor systems by leveraging information from similar data-rich systems. BoUTS represents a significant advance in cross-domain feature selection, potentially leading to progress in various scientific fields.


