Extracting meaningful features from complex, high-dimensional datasets across scientific domains remains challenging. Current methods often scale poorly, which limits their applicability to large datasets, or make restrictive assumptions about feature-property relationships, which hinders their ability to capture complex interactions. BoUTS's general and scalable feature selection algorithm overcomes these limitations, identifying both universal features relevant to all datasets and task-specific features predictive for specific subsets. Evaluated on seven diverse chemical regression datasets, BoUTS achieves state-of-the-art feature sparsity while maintaining prediction accuracy comparable to that of specialized methods. Notably, BoUTS's universal features enable domain-specific knowledge transfer between datasets and suggest deep connections between seemingly disparate chemical datasets. We expect these results to have important implications for manually guided inverse problems. Beyond its current application, BoUTS could help elucidate data-poor systems by leveraging information from similar data-rich systems. BoUTS represents a significant step forward in cross-domain feature selection and may lead to advances in various scientific fields.