Successful machine learning methods require a trade-off between memorization and generalization. With too much memorization, the model cannot generalize to unobserved examples; with too much generalization, it risks under-fitting the data. While model performance is commonly measured through cross-validation and accuracy metrics, how should these algorithms cope in extremely under-determined domains, where accuracy is always unsatisfactory? We present a novel structure learning approach for probabilistic graphical models that can learn, generalize, and explain in these elusive domains by operating at the level of random variable instantiations. Using Minimum Description Length (MDL) analysis, we propose a new decomposition of the learning problem over all training exemplars, fusing minimal-entropy inferences to construct a final knowledge base. By leveraging Bayesian Knowledge Bases (BKBs), a framework that operates at the instantiation level and inherently subsumes Bayesian Networks (BNs), we develop both a theoretical MDL score and an associated structure learning algorithm that demonstrates significant improvements over learned BNs on 40 benchmark datasets. Further, our algorithm incorporates recent off-the-shelf DAG learning techniques, enabling tractable results even on large problems. Finally, we demonstrate the utility of our approach in a significantly under-determined domain by learning gene regulatory networks from breast cancer gene mutation data available from The Cancer Genome Atlas (TCGA).
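To make the memorization versus generalization trade-off behind an MDL score concrete, the following is a minimal sketch of the classical two-part MDL score for a discrete Bayesian network structure. It is not the BKB score developed in this work: the function name `mdl_score`, the data layout, and the toy dataset are all assumptions chosen for illustration. The score adds a model cost (bits to encode the parameters) to a data cost (bits to encode the data given the structure), and a lower total is better.

```python
import math
from collections import Counter

def mdl_score(data, parents):
    """Two-part MDL score (in bits) of a discrete BN structure.

    data:    list of dicts mapping variable name -> observed value
    parents: dict mapping variable name -> tuple of parent names
    (Both the name and the layout are illustrative assumptions.)
    """
    n = len(data)
    # Number of states per variable, taken from the observed values.
    arity = {v: len({row[v] for row in data}) for v in parents}
    total = 0.0
    for var, pa in parents.items():
        # Model cost: (log2 n) / 2 bits per free parameter.
        n_parent_configs = math.prod(arity[p] for p in pa)
        n_params = (arity[var] - 1) * n_parent_configs
        model_bits = 0.5 * math.log2(n) * n_params
        # Data cost: n times the empirical conditional entropy H(var | pa).
        joint = Counter((tuple(row[p] for p in pa), row[var]) for row in data)
        marg = Counter(tuple(row[p] for p in pa) for row in data)
        data_bits = -sum(c * math.log2(c / marg[cfg])
                         for (cfg, _), c in joint.items())
        total += model_bits + data_bits
    return total

# Toy data in which B is an exact copy of A (hypothetical example).
data = [{"A": a, "B": a} for a in (0, 1) for _ in range(4)]
empty = {"A": (), "B": ()}      # no edges
chain = {"A": (), "B": ("A",)}  # edge A -> B

print(mdl_score(data, empty))   # 19.0
print(mdl_score(data, chain))   # 12.5
```

In this toy case the edge A -> B costs 1.5 extra bits of parameters but removes all 8 bits needed to encode B, so the structure with the edge gets the lower score. With fewer samples or a weaker dependency, the parameter penalty can outweigh the savings, which is the under-fitting versus over-fitting balance the score is meant to capture.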