Machine learning, by virtue of its data-driven nature, holds tremendous promise for transforming the practice of scientific discovery. As research data accumulate at an ever-increasing rate, it is appealing to autonomously mine observational data for patterns and insights that reveal novel classes of phenotypes and concepts. In the biomedical domain, however, several challenges inherent in the accumulated data hamper progress in novel class discovery. The non-i.i.d. data distribution, together with severe imbalance among different groups of classes, leads to ambiguous and biased semantic representations. In this work, we present a geometry-constrained probabilistic modeling approach to resolve these issues. First, we parameterize the approximate posterior of each instance embedding as a marginal von Mises-Fisher distribution to account for the interference of distributional latent bias. Then, we incorporate a suite of critical geometric properties to constrain the layout of the constructed embedding space, which in turn minimizes the uncontrollable risk in learning and structuring unknown classes. Furthermore, we devise a spectral graph-theoretic method to estimate the number of potential novel classes. Compared with existing approaches, it offers two merits: high computational efficiency and flexibility for taxonomy-adaptive estimation. Extensive experiments across various biomedical scenarios substantiate the effectiveness and general applicability of our method.
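The abstract does not spell out how the spectral graph-theoretic estimator works. As a rough illustration of the general idea only, the sketch below uses the standard eigengap heuristic on a k-nearest-neighbor cosine-similarity graph; this is a swapped-in textbook technique, not the authors' method. The function name `estimate_num_clusters`, its parameters `n_neighbors` and `max_clusters`, and the synthetic data are all hypothetical choices made for this example.

```python
import numpy as np


def estimate_num_clusters(embeddings, n_neighbors=10, max_clusters=20):
    """Eigengap heuristic (an assumption, not the paper's estimator).

    Builds a symmetric kNN graph from cosine similarities, forms the
    normalized graph Laplacian, and returns the position of the largest
    gap among its smallest eigenvalues. Requires n_neighbors < n_points.
    """
    # L2-normalize rows so that dot products are cosine similarities.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-matches from the search

    n = sim.shape[0]
    # Indices of the n_neighbors most similar points for every row.
    idx = np.argpartition(-sim, n_neighbors, axis=1)[:, :n_neighbors]
    adj = np.zeros((n, n))
    rows = np.repeat(np.arange(n), n_neighbors)
    adj[rows, idx.ravel()] = 1.0
    adj = np.maximum(adj, adj.T)  # symmetrize: keep an edge if either end chose it

    # Normalized Laplacian L = I - D^{-1/2} A D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    lap = np.eye(n) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]

    # eigvalsh returns eigenvalues in ascending order.
    eigvals = np.linalg.eigvalsh(lap)[: max_clusters + 1]
    gaps = np.diff(eigvals)
    # k near-zero eigenvalues followed by a jump suggests k clusters.
    return int(np.argmax(gaps)) + 1


# Synthetic demo: three tight groups around orthogonal directions.
rng = np.random.default_rng(0)
centers = np.eye(16)[:3]
points = np.concatenate(
    [c + 0.05 * rng.standard_normal((40, 16)) for c in centers]
)
k_hat = estimate_num_clusters(points)
print("estimated number of clusters:", k_hat)
```

The heuristic rests on the fact that a graph with k connected components has exactly k zero eigenvalues in its Laplacian, so a large jump after the k-th smallest eigenvalue hints at k well-separated groups. On imbalanced, non-i.i.d. data of the kind the abstract describes, a plain eigengap can be unreliable, which is presumably part of what the proposed estimator is meant to address.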