Analysis of Diagnostics (Part II): Prevalence, Linear Independence, and Unsupervised Learning

This is the second manuscript in a two-part series that uses diagnostic testing to understand the connection between prevalence (i.e., the relative number of elements in a class), uncertainty quantification (UQ), and classification theory. Part I considered the context of supervised machine learning (ML) and established a duality between prevalence and the concept of relative conditional probability. The key idea of that analysis was to train a family of discriminative classifiers by minimizing a sum of prevalence-weighted empirical risk functions. The resulting outputs can be interpreted as relative probability level-sets, which in turn yield uncertainty estimates in the class labels. This procedure also demonstrated that certain discriminative and generative ML models are equivalent. Part II considers the extent to which these results can be extended to tasks in unsupervised learning by drawing on ideas from linear algebra. We first observe that the distribution of an impure population, for which the class of a corresponding sample is unknown, can be parameterized in terms of a prevalence. This motivates us to introduce the concept of linearly independent populations, which have different but unknown prevalence values. Using this concept, we identify an isomorphism between classifiers defined in terms of impure and pure populations. In certain cases, this also leads to a nonlinear system of equations whose solution yields the prevalence values of the linearly independent populations, fully realizing unsupervised learning as a generalization of supervised learning. We illustrate our methods in the context of synthetic data and a research-use-only SARS-CoV-2 enzyme-linked immunosorbent assay (ELISA).
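The role of linear independence can be illustrated with a minimal sketch, under assumptions that are not taken from the manuscript: a binary setting, measurement outcomes binned into a discrete histogram, and prevalence values that are known in advance. Writing each impure distribution as Q_i = q_i P + (1 - q_i) N, two impure populations determine the pure distributions P and N whenever the 2x2 mixing matrix is invertible, which happens exactly when q_1 differs from q_2. The names `mixing_matrix` and `unmix`, the symbols P, N, Q, q, and all numerical values are hypothetical. The unknown-prevalence case described above leads to a nonlinear system and is not attempted here.

```python
import numpy as np


def mixing_matrix(q1: float, q2: float) -> np.ndarray:
    """Rows hold the weights (q_i, 1 - q_i) of the pure classes in impure population i."""
    return np.array([[q1, 1.0 - q1],
                     [q2, 1.0 - q2]])


def unmix(Q: np.ndarray, q1: float, q2: float) -> np.ndarray:
    """Recover the pure distributions from two impure ones.

    Q has shape (2, n_bins); row i is the histogram of impure population i.
    Assumes q1 != q2, so that the mixing matrix is invertible.
    """
    return np.linalg.solve(mixing_matrix(q1, q2), Q)


# Hypothetical pure class distributions on four measurement bins.
P = np.array([0.1, 0.2, 0.3, 0.4])  # "positive" class
N = np.array([0.4, 0.3, 0.2, 0.1])  # "negative" class
pure = np.vstack([P, N])

# Two impure populations with different (here: assumed known) prevalences.
q1, q2 = 0.7, 0.2
Q = mixing_matrix(q1, q2) @ pure

# Each impure histogram is still a probability distribution.
row_sums = Q.sum(axis=1)

# Inverting the mixing recovers the pure distributions.
recovered = unmix(Q, q1, q2)

# The determinant equals q1 - q2, so equal prevalences make the system singular.
det_distinct = np.linalg.det(mixing_matrix(q1, q2))
det_equal = np.linalg.det(mixing_matrix(0.5, 0.5))
```

In this toy setting the determinant of the mixing matrix is q_1 - q_2, so populations with equal prevalence carry no additional information about the pure classes. This is one way to read the requirement that the populations be linearly independent.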
