Statistical learning on measures: an application to persistence diagrams

We consider a binary supervised classification problem in which, instead of data in a finite-dimensional Euclidean space, we observe measures on a compact space $\mathcal{X}$. Formally, we observe data $D_N = (\mu_1, Y_1), \ldots, (\mu_N, Y_N)$, where each $\mu_i$ is a measure on $\mathcal{X}$ and $Y_i$ is a label in $\{0, 1\}$. Given a set $\mathcal{F}$ of base classifiers on $\mathcal{X}$, we build corresponding classifiers on the space of measures. We provide upper and lower bounds on the Rademacher complexity of this new class of classifiers, expressed simply in terms of the corresponding quantities for the class $\mathcal{F}$. If the measures $\mu_i$ are uniform over a finite set, this classification task reduces to a multi-instance learning problem. Our approach, however, allows more flexibility and diversity in the input data. While such a framework has many possible applications, this work focuses on classifying data via topological descriptors called persistence diagrams. These objects are discrete measures on $\mathbb{R}^2$, where the coordinates of each point correspond to the range of scales at which a topological feature exists. We present several classifiers on measures and show, both heuristically and theoretically, how they can achieve good classification performance in various settings in the case of persistence diagrams.
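
To make the idea of lifting a base classifier from $\mathcal{X}$ to measures concrete, here is a minimal sketch in Python. The abstract does not specify the construction, so this rests on an assumption: a discrete measure is classified by integrating a base function $f$ against it and thresholding the result. The names `integrate`, `classify_measure`, and `in_high_persistence_region`, the threshold, and the toy diagrams are all hypothetical illustrations.

```python
from typing import Callable, Optional, Sequence, Tuple

Point = Tuple[float, float]  # (birth, death) coordinates of a diagram point


def integrate(f: Callable[[Point], float],
              points: Sequence[Point],
              weights: Optional[Sequence[float]] = None) -> float:
    """Integral of f against the discrete measure sum_i w_i * delta_{x_i}.

    With no weights given, every point has mass 1, which is how a
    persistence diagram is usually read as a measure.
    """
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * f(p) for p, w in zip(points, weights))


def classify_measure(f: Callable[[Point], float],
                     points: Sequence[Point],
                     threshold: float,
                     weights: Optional[Sequence[float]] = None) -> int:
    """Assumed lifting rule: label 1 if the integral of f exceeds the threshold."""
    return 1 if integrate(f, points, weights) > threshold else 0


def in_high_persistence_region(p: Point, min_persistence: float = 0.5) -> float:
    """Hypothetical base classifier on R^2: 1 if the feature lives long enough."""
    birth, death = p
    return 1.0 if death - birth >= min_persistence else 0.0


# Toy diagrams (made-up values): diagram_a has two long-lived features,
# diagram_b has only short-lived ones.
diagram_a = [(0.0, 1.0), (0.2, 0.9), (0.1, 0.15)]
diagram_b = [(0.0, 0.1), (0.3, 0.35)]

label_a = classify_measure(in_high_persistence_region, diagram_a, threshold=1.5)
label_b = classify_measure(in_high_persistence_region, diagram_b, threshold=1.5)
print(label_a, label_b)
```

With unit weights the integral simply counts the diagram points falling in the region selected by the base classifier. Passing weights that sum to one would instead give the uniform-measure case that the abstract relates to multi-instance learning.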
