We consider a binary supervised classification problem in which, instead of data in a finite-dimensional Euclidean space, we observe measures on a compact space $\mathcal{X}$. Formally, we observe data $D_N = (\mu_1, Y_1), \ldots, (\mu_N, Y_N)$, where $\mu_i$ is a measure on $\mathcal{X}$ and $Y_i$ is a label in $\{0, 1\}$. Given a set $\mathcal{F}$ of base classifiers on $\mathcal{X}$, we build corresponding classifiers on the space of measures. We provide upper and lower bounds on the Rademacher complexity of this new class of classifiers, expressed simply in terms of the corresponding quantities for the class $\mathcal{F}$. If the measures $\mu_i$ are uniform over a finite set, this classification task reduces to a multi-instance learning problem. Our approach, however, accommodates more flexible and diverse input data. While this framework has many possible applications, this work focuses on classifying data via topological descriptors called persistence diagrams. These objects are discrete measures on $\mathbb{R}^2$, where the coordinates of each point correspond to the range of scales at which a topological feature exists. We present several classifiers on measures and show, both heuristically and theoretically, how they can achieve good classification performance in various settings when applied to persistence diagrams.
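To make the setting concrete, the following is a minimal sketch of one way a base function on $\mathcal{X}$ could be lifted to a classifier on discrete measures: integrate the function against the measure and threshold the result. This lifting rule, the helper names `integrate`, `lift_classifier` and `long_lived`, the thresholds, and the toy diagram are all assumptions chosen for illustration; the abstract does not specify the construction actually used.

```python
import numpy as np


def integrate(f, points, weights):
    """Integral of f against the discrete measure sum_j weights[j] * delta_{points[j]}."""
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    values = np.array([f(x) for x in points], dtype=float)
    return float(np.sum(weights * values))


def lift_classifier(f, threshold):
    """Hypothetical lifting: label a measure 1 when its integral of f reaches the threshold."""
    def classify(points, weights):
        return int(integrate(f, points, weights) >= threshold)
    return classify


def long_lived(x):
    """Illustrative base function on R^2: 1.0 if the (birth, death) point has persistence > 0.5."""
    birth, death = x
    return float(death - birth > 0.5)


# Toy persistence diagram (assumed data): three (birth, death) points.
diagram = [(0.0, 1.0), (0.2, 0.3), (0.1, 0.9)]

# Unit weights: the integral counts the long-lived points.
unit_weights = [1.0, 1.0, 1.0]
count_rule = lift_classifier(long_lived, threshold=2.0)
label_count = count_rule(diagram, unit_weights)

# Uniform weights: the integral is the fraction of long-lived points,
# which matches the multi-instance reading of a finite bag of points.
uniform_weights = [1.0 / 3.0] * 3
fraction_rule = lift_classifier(long_lived, threshold=0.5)
label_fraction = fraction_rule(diagram, uniform_weights)
```

With unit weights the rule thresholds a count of points, whereas with uniform weights it thresholds a proportion, so the same base function yields different classifiers depending on how the input measure is normalized.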