Signature Methods in Machine Learning

Signature-based techniques give mathematical insight into the interactions between complex streams of evolving data. These insights translate naturally into numerical approaches to understanding streamed data and, perhaps because of their mathematical precision, have proved useful for analysing streamed data that is irregular and non-stationary, where the dimension of the data and the sample sizes are both moderate. Understanding streamed multi-modal data is exponentially complex: a word of $n$ letters from an alphabet of size $d$ can be any one of $d^n$ messages. Signatures remove the exponential amount of noise that arises from sampling irregularity, but an exponential amount of information still remains. This survey aims to stay in the domain where that exponential scaling can be managed directly. Scalability is an important challenge in many problems, but addressing it would require another survey article and further ideas. This survey describes a range of contexts where the data sets are small enough to rule out massive machine learning, and where small sets of context-free, principled features can be used effectively. The mathematical nature of the tools can make their use intimidating to non-mathematicians. The examples presented in this article are intended to bridge this communication gap and to provide tractable working examples drawn from the machine learning context. Notebooks are available online for several of these examples. This survey builds on the earlier paper of Ilya Chevyrev and Andrey Kormilitzin, which had broadly similar aims at an earlier point in the development of this machinery. This article illustrates how the theoretical insights offered by signatures are simply realised in the analysis of application data, in a way that is largely agnostic to the data type.
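The two quantitative claims above, exponential growth in the number of terms and insensitivity to sampling irregularity, can be made concrete with a minimal sketch. It assumes a piecewise-linear path and truncates the signature at depth 2. The function names `signature_size` and `signature_depth2` are hypothetical and chosen for illustration only; they do not come from the survey, its notebooks, or any signature library.

```python
import numpy as np


def signature_size(d: int, depth: int) -> int:
    """Count the signature terms at levels 1..depth for a d-channel path.

    Level k has d**k terms, one per word of k letters from d channels.
    """
    return sum(d ** k for k in range(1, depth + 1))


def signature_depth2(path: np.ndarray):
    """Levels 1 and 2 of the signature of a piecewise-linear path.

    `path` has shape (num_points, d). The result is accumulated one linear
    segment at a time using Chen's identity (illustrative sketch only).
    """
    d = path.shape[1]
    level1 = np.zeros(d)
    level2 = np.zeros((d, d))
    for inc in np.diff(path, axis=0):
        # A straight segment with increment `inc` has level-2 term inc (x) inc / 2.
        # Chen's identity adds the cross term level1 (x) inc when concatenating.
        level2 += np.outer(level1, inc) + np.outer(inc, inc) / 2.0
        level1 += inc
    return level1, level2


# The same geometric path sampled at two different sets of points.
coarse = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
fine = np.array([[0.0, 0.0], [0.3, 0.0], [1.0, 0.0], [1.0, 0.5], [1.0, 1.0]])

s1_coarse, s2_coarse = signature_depth2(coarse)
s1_fine, s2_fine = signature_depth2(fine)

# Resampling along the same trace leaves the signature unchanged.
same_signature = np.allclose(s1_coarse, s1_fine) and np.allclose(s2_coarse, s2_fine)

# The antisymmetric part of level 2 is the signed (Levy) area of the path.
levy_area = 0.5 * (s2_coarse[0, 1] - s2_coarse[1, 0])
```

In this sketch the number of terms grows as $d + d^2 + \dots + d^N$ with the truncation depth $N$, which is why the moderate-dimension regime discussed above is the one where the features can be handled directly.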
