Signature Methods in Machine Learning

Signature-based techniques give mathematical insight into the interactions between complex streams of evolving data. These insights translate quite naturally into numerical approaches to understanding streamed data and, perhaps because of their mathematical precision, have proved useful in analysing streamed data in situations where the data is irregular and non-stationary, and where the dimension of the data and the sample sizes are both moderate. The difficulty of understanding streamed multi-modal data is exponential: a word of $n$ letters from an alphabet of size $d$ can be any one of $d^n$ messages. Signatures remove the exponential amount of noise that arises from sampling irregularity, but an exponential amount of information still remains. This survey aims to stay in the domain where that exponential scaling can be managed directly. Scalability issues are an important challenge in many problems, but addressing them would require another survey article and further ideas. This survey describes a range of contexts where the data sets are small enough to rule out massive machine learning, and where small sets of context-free, principled features can be used effectively. The mathematical nature of the tools can make their use intimidating to non-mathematicians. The examples presented in this article are intended to bridge this communication gap and provide tractable working examples drawn from the machine learning context. Notebooks are available online for several of these examples. This survey builds on the earlier paper of Ilya Chevyrev and Andrey Kormilitzin, which had broadly similar aims at an earlier point in the development of this machinery. This article illustrates how the theoretical insights offered by signatures are simply realised in the analysis of application data in a way that is largely agnostic to the data type.
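The two quantitative claims above, that a depth-$n$ signature level of a $d$-dimensional stream carries $d^n$ terms and that signatures are unaffected by sampling irregularity, can be checked on a toy example. The following is a minimal sketch, assuming a piecewise-linear path and truncation at depth 2. The helper names `signature_depth2` and `resample` are hypothetical, written only for this illustration, and are not taken from the survey's notebooks or from any signature library.

```python
import numpy as np

def signature_depth2(path):
    """Depth-2 signature of a piecewise-linear path given as an (m, d) array of points."""
    path = np.asarray(path, dtype=float)
    d = path.shape[1]
    level1 = np.zeros(d)        # d terms: the total increment
    level2 = np.zeros((d, d))   # d^2 terms: the iterated integrals
    for delta in np.diff(path, axis=0):
        # Chen's identity for appending one straight segment with increment delta
        level2 += np.outer(level1, delta) + 0.5 * np.outer(delta, delta)
        level1 += delta
    return level1, level2

def resample(path, times):
    """Evaluate the same piecewise-linear path at parameter values in [0, 1]."""
    path = np.asarray(path, dtype=float)
    knots = np.linspace(0.0, 1.0, len(path))
    columns = [np.interp(times, knots, path[:, j]) for j in range(path.shape[1])]
    return np.column_stack(columns)

# A small two-dimensional stream (illustrative values, not data from the survey).
path = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])

# Irregular sampling: keep the original corner points and add randomly placed extra samples.
rng = np.random.default_rng(0)
knots = np.linspace(0.0, 1.0, len(path))
times = np.union1d(knots, rng.uniform(0.0, 1.0, size=25))
irregular = resample(path, times)

s1, s2 = signature_depth2(path)
r1, r2 = signature_depth2(irregular)

n_terms = s1.size + s2.size                     # d + d^2 terms up to depth 2
same_signature = np.allclose(s1, r1) and np.allclose(s2, r2)
print(n_terms, same_signature)
```

The extra sample points lie on the original segments, so the sampled path traces the same curve at uneven speed, and the signature terms agree up to floating-point error. Sampling that skips the corner points would describe a different path and would in general change the signature.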
