Synthetic datasets constructed from formal languages allow fine-grained examination of the learning and generalization capabilities of machine learning systems for sequence classification. This article presents MLRegTest, a new benchmark for machine learning systems on sequence classification, which contains training, development, and test sets drawn from 1,800 regular languages. Different kinds of formal languages represent different kinds of long-distance dependencies, and correctly identifying long-distance dependencies in sequences is a known generalization challenge for ML systems. MLRegTest organizes its languages according to their logical complexity (monadic second order, first order, propositional, or monomial expressions) and the kind of logical literals they use (string, tier-string, subsequence, or combinations thereof). The logical complexity and the choice of literal provide a systematic way to understand different kinds of long-distance dependencies in regular languages, and therefore to understand the capacity of different ML systems to learn such dependencies. Finally, the performance of different neural networks (simple RNN, LSTM, GRU, transformer) on MLRegTest is examined. The main conclusion is that performance depends significantly on the kind of test set, the class of language, and the neural network architecture.
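To make the task concrete, the following sketch labels random strings for membership in a simple regular language with a long-distance dependency. The language shown (strings over {a, b, c} containing at most one 'b') is a hypothetical illustration of the general setup, not one of MLRegTest's 1,800 languages; membership is decided by a small DFA, and the labeled pairs form a binary sequence-classification dataset of the kind a neural network would be trained on.

```python
import random

def accepts(s):
    """DFA membership test for an illustrative regular language:
    accept iff the string contains at most one 'b', however far
    apart the symbols are (a long-distance dependency)."""
    state = 0                      # 0 = no 'b' seen, 1 = one 'b' seen, 2 = reject
    for ch in s:
        if ch == 'b':
            state = min(state + 1, 2)
    return state < 2

def sample_dataset(n, max_len=20, seed=0):
    """Generate (string, label) pairs for binary sequence classification."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        s = ''.join(rng.choice('abc') for _ in range(rng.randint(1, max_len)))
        data.append((s, accepts(s)))
    return data

print(accepts('acacba'))   # True: one 'b'
print(accepts('bacab'))    # False: two 'b's
```

Because the constraint relates symbols at arbitrary distances, a classifier cannot rely on local n-gram features alone, which is what makes such languages diagnostic of a model's ability to track long-distance dependencies.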