The Transformer architecture is shown to provide a powerful machine transduction framework for online handwritten gestures corresponding to glyph strokes of natural language sentences. The attention mechanism is successfully used to create latent representations in an end-to-end encoder-decoder model, solving multi-level segmentation while also learning some language features and syntax rules. The additional use of a large decoding space with learned Byte-Pair Encoding (BPE) is shown to provide robustness to ablated inputs and syntax rules. The encoder stack is fed directly with spatio-temporal data tokens, which potentially form an infinitely large input vocabulary, an approach with applications beyond this work. Encoder transfer learning capabilities are also demonstrated on several languages, resulting in faster optimisation and shared parameters. A new supervised dataset of online handwriting gestures, suitable for generic handwriting recognition tasks, was used to train a small Transformer model to an average normalised Levenshtein accuracy of 96% on English or German sentences and 94% on French sentences.
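The reported metric, normalised Levenshtein accuracy, can be illustrated with a minimal sketch. The abstract does not state the exact normalisation, so the version below assumes accuracy = 1 - distance / max(len(reference), len(hypothesis)) at character level; the function names `levenshtein` and `normalised_accuracy` are hypothetical and chosen only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalised_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(reference), len(hypothesis))
    if longest == 0:
        return 1.0  # two empty strings are treated as a perfect match
    return 1.0 - levenshtein(reference, hypothesis) / longest


if __name__ == "__main__":
    print(normalised_accuracy("handwriting", "handwritting"))
```

Under this assumption, a score of 96% corresponds to roughly one character edit for every 25 characters of the longer string; other normalisations (for example, dividing by the reference length only) would give slightly different values.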