This paper describes RETVec, an efficient, resilient, and multilingual text vectorizer designed for neural-based text processing. RETVec combines a novel character encoding with an optional small embedding model to embed words into a 256-dimensional vector space. The RETVec embedding model is pre-trained using pair-wise metric learning to be robust against typos and character-level adversarial attacks. In this paper, we evaluate and compare RETVec to state-of-the-art vectorizers and word embeddings on popular model architectures and datasets. These comparisons demonstrate that RETVec leads to competitive, multilingual models that are significantly more resilient to typos and adversarial text attacks. RETVec is available under the Apache 2 license at https://github.com/google-research/retvec.
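As a rough illustration of the character-encoding idea summarized above (a minimal sketch, not the library's actual implementation), the snippet below encodes each character's Unicode codepoint as a fixed-width binary vector and pads words to a fixed length before any learned embedding is applied; the 16-character and 24-bit parameters here are illustrative assumptions.

```python
import numpy as np

def encode_word(word: str, max_chars: int = 16, bits: int = 24) -> np.ndarray:
    """Encode a word as a (max_chars, bits) float matrix of codepoint bits.

    Each character's Unicode codepoint is expanded into a fixed-width binary
    vector (most significant bit first); words are truncated or zero-padded
    to `max_chars` characters, so no vocabulary or lookup table is needed.
    """
    out = np.zeros((max_chars, bits), dtype=np.float32)
    for i, ch in enumerate(word[:max_chars]):
        cp = ord(ch)
        out[i] = [(cp >> (bits - 1 - b)) & 1 for b in range(bits)]
    return out

# Works for any Unicode string, including out-of-vocabulary and misspelled words.
print(encode_word("hello").shape)      # (16, 24)
print(encode_word("héllo")[1][-8:])    # low-order bits of the codepoint for 'é'
```

In the full system, such a per-word character encoding would then be fed to the small pre-trained embedding model to produce the 256-dimensional word vectors described in the paper.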