A fundamental issue in machine learning is the robustness of a model with respect to changes in its input. In natural language processing, models typically begin with an embedding layer that transforms a sequence of tokens into vector representations. While robustness with respect to changes in continuous inputs is well understood, the situation is less clear for discrete changes, for instance replacing one word with another in an input sentence. Our work formally proves that popular embedding schemes, such as concatenation, TF-IDF, and Paragraph Vector (a.k.a. doc2vec), are robust in the Hölder or Lipschitz sense with respect to the Hamming distance. We provide quantitative bounds for these schemes and show how the constants involved depend on the length of the document. These findings are illustrated through a series of numerical examples.
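To make the notion of Lipschitz robustness with respect to the Hamming distance concrete, the following minimal sketch uses a deliberately simplified stand-in: a normalized term-frequency vector without IDF weights, which is not one of the schemes analyzed in this work. For two documents x and y of the same length T, each word replacement changes at most two counts by one, so the Euclidean distance between the two vectors is at most sqrt(2) * d_H(x, y) / T. This elementary bound, the helper names, and the toy vocabulary are illustrative assumptions and do not reproduce the constants obtained here.

```python
import math
import random
from collections import Counter


def hamming(doc_a, doc_b):
    """Number of positions at which two equal-length token lists differ."""
    if len(doc_a) != len(doc_b):
        raise ValueError("Hamming distance needs documents of equal length")
    return sum(a != b for a, b in zip(doc_a, doc_b))


def term_frequency(doc, vocab):
    """Normalized term-frequency vector of doc over a fixed vocabulary.

    Simplified stand-in for TF-IDF: no IDF weighting is applied.
    """
    counts = Counter(doc)
    return [counts[w] / len(doc) for w in vocab]


def euclidean(u, v):
    """Euclidean distance between two vectors of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def perturb(doc, vocab, k, rng):
    """Copy of doc with k randomly chosen positions resampled from vocab.

    A resampled token may equal the original one, so the Hamming
    distance to doc is at most k.
    """
    out = list(doc)
    for i in rng.sample(range(len(doc)), k):
        out[i] = rng.choice(vocab)
    return out


def bound_holds(doc_a, doc_b, vocab, tol=1e-12):
    """Check ||tf(a) - tf(b)||_2 <= sqrt(2) * d_H(a, b) / T."""
    dist = euclidean(term_frequency(doc_a, vocab), term_frequency(doc_b, vocab))
    bound = math.sqrt(2) * hamming(doc_a, doc_b) / len(doc_a)
    return dist <= bound + tol


if __name__ == "__main__":
    rng = random.Random(0)
    vocab = ["the", "cat", "dog", "sat", "ran", "mat"]  # hypothetical toy vocabulary
    for length in (10, 100, 1000):
        doc = [rng.choice(vocab) for _ in range(length)]
        other = perturb(doc, vocab, k=3, rng=rng)
        dist = euclidean(term_frequency(doc, vocab), term_frequency(other, vocab))
        print(length, hamming(doc, other), dist, bound_holds(doc, other, vocab))
```

In this toy setting the Lipschitz constant sqrt(2) / T shrinks as the document gets longer, which gives a simple picture of how such constants can depend on document length.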