A fundamental issue in natural language processing is the robustness of models to changes in the input. One critical step in processing text is document embedding, which transforms sequences of words or tokens into vector representations. We formally prove that popular embedding schemes, such as concatenation, TF-IDF, and Paragraph Vector (a.k.a. doc2vec), are robust in the Hölder or Lipschitz sense with respect to the Hamming distance. We provide quantitative bounds for these schemes and show how the constants involved depend on the length of the document. These findings are illustrated through a series of numerical examples.
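As a rough illustration of what Lipschitz robustness with respect to the Hamming distance means, the sketch below uses plain relative term frequencies, a simpler stand-in for TF-IDF and not one of the schemes or bounds analyzed in this work. It relies on an elementary fact: substituting one token changes at most two raw counts by one each, so two documents of equal length n at Hamming distance k have term-frequency vectors at most 2k/n apart in the L1 norm. The helper names (`hamming`, `term_frequency`, `l1_distance`) and the toy documents are assumptions made for this example.

```python
from collections import Counter
from fractions import Fraction


def hamming(doc_a, doc_b):
    """Number of positions at which two equal-length token lists differ."""
    if len(doc_a) != len(doc_b):
        raise ValueError("Hamming distance needs documents of equal length")
    return sum(a != b for a, b in zip(doc_a, doc_b))


def term_frequency(doc, vocab):
    """Relative term frequencies over a fixed vocabulary (exact fractions)."""
    counts = Counter(doc)
    n = len(doc)
    return [Fraction(counts[word], n) for word in vocab]


def l1_distance(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(u, v))


# Toy documents (assumed for illustration): same length, two substitutions.
doc_a = "the cat sat on the mat".split()
doc_b = "the dog sat on the rug".split()
vocab = sorted(set(doc_a) | set(doc_b))

k = hamming(doc_a, doc_b)          # number of substituted tokens
n = len(doc_a)                     # document length
gap = l1_distance(term_frequency(doc_a, vocab), term_frequency(doc_b, vocab))
bound = Fraction(2 * k, n)         # elementary Lipschitz-type bound 2k/n

print(f"Hamming distance k = {k}, length n = {n}")
print(f"L1 gap between embeddings = {gap}, bound 2k/n = {bound}")
assert gap <= bound
```

The bound 2k/n shrinks as the document grows, which gives an informal feel for how a Lipschitz constant can depend on document length; the constants for concatenation, TF-IDF, and Paragraph Vector are those established in the work itself, not the one shown here.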