Word embedding methods (WEMs) are widely used to represent text data. The dimensionality of these embeddings varies across tasks and implementations. The effect of dimensionality on the accuracy of downstream tasks is well explored. How a change in dimensionality affects the bias of word embeddings, however, remains to be investigated. Using the English Wikipedia corpus, we study this effect for two static WEMs (Word2Vec and fastText) and two context-sensitive WEMs (ELMo and BERT). We make two observations. First, the bias of word embeddings varies significantly with dimensionality. Second, there is no uniform pattern in how a change in dimensionality affects that bias. Both observations should be considered when selecting the dimensionality of word embeddings.