As a fundamental task in natural language processing, word embedding maps each word to a representation in a vector space. A challenge with word embedding is that as the vocabulary grows, the dimension of the vector space increases, which can lead to a vast model size. Storing and processing word vectors is resource-demanding, especially for mobile edge-device applications. This paper explores dimension reduction for word embeddings. To balance computational cost and performance, we propose an efficient and effective weakly-supervised feature selection method named WordFS. It has two variants, each utilizing novel criteria for feature selection. Experiments on various tasks (e.g., word and sentence similarity, and binary and multi-class classification) indicate that the proposed WordFS model outperforms other dimension reduction methods at lower computational costs. We have released the code for reproducibility along with the paper.
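To make the idea of dimension reduction via feature selection concrete, the sketch below keeps a subset of the original embedding dimensions using a generic variance-based criterion. This is only an illustrative baseline under assumed data, not the WordFS criteria, which the paper itself defines; the embedding matrix here is random.

```python
import numpy as np

# Hypothetical embedding matrix: 5 words, each an 8-dimensional vector.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 8))

# Generic feature-selection baseline (NOT the WordFS criteria):
# keep the k embedding dimensions with the highest variance across
# the vocabulary and drop the rest, so each word vector shrinks
# from 8 to k dimensions with no retraining.
k = 3
variances = embeddings.var(axis=0)
keep = np.sort(np.argsort(variances)[::-1][:k])  # indices of retained dims
reduced = embeddings[:, keep]

print(reduced.shape)  # (5, 3)
```

Unlike projection methods such as PCA, feature selection simply discards coordinates, so the reduced vectors remain directly interpretable as a subset of the original dimensions and the reduction step costs only a single pass over the vocabulary.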