This work traces the evolution of word-embedding techniques in the natural language processing (NLP) literature. We collect and analyze 149 research articles published between 1954 and 2025, providing both a comprehensive methodological review and a data-driven bibliometric analysis of how representation learning has developed over seven decades. Our study covers four major embedding paradigms: statistical representation-based methods (one-hot encoding, bag-of-words, TF-IDF), static word embeddings (Word2Vec, GloVe, FastText), contextual word embeddings (ELMo, BERT, GPT), and sentence/document embeddings. For each category, we critically discuss its strengths, its limitations, and the intellectual lineage that connects it to the others. Beyond the methodological survey, we conduct a formal era comparison that uses the release of GPT-3 as the dividing line, applying seven hypothesis tests to quantify shifts in research focus, collaboration patterns, and institutional involvement. Our analysis reveals a dramatic post-GPT-3 paradigm shift: the odds that an article focuses on contextual or sentence-level methods are 6.4 times those of the pre-GPT-3 era, mean team size has grown significantly (p = 0.018), and 30 entirely new techniques have emerged while 54 pre-GPT-3 methods received no further attention. These findings, combined with evidence of rising industry involvement, provide a quantitative account of how the field's epistemic priorities have been reshaped by the advent of large language models.
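To make the "6.4 times the odds" figure concrete, the sketch below shows how an odds ratio is computed from a 2x2 table of article counts (era by method family). The abstract does not report the underlying counts, so the numbers here are hypothetical and were chosen only so that the arithmetic comes out at 6.4; the function name and its parameters are likewise illustrative assumptions, not the authors' code.

```python
def odds_ratio(post_focus: int, post_other: int,
               pre_focus: int, pre_other: int) -> float:
    """Odds ratio for a 2x2 table.

    post_focus / post_other: post-era articles that do / do not focus on
    the method family of interest; pre_focus / pre_other: the same for
    the pre-era. All four cells must be positive for the ratio to exist.
    """
    cells = (post_focus, post_other, pre_focus, pre_other)
    if any(c <= 0 for c in cells):
        raise ValueError("all four cells must be positive")
    post_odds = post_focus / post_other   # odds within the later era
    pre_odds = pre_focus / pre_other      # odds within the earlier era
    return post_odds / pre_odds


# Hypothetical counts (NOT taken from the study), for illustration only.
example = odds_ratio(post_focus=30, post_other=10, pre_focus=15, pre_other=32)
print(f"odds ratio = {example:.1f}")
```

An odds ratio above 1 means the method family is relatively more common in the later era. It compares odds rather than proportions, so a value of 6.4 does not mean the share of such articles grew 6.4-fold.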