This position paper argues that text embedding research should move beyond surface meaning and embrace implicit semantics as a central modeling objective. Text embeddings are a foundational component of modern NLP, underpinning a wide range of applications and driving sustained research progress. Yet most embedding models remain narrowly focused on surface-level semantics, whereas linguistic theory emphasizes that much of human meaning is implicit, shaped by pragmatics, speaker intent, and sociocultural context. Current models are typically trained on datasets that lack such depth and are evaluated on benchmarks that reward surface similarity. As a result, they struggle with tasks that require interpretive reasoning, stance recognition, or socially grounded understanding. Our pilot study makes this limitation explicit: even state-of-the-art embeddings achieve only marginal improvements over simple lexical baselines on tasks probing implicit semantics. We therefore call for a paradigm shift in which embedding research prioritizes linguistically grounded and diverse training data, develops benchmarks that probe deeper semantic understanding, and treats implicit meaning as a core modeling objective, so that embeddings better reflect the complexity of real-world language. The code is available at http://github.com/dukesun99/Implicit-Embeddings.