Text embeddings are essential components in modern NLP pipelines. Although numerous embedding models have been proposed, no single model consistently dominates across domains and tasks. This variability motivates ensemble techniques that combine complementary strengths. However, most existing ensemble methods operate on deterministic embeddings and fail to account for model-specific uncertainty, limiting their robustness and reliability in downstream applications. To address these limitations, we propose Uncertainty-driven Embedding Convolution (UEC). UEC first transforms deterministic embeddings into probabilistic ones in a post-hoc manner. It then computes adaptive ensemble coefficients based on embedding uncertainty, derived from a principled surrogate-loss formulation. Additionally, UEC employs an uncertainty-aware similarity function that incorporates uncertainty directly into similarity scoring, providing a theoretically grounded and efficient surrogate for distributional distances. Extensive experiments on diverse benchmarks demonstrate that UEC consistently improves both performance and robustness by leveraging principled uncertainty modeling.
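The pipeline described above (probabilistic embeddings, uncertainty-driven ensemble weights, uncertainty-aware similarity) can be sketched with a toy inverse-variance scheme. This is an illustrative assumption, not UEC's actual surrogate-loss coefficients or similarity function; the Gaussian parameterization and the variance-penalized cosine below are stand-ins chosen for simplicity.

```python
import numpy as np

def combine_gaussian_embeddings(means, variances, eps=1e-8):
    """Fuse per-model Gaussian embeddings by precision (inverse-variance)
    weighting, so that more confident models contribute more.

    Hypothetical stand-in for UEC's uncertainty-driven coefficients.
    means, variances: arrays of shape (n_models, dim).
    Returns the fused mean and fused variance, each of shape (dim,).
    """
    precisions = 1.0 / (variances + eps)            # per-dimension confidence
    weights = precisions / precisions.sum(axis=0)   # normalized ensemble weights
    mu = (weights * means).sum(axis=0)              # fused mean
    var = 1.0 / precisions.sum(axis=0)              # fused variance
    return mu, var

def uncertainty_aware_similarity(mu1, var1, mu2, var2):
    """Cosine similarity of the means, down-weighted by total uncertainty.

    A simple illustrative surrogate for a distributional distance:
    high-variance (uncertain) embeddings yield lower similarity scores.
    """
    cos = mu1 @ mu2 / (np.linalg.norm(mu1) * np.linalg.norm(mu2))
    penalty = 1.0 / (1.0 + var1.mean() + var2.mean())
    return cos * penalty
```

For example, a model with variance 0.1 on a dimension receives roughly 100x the weight of one with variance 10, so the fused mean tracks the confident model while the fused variance shrinks below either input's.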