Taxonomies form the backbone of structured knowledge representation across diverse domains, enabling applications such as e-commerce catalogs, semantic search, and biomedical discovery. Yet manual taxonomy expansion is labor-intensive and cannot keep pace with the emergence of new concepts. Existing automated methods rely on point-based vector embeddings, which model symmetric similarity and thus struggle with the asymmetric "is-a" relationships that are fundamental to taxonomies. Box embeddings offer a promising alternative because they can represent containment and disjointness, but they face three key issues: (i) unstable gradients at intersection boundaries, (ii) no notion of semantic uncertainty, and (iii) limited capacity to represent polysemy or ambiguity. We address these shortcomings with TaxoBell, a Gaussian box embedding framework that translates between box geometries and multivariate Gaussian distributions, where means encode semantic location and covariances encode uncertainty. Energy-based optimization yields stable training, robust modeling of ambiguous concepts, and interpretable hierarchical reasoning. Extensive experiments on five benchmark datasets demonstrate that TaxoBell significantly outperforms eight state-of-the-art taxonomy expansion baselines, improving MRR by 19% and Recall@k by around 25%. We further examine the advantages and pitfalls of TaxoBell through error analysis and ablation studies.
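To make the box-to-Gaussian idea concrete, the following is a minimal sketch and not TaxoBell's actual formulation, which the abstract does not specify. It assumes a box parameterized by a center and per-dimension half-widths, maps it to a diagonal Gaussian whose standard deviation equals the half-width (the `scale` parameter is hypothetical), and uses KL divergence as one possible asymmetric energy. All names here (`Box`, `DiagGaussian`, `box_to_gaussian`, `kl_divergence`) are illustrative.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    center: List[float]   # semantic location of the concept
    offset: List[float]   # positive half-width per dimension


@dataclass
class DiagGaussian:
    mean: List[float]
    var: List[float]      # diagonal covariance entries


def box_to_gaussian(box: Box, scale: float = 1.0) -> DiagGaussian:
    """Assumed mapping: mean = box center, std = scale * half-width."""
    return DiagGaussian(
        mean=list(box.center),
        var=[(scale * o) ** 2 for o in box.offset],
    )


def kl_divergence(p: DiagGaussian, q: DiagGaussian) -> float:
    """KL(p || q) for diagonal Gaussians; asymmetric in p and q."""
    total = 0.0
    for mp, vp, mq, vq in zip(p.mean, p.var, q.mean, q.var):
        total += math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
    return 0.5 * total


# Hypothetical toy concepts: a broad parent and a narrow child inside it.
animal = box_to_gaussian(Box(center=[0.0, 0.0], offset=[4.0, 4.0]))
dog = box_to_gaussian(Box(center=[1.0, 1.0], offset=[1.0, 1.0]))

# The child-in-parent direction has lower energy than the reverse,
# which is the asymmetry an "is-a" relation needs.
print(kl_divergence(dog, animal) < kl_divergence(animal, dog))
```

Because the energy is a smooth function of means and variances, it has well-defined gradients everywhere, unlike a hard box intersection whose volume drops to zero when boxes stop overlapping. Under these assumptions, a larger variance can be read as a broader or more uncertain concept.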