Embedding retrofitting adjusts pre-trained word vectors using knowledge graph constraints to improve domain-specific retrieval. However, the effectiveness of retrofitting depends critically on knowledge graph quality, which in turn depends on text preprocessing. This paper presents a data engineering framework that addresses data quality degradation caused by annotation artifacts in real-world corpora. The analysis shows that hashtag annotations inflate knowledge graph density, creating spurious edges that corrupt the retrofitting objective. On noisy graphs, all retrofitting techniques produce statistically significant degradation ($-3.5\%$ to $-5.2\%$, $p<0.05$). After preprocessing, \acrshort{ewma} retrofitting achieves a $+6.2\%$ improvement ($p=0.0348$), with benefits concentrated in quantitative synthesis questions ($+33.8\%$ average). The gap between clean and noisy preprocessing (a swing of over $10\%$) exceeds the gap between algorithms ($3\%$), establishing preprocessing quality as the primary determinant of retrofitting success.
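To make the corruption mechanism concrete: retrofitting objectives of the standard form (following Faruqui et al.'s formulation; the paper's exact variant may differ) minimize

\begin{equation}
\Psi(Q) \;=\; \sum_{i=1}^{n} \Big[ \alpha_i \,\lVert q_i - \hat{q}_i \rVert^2 \;+\; \sum_{(i,j) \in E} \beta_{ij} \,\lVert q_i - q_j \rVert^2 \Big],
\end{equation}

where $\hat{q}_i$ are the pre-trained vectors, $q_i$ the retrofitted vectors, and $E$ the knowledge graph's edge set. Because every edge $(i,j) \in E$ contributes an attraction term $\beta_{ij}\,\lVert q_i - q_j \rVert^2$, spurious hashtag-induced edges pull semantically unrelated vectors together, which is consistent with the uniform degradation observed on noisy graphs.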