Measuring the breadth of a word's meaning, that is, its spread across contexts, has become feasible with contextualized token embeddings. A word type can be represented as a cloud of token vectors, with dispersion-based statistics serving as proxies for contextual diversity (Nagata and Tanaka-Ishii, ACL 2025). Such measurements help decide appropriate sense distinctions when constructing thesauri and domain-specific dictionaries. However, when comparing the breadth of two word types, naive hypothesis testing on dispersion can be misleading: differences in semantic direction can masquerade as dispersion differences, inflating Type-I error and yielding "statistically significant" outcomes even when there is no true breadth difference. This is problematic because a significance test should separate genuine effects from incidental fluctuations, particularly when the differences are small. We propose a Householder-aligned permutation test that isolates dispersion differences from directional differences. The method applies a single Householder reflection to align the mean directions of the two word types and then performs a permutation test on the aligned token clouds, yielding calibrated, non-parametric p-values. For practicality, we introduce a GPU-oriented implementation that batches permutations and linear algebra operations. Empirically, the alignment reduced Type-I error by 32.5% while preserving sensitivity to genuine breadth differences, and the GPU implementation achieved a 23× speedup over the CPU baseline.
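To make the alignment-then-permutation idea concrete, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes each word type is given as an array of token vectors (one row per token), and it uses the mean squared distance to the centroid as the dispersion statistic; that statistic, the function names, and the two-sided p-value convention are illustrative assumptions, not details taken from the abstract. The sketch runs on the CPU and loops over permutations one at a time, so it does not reflect the batched GPU implementation.

```python
import numpy as np


def householder_align(X, Y, eps=1e-12):
    """Reflect cloud X so that its mean direction matches that of cloud Y.

    Uses the Householder reflection H = I - 2 w w^T / (w^T w) with
    w = u - v, where u and v are the unit mean directions of X and Y.
    H maps u onto v and, being orthogonal, preserves all distances in X.
    """
    u = X.mean(axis=0)
    u = u / np.linalg.norm(u)
    v = Y.mean(axis=0)
    v = v / np.linalg.norm(v)
    w = u - v
    ww = w @ w
    if ww < eps:  # mean directions already coincide
        return X.copy()
    # Apply H to every row without forming the d x d matrix explicitly.
    return X - 2.0 * np.outer(X @ w, w) / ww


def dispersion(X):
    """Assumed dispersion statistic: mean squared distance to the centroid."""
    centered = X - X.mean(axis=0)
    return float((centered ** 2).sum(axis=1).mean())


def aligned_permutation_test(X, Y, n_perm=2000, seed=0):
    """Two-sided permutation test on the dispersion difference after alignment."""
    rng = np.random.default_rng(seed)
    Xa = householder_align(X, Y)
    observed = dispersion(Xa) - dispersion(Y)

    pooled = np.vstack([Xa, Y])
    n = len(Xa)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        diff = dispersion(pooled[idx[:n]]) - dispersion(pooled[idx[n:]])
        if abs(diff) >= abs(observed):
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    p_value = (count + 1) / (n_perm + 1)
    return observed, p_value
```

Because the reflection is orthogonal, each cloud's own dispersion is unchanged by the alignment; only the pooled null distribution changes, since the two clouds no longer differ in mean direction when their labels are shuffled. Calling `aligned_permutation_test(X, Y)` on two token-vector arrays returns the observed dispersion difference and its permutation p-value.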