A statistical significance testing approach for measuring term burstiness with applications to domain-specific terminology extraction

A term in a corpus is said to be ``bursty'' (or overdispersed) when its occurrences are concentrated in a few out of many documents. In this paper, we propose Residual Inverse Collection Frequency (RICF), a heuristic for quantifying term burstiness that is inspired by statistical significance testing. The chi-squared test is, to our knowledge, the sole test of statistical significance among existing term burstiness measures. Its term burstiness scores are computed from the collection frequency statistic (i.e., the proportion of all term occurrences in a corpus that are accounted for by a specified term). These scores do not, however, draw on the document frequency of a term (i.e., the proportion of documents in a corpus in which the term occurs), a statistic that certain other widely used term burstiness measures exploit. RICF addresses this shortcoming of the chi-squared test by systematically incorporating both the collection frequency and document frequency statistics into its term burstiness scores. We evaluate the RICF measure on a domain-specific technical terminology extraction task using the GENIA Term corpus benchmark, which comprises 2,000 annotated biomedical article abstracts. RICF matched or outperformed the chi-squared test in precision at k, with percent improvements of 0.00% (P@10), 6.38% (P@50), 6.38% (P@100), 2.27% (P@500), 2.61% (P@1000), and 1.90% (P@5000). Furthermore, RICF was competitive with other well-established measures of term burstiness. Based on these findings, we consider our contributions a promising starting point for future work on leveraging statistical significance testing in text analysis.
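To make the two statistics and the evaluation metric concrete, the following is a minimal Python sketch, not the authors' implementation. It computes collection frequency and document frequency as defined above, along with precision at k. The function names, the toy corpus, and the gold term set are illustrative assumptions; the RICF score itself is not reproduced here because its formula is not stated in this abstract.

```python
from collections import Counter

def collection_frequency(docs):
    """Proportion of all term occurrences in the corpus accounted for by each term."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

def document_frequency(docs):
    """Proportion of documents in the corpus in which each term occurs."""
    counts = Counter(tok for doc in docs for tok in set(doc))
    return {term: n / len(docs) for term, n in counts.items()}

def precision_at_k(ranked_terms, gold_terms, k):
    """Fraction of the top-k ranked terms that appear in the gold term set."""
    return sum(1 for term in ranked_terms[:k] if term in gold_terms) / k

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["il2", "gene", "il2", "il2"],
    ["gene", "the"],
    ["the", "cell"],
    ["the", "gene"],
]

cf = collection_frequency(docs)
df = document_frequency(docs)

# "il2" and "the" have the same collection frequency, but "il2" occurs in
# only one of four documents, which is the pattern described as bursty.
print(cf["il2"], df["il2"])
print(cf["the"], df["the"])

# Hypothetical ranking and gold annotations, for illustration only.
print(precision_at_k(["il2", "the", "cell"], {"il2", "cell"}, k=2))
```

In this toy corpus the two terms are indistinguishable by collection frequency alone, which illustrates why a measure that also uses document frequency can separate a bursty term from an evenly spread one.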
