Domain-specific terminology extraction is an important task in text analysis. A term in a corpus is said to be "bursty" when its occurrences are concentrated in a few of the corpus's many documents. Being content-rich, bursty terms are well suited to subject matter characterization and are natural candidates for technical terminology. Multiple measures of term burstiness have been proposed in the literature. However, the statistical significance testing paradigm has remained underexplored in text analysis, including in relation to term burstiness. To test these waters, we propose as our main contribution an exact test of statistical significance for term burstiness, based on the multinomial language model. Because the exact test is computationally prohibitive, we advance a heuristic formula designed to serve as a proxy for its P-values. As a complementary theoretical contribution, we derive a previously unreported relationship connecting the inverse document frequency and the inverse collection frequency (two foundational quantities in text analysis) under the multinomial language model. This relationship is used in the evaluation of our heuristic. Using the GENIA Term corpus benchmark, we compare our approach against established methods, demonstrating our heuristic's potential for identifying domain-specific technical terms. We hope this demonstration of statistical significance testing in text analysis serves as a springboard for future research.
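To make the quantities named in the abstract concrete, the following is a minimal sketch, not the exact test or the heuristic proposed in the paper. It computes a term's document frequency, collection frequency, and the corresponding inverse frequencies on a toy corpus, then compares the observed document frequency with the value expected if every token were drawn independently from a multinomial language model. The function name `term_stats`, the natural-log definitions of IDF and ICF, and the use of the term's relative collection frequency as its token probability are all assumptions made for illustration.

```python
import math
from collections import Counter


def term_stats(docs, term):
    """Summarize how concentrated `term` is across tokenized `docs`.

    Illustrative assumptions (not taken from the paper):
      * IDF = ln(D / df) and ICF = ln(N / cf), where D is the number of
        documents and N is the total number of tokens;
      * under a multinomial language model each token equals `term`
        independently with probability p = cf / N.
    """
    n_docs = len(docs)
    lengths = [len(doc) for doc in docs]
    total_tokens = sum(lengths)

    per_doc_counts = [Counter(doc)[term] for doc in docs]
    df = sum(1 for count in per_doc_counts if count > 0)  # document frequency
    cf = sum(per_doc_counts)                              # collection frequency
    if cf == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")

    p = cf / total_tokens
    # A document of length n contains the term at least once with
    # probability 1 - (1 - p)^n; summing over documents gives the
    # document frequency expected under the model.
    expected_df = sum(1.0 - (1.0 - p) ** n for n in lengths)

    return {
        "df": df,
        "cf": cf,
        "idf": math.log(n_docs / df),
        "icf": math.log(total_tokens / cf),
        "expected_df": expected_df,
    }


# Toy corpus: "a" is concentrated in one document, "b" is spread out.
toy_docs = [
    ["a", "a", "a", "b"],
    ["b", "c", "d", "e"],
    ["b", "c", "d", "e"],
    ["c", "d", "e", "f"],
]

bursty = term_stats(toy_docs, "a")
spread = term_stats(toy_docs, "b")
```

Both terms occur three times in the toy corpus, so the model assigns them the same expected document frequency. The term `"a"` appears in fewer documents than expected, which is the informal signature of burstiness, while `"b"` does not. Turning that gap into a P-value is the part that requires a significance test such as the one the abstract describes.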