An initial procedure in text-as-data applications is text preprocessing. One typical step, which can substantially facilitate computation, is removing infrequent words that are believed to provide limited information about the corpus. Despite the popularity of vocabulary pruning, the literature offers few guidelines on how to implement it. This paper aims to fill this gap by examining the effects of removing infrequent words on the quality of topics estimated using Latent Dirichlet Allocation. The analysis is based on Monte Carlo experiments that consider different criteria for removing infrequent terms and various evaluation metrics. The results indicate that pruning is beneficial and that a considerable share of the vocabulary can be eliminated.
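To make the notion of vocabulary pruning concrete, the following minimal sketch removes terms that appear in fewer than a given number of documents and reports the share of the vocabulary removed. The document-frequency criterion, the function name `prune_vocabulary`, the parameter `min_doc_freq`, and the toy corpus are all assumptions made for illustration; they are not necessarily the criteria or data examined in the paper.

```python
from collections import Counter


def prune_vocabulary(docs, min_doc_freq=2):
    """Drop terms that occur in fewer than `min_doc_freq` documents.

    `docs` is a list of tokenized documents (lists of strings).
    The threshold is a hypothetical pruning criterion chosen for illustration.
    """
    # Document frequency: count each term at most once per document.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    kept = {term for term, df in doc_freq.items() if df >= min_doc_freq}
    # Rebuild every document using only the retained terms.
    pruned_docs = [[term for term in doc if term in kept] for doc in docs]
    return pruned_docs, kept


# Toy corpus (illustrative only).
docs = [
    ["topic", "model", "corpus"],
    ["topic", "model", "dirichlet"],
    ["topic", "corpus", "pruning"],
]

pruned_docs, kept = prune_vocabulary(docs, min_doc_freq=2)

# Share of the original vocabulary that was removed by pruning.
full_vocab = {term for doc in docs for term in doc}
removed_share = 1 - len(kept) / len(full_vocab)
```

In a setup like this, the pruned documents would then be passed to a topic model such as Latent Dirichlet Allocation, and topic quality could be compared across different values of the threshold.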