Word clouds have become a standard tool for presenting the results of natural language processing methods such as topic modelling. They display the most important words, with word size often chosen proportional to the relevance of each word within a topic. In the latent Dirichlet allocation (LDA) model, a word cloud is a graphical presentation of a vector of weights for the words within a topic. These vectors are the result of a statistical procedure applied to a specific corpus. They are therefore subject to uncertainty from several sources, such as sample selection, random components in the optimization algorithm, or parameter settings. A novel approach to presenting word clouds that includes information on these types of uncertainty is introduced and illustrated with an application of the LDA model to conference abstracts.
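To make the idea of uncertain topic weights concrete, the following minimal sketch assumes that the same topic has been estimated several times (for example with different random seeds) and that each run yields one weight per word. It is an illustration only, not the approach introduced in the paper: the function names `summarize_topic_weights` and `font_sizes`, the use of the standard deviation as the uncertainty measure, and the example weights are all hypothetical.

```python
from statistics import mean, stdev


def summarize_topic_weights(runs, top_n=5):
    """Summarize one topic across repeated runs.

    runs: non-empty list of dicts mapping word -> weight within the topic,
    one dict per run (hypothetical input format).
    Returns (word, mean weight, standard deviation) tuples, largest mean first.
    """
    vocabulary = set().union(*runs)
    summary = []
    for word in vocabulary:
        # A word missing from a run is treated as having weight 0 in that run.
        weights = [run.get(word, 0.0) for run in runs]
        spread = stdev(weights) if len(weights) > 1 else 0.0
        summary.append((word, mean(weights), spread))
    # Sort by descending mean weight; ties are broken alphabetically.
    summary.sort(key=lambda item: (-item[1], item[0]))
    return summary[:top_n]


def font_sizes(summary, max_size=48.0):
    """Word size proportional to mean weight; the top word gets max_size."""
    top = max(m for _, m, _ in summary)
    return {word: max_size * m / top for word, m, _ in summary}


# Made-up weights for a single topic from three runs (assumption).
runs = [
    {"model": 0.30, "data": 0.20, "topic": 0.10},
    {"model": 0.26, "data": 0.24, "topic": 0.10},
    {"model": 0.34, "data": 0.16, "topic": 0.10},
]

summary = summarize_topic_weights(runs)
sizes = font_sizes(summary)
for word, avg, spread in summary:
    print(f"{word}: mean={avg:.3f} sd={spread:.3f} size={sizes[word]:.1f}")
```

In this sketch the mean weight drives the word size, as in a conventional word cloud, while the per-word spread is the extra quantity that an uncertainty-aware display would need to encode, for instance through colour or transparency.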