Text analysis is an interesting research area in data science, with applications in artificial intelligence, biomedical research, and engineering, among other fields. We review popular methods for text analysis, ranging from topic modeling to recent neural language models. In particular, we review Topic-SCORE, a statistical approach to topic modeling, and discuss how to use it to analyze MADStat, a dataset on statistical publications that we collected and cleaned. Applying Topic-SCORE and other methods to MADStat leads to several findings. For example, we identify $11$ representative topics in statistics. For each journal, the evolution of topic weights over time can be visualized, and we use these results to analyze trends in statistical research. We also propose a new statistical model for ranking the citation impacts of the $11$ topics, and we build a cross-topic citation graph to illustrate how research results on different topics spread to one another. The results on MADStat provide a data-driven picture of statistical research in $1975$--$2015$, from a text analysis perspective.