Topic Modelling (TM) is a research branch of natural language understanding (NLU) and natural language processing (NLP) that facilitates insightful analysis of large documents and datasets, such as summarising the main topics and tracking how topics change. This kind of discovery is becoming more popular in real-life applications because of its impact on big data analytics. In this study, situated in the social-media and healthcare domains, we apply popular Latent Dirichlet Allocation (LDA) methods to model topic changes in Swedish newspaper articles about Coronavirus. We describe the corpus we created, which comprises 6515 articles, the methods applied, and statistics on topic changes over a period of approximately one year and two months, from 17th January 2020 to 13th March 2021. We hope this work can serve as a grounding for applications of topic modelling and inspire similar case studies in an era of pandemics, supporting socio-economic impact research as well as clinical and healthcare analytics. Our data and source code are openly available at https://github.com/poethan/Swed_Covid_TM.

Keywords: Latent Dirichlet Allocation (LDA); Topic Modelling; Coronavirus; Pandemics; Natural Language Understanding; BERT-topic