Streamlining Social Media Information Extraction for Public Health Research with Deep Learning

Objective: Social media-based public health research is crucial for epidemic surveillance, yet most studies identify relevant corpora by keyword matching. This study develops a system to streamline the curation of colloquial medical dictionaries. As a proof of concept, we demonstrate the pipeline by curating a Unified Medical Language System (UMLS)-colloquial symptom dictionary from COVID-19-related tweets. Methods: COVID-19-related tweets posted from February 1, 2020, to April 30, 2022, were used. The pipeline comprises three modules: a named entity recognition module that detects symptoms in tweets; an entity normalization module that aggregates the detected entities; and a mapping module that iteratively maps entities to UMLS concepts. A random sample of 500 entities was drawn from the final dictionary for accuracy validation. Additionally, we conducted a symptom frequency distribution analysis to compare our dictionary with a pre-defined lexicon from previous research. Results: We identified 498,480 unique symptom entity expressions in the tweets. Pre-processing reduced this number to 18,226. The final dictionary contains 38,175 unique symptom expressions that can be mapped to 966 UMLS concepts (accuracy = 95%). The symptom distribution analysis found that our dictionary detects more symptoms and is effective at identifying psychiatric disorders such as anxiety and depression, which pre-defined lexicons often miss. Conclusion: This study advances public health research by implementing a novel, systematic pipeline for curating symptom lexicons from social media data. The final lexicon's high accuracy, validated by medical professionals, underscores the potential of this methodology to reliably interpret and categorize vast amounts of unstructured social media data into actionable medical insights across diverse linguistic and regional landscapes.
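The normalization and iterative mapping steps described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the deep learning NER module is assumed to have already produced the entity strings, the seed lexicon and its concept labels (`SEED`, `CONCEPT_COUGH`, and so on) are hypothetical placeholders rather than real UMLS identifiers, and string similarity from the standard-library `difflib` stands in for whatever matching method the actual mapping module uses. The function names `normalize` and `iterative_map` and the 0.8 threshold are likewise assumptions made for illustration.

```python
import re
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical seed lexicon: expression -> placeholder concept label.
# These labels are illustrative only and are NOT real UMLS identifiers.
SEED = {
    "cough": "CONCEPT_COUGH",
    "fever": "CONCEPT_FEVER",
    "anxiety": "CONCEPT_ANXIETY",
}


def normalize(entity):
    """Normalize one detected entity string (assumed rules, for illustration)."""
    text = entity.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # replace punctuation and symbols with spaces
    text = re.sub(r"(.)\1{2,}", r"\1", text)   # collapse runs of 3+ repeated characters
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace


def similarity(a, b):
    """Character-level similarity in [0, 1] from difflib."""
    return SequenceMatcher(None, a, b).ratio()


def iterative_map(entities, seed, threshold=0.8):
    """Map entities to concepts in rounds.

    In each round, an unmapped entity inherits the concept of its most
    similar already-mapped expression if the similarity reaches the
    threshold. Newly mapped expressions become anchors in the next round.
    The loop stops when a round adds nothing.
    """
    mapped = dict(seed)
    unmapped = set(entities) - set(mapped)
    changed = True
    while changed and unmapped:
        changed = False
        new = {}
        for ent in sorted(unmapped):
            best = max(mapped, key=lambda m: similarity(ent, m))
            if similarity(ent, best) >= threshold:
                new[ent] = mapped[best]
        if new:
            mapped.update(new)
            unmapped -= set(new)
            changed = True
    return mapped, unmapped


# Entity strings as a NER module might emit them (made-up examples).
raw_entities = ["Coughhh!!", "coughin", "coughing", "FEVER", "fever", "sore throat"]

# Normalization aggregates surface variants into shared forms with counts.
counts = Counter(normalize(e) for e in raw_entities)

mapped, unmapped = iterative_map(counts, SEED)
print(counts)
print(mapped)
print(unmapped)
```

In this toy run, "coughing" is not similar enough to the seed term "cough" to be mapped directly, but it is mapped in a later round once "coughin" has joined the dictionary, which is the point of iterating. The same property can cause drift toward unrelated concepts, so expressions that remain unmapped (here "sore throat") and a sample of mapped ones would still need manual review, in the spirit of the 500-entity validation described above.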
