Streamlining Social Media Information Retrieval for COVID-19 Research with Deep Learning

Objective: Social media-based public health research is crucial for epidemic surveillance, but most studies identify relevant corpora with keyword matching. This study develops a system to streamline the curation of colloquial medical dictionaries. As a proof of concept, we demonstrate the pipeline by curating a Unified Medical Language System (UMLS)-colloquial symptom dictionary from COVID-19-related tweets. Methods: We used COVID-19-related tweets posted from February 1, 2020, to April 30, 2022. The pipeline comprises three modules: a named entity recognition module that detects symptoms in tweets; an entity normalization module that aggregates the detected entities; and a mapping module that iteratively maps entities to UMLS concepts. A random sample of 500 entities was drawn from the final dictionary for accuracy validation. Additionally, we conducted a symptom frequency distribution analysis to compare our dictionary with a pre-defined lexicon from previous research. Results: We identified 498,480 unique symptom entity expressions in the tweets. Pre-processing reduced this number to 18,226. The final dictionary contains 38,175 unique symptom expressions that can be mapped to 966 UMLS concepts (accuracy = 95%). The symptom distribution analysis found that our dictionary detects more symptoms and is effective at identifying psychiatric disorders such as anxiety and depression, which pre-defined lexicons often miss. Conclusions: This study advances public health research by implementing a novel, systematic pipeline for curating symptom lexicons from social media data. The final lexicon's high accuracy, validated by medical professionals, underscores the potential of this methodology to reliably convert vast amounts of unstructured social media data into actionable medical insights across diverse linguistic and regional landscapes.
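The normalization and iterative mapping modules described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the named entity recognition module has already produced raw entity strings, it substitutes plain string similarity (Python's difflib) for whatever matching the actual mapping module uses, and the seed lexicon, concept labels, function names, and the 0.8 threshold are all hypothetical placeholders rather than real UMLS identifiers or published settings.

```python
import re
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical seed lexicon: expression -> placeholder concept label
# (these labels are illustrative, not real UMLS concept identifiers).
SEED = {
    "headache": "CONCEPT_HEADACHE",
    "cough": "CONCEPT_COUGH",
    "anxiety": "CONCEPT_ANXIETY",
}


def normalize(entity):
    """Lowercase, replace punctuation with spaces, squeeze letters repeated
    three or more times, and collapse whitespace."""
    text = entity.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    text = re.sub(r"([a-z])\1{2,}", r"\1", text)  # "coughhh" -> "cough"
    return re.sub(r"\s+", " ", text).strip()


def aggregate(raw_entities):
    """Count how often each normalized expression occurs."""
    return Counter(normalize(entity) for entity in raw_entities)


def similarity(a, b):
    """String similarity in [0, 1]; an assumed stand-in for the real matcher."""
    return SequenceMatcher(None, a, b).ratio()


def map_iteratively(expressions, seed, threshold=0.8, max_rounds=5):
    """Grow the dictionary round by round: an unmapped expression inherits the
    concept of its most similar already-mapped expression if the similarity
    reaches the threshold. Expressions mapped in one round become anchors
    for the next round."""
    mapped = dict(seed)
    pending = set(expressions) - set(mapped)
    for _ in range(max_rounds):
        newly_mapped = {}
        for expr in pending:
            best = max(mapped, key=lambda known: similarity(expr, known))
            if similarity(expr, best) >= threshold:
                newly_mapped[expr] = mapped[best]
        if not newly_mapped:
            break  # nothing changed, so further rounds cannot help
        mapped.update(newly_mapped)
        pending -= set(newly_mapped)
    return mapped, pending


raw = ["Headache", "headaches!!", "bad headaches", "COUGHHH", "coughs", "brain fog"]
counts = aggregate(raw)
dictionary, unmapped = map_iteratively(counts, SEED)
print(dictionary)
print(unmapped)
```

In this toy run, "bad headaches" is too dissimilar to the seed term "headache" to be mapped in the first round, but it is picked up in the second round once "headaches" has joined the dictionary, which is the point of iterating. Expressions with no close neighbor, such as "brain fog", stay unmapped and would need manual review or a better matcher.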
