Health-related discussions on social media platforms like Reddit offer valuable insights, but extracting quantitative data from unstructured text is challenging. In this work, we adapt the QuaLLM framework into QuaLLM-Health, a pipeline for extracting clinically relevant quantitative data from Reddit discussions about glucagon-like peptide-1 (GLP-1) receptor agonists using large language models (LLMs). We collected 410,000 posts and comments from five GLP-1-related communities using the Reddit API in July 2024. After filtering for cancer-related discussions, 2,059 unique entries remained. We developed annotation guidelines to manually extract variables such as cancer survivorship, family cancer history, cancer types mentioned, risk perceptions, and discussions with physicians. Two domain experts independently annotated a random sample of 100 entries to create a gold-standard dataset. We then employed iterative prompt engineering with OpenAI's GPT-4o-mini on the gold-standard dataset to build an optimized pipeline for extracting variables from the full dataset. The optimized LLM achieved accuracy above 0.85 for every variable, with macro-averaged precision, recall, and F1 scores all above 0.90, indicating balanced performance. Stability testing showed a 95% match rate across runs, confirming consistency. Applying the framework to the full dataset enabled efficient extraction of the variables needed for downstream analysis, costing under $3 and completing in approximately one hour. QuaLLM-Health demonstrates that LLMs can effectively and efficiently extract clinically relevant quantitative data from unstructured social media content, and that incorporating human expertise and iterative prompt refinement ensures accuracy and reliability. This methodology can be adapted for large-scale analysis of patient-generated data across various health domains, yielding valuable insights for healthcare research.
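The evaluation described above can be made concrete with a short sketch. This is not the authors' code; it is a minimal pure-Python illustration of the two reported measures: macro-averaged precision/recall/F1 of LLM extractions against the gold-standard annotations, and the cross-run match rate used for stability testing. Function names and the example labels are hypothetical.

```python
def macro_prf1(gold, pred):
    """Macro-averaged precision, recall, and F1 over the label set
    (e.g. gold-standard vs. LLM-extracted values for one variable)."""
    labels = sorted(set(gold) | set(pred))
    ps, rs, fs = [], [], []
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        ps.append(prec); rs.append(rec); fs.append(f1)
    n = len(labels)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n

def match_rate(run_a, run_b):
    """Fraction of entries whose extracted value is identical across
    two independent LLM runs (the stability-testing measure)."""
    return sum(1 for a, b in zip(run_a, run_b) if a == b) / len(run_a)
```

A variable passes the reported thresholds when its macro-averaged scores exceed 0.90 and two runs agree on at least 95% of entries; in practice such checks would be repeated per extracted variable over the 100-entry gold standard.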