Utilizing Large Language Models to Identify Reddit Users Considering Vaping Cessation for Digital Interventions

Sai Krishna Revanth Vuruma, Dezhi Wu, Saborny Sen Gupta, Lucas Aust, Valerie Lookingbill, Caleb Henry, Yang Ren, Erin Kasson, Li-Shiun Chen, Patricia Cavazos-Rehg, Dian Hu, Ming Huang

The widespread global adoption of social media platforms has not only enhanced users' connectivity and communication but also made these platforms a vital channel for disseminating health-related information, establishing social media data as an invaluable organic data resource for public health research. The surge in popularity of vaping, or e-cigarette use, in the United States and other countries was accompanied by an outbreak of e-cigarette or vaping use-associated lung injury (EVALI) that led to hospitalizations and fatalities in 2019, highlighting the urgency of understanding vaping behaviors and developing effective cessation strategies. In this study, we extracted a sample dataset from one vaping sub-community on Reddit to analyze users' quit-vaping intentions. We applied large language models, including GPT-4 as well as traditional BERT-based language models, to a sentence-level quit-vaping intention prediction task and compared their outputs against human annotations. Notably, compared with human evaluators, the GPT-4 model adhered more consistently to the annotation guidelines and processes and detected nuanced quit-vaping intentions that human evaluators might overlook. These preliminary findings point to the potential of GPT-4 to improve the accuracy and reliability of social media data analysis, especially in identifying subtle user intentions that may elude human detection.
