An emerging trend on social media platforms is their use as safe spaces for peer support. Particularly in healthcare, where many medical conditions carry harsh stigmas, social media has become a stigma-free way to engage in dialogues regarding symptoms, treatments, and personal experiences. Many existing works have employed NLP algorithms to facilitate quantitative analysis of health trends. Notably absent from existing works are keyphrase extraction (KE) models for social health posts, a task crucial to discovering emerging public health trends. This paper presents a novel, theme-driven KE dataset, SuboxoPhrase, and a qualitative annotation scheme with the overarching goal of extracting targeted, clinically relevant keyphrases. To the best of our knowledge, this is the first study to design a KE schema for social media healthcare texts. To demonstrate the value of this approach, this study analyzes Reddit posts regarding medications for opioid use disorder, a paramount health concern worldwide. Additionally, we benchmark ten off-the-shelf KE models on our new dataset, demonstrating the unique extraction challenges in modeling user-generated health texts. The proposed theme-driven KE approach lays the foundation for future work on efficient, large-scale analysis of social health texts, allowing researchers to surface useful public health trends, patterns, and knowledge gaps.