Theme-driven Keyphrase Extraction to Analyze Social Media Discourse

arXiv preprint; 11 pages, 2 figures; submitted to ICWSM. This version substantially expands and refocuses the previous manuscript, with new experiments, expanded data analysis, and comprehensive discussions.

Social media platforms are vital resources for sharing self-reported health experiences, offering rich data on a wide range of health topics. Although advances in Natural Language Processing (NLP) have enabled large-scale analysis of social media data, a gap remains in applying keyphrase extraction to health-related content. Keyphrase extraction identifies salient concepts in social media discourse without being constrained by predefined entity classes. This paper introduces a theme-driven keyphrase extraction framework tailored for social media, designed to capture clinically relevant keyphrases from user-generated health texts. Themes are defined as broad categories determined by the objectives of the extraction task. We formulate this novel task of theme-driven keyphrase extraction and demonstrate its potential for efficiently mining social media text, using treatment for opioid use disorder as the use case. Through qualitative and quantitative analysis, we demonstrate the feasibility of extracting actionable insights from social media data and of extracting keyphrases efficiently with minimally supervised NLP models. Our contributions include a novel data collection and curation framework for theme-driven keyphrase extraction and MOUD-Keyphrase, the first dataset of its kind, comprising human-annotated keyphrases from a Reddit community. We also identify the scope for minimally supervised NLP models to extract keyphrases from social media data efficiently. Lastly, we evaluate the efficacy of a large language model (ChatGPT) on this task and find that it outperforms unsupervised keyphrase extraction models.
