As social media grows in popularity, a rising volume of public health-related activity appears on it, which is valuable for pandemic monitoring and government decision-making. Current techniques for public health analysis rely on popular models such as BERT and large language models (LLMs). Although LLMs fine-tuned on domain-specific datasets have shown a strong ability to comprehend knowledge, training an in-domain LLM for every public health task is prohibitively expensive. Furthermore, in-domain datasets from social media are generally highly imbalanced, which hinders the efficiency of LLM tuning. To tackle these challenges, the data imbalance issue can be addressed with sophisticated data augmentation methods for social media datasets, and the ability of LLMs can be utilised effectively by prompting the model properly. In light of the above, a novel ALEX framework is proposed in this paper for social media analysis on public health. Specifically, an augmentation pipeline is developed to resolve the data imbalance issue, and an LLM explanation mechanism is proposed that prompts an LLM with the predicted results from BERT models. Extensive experiments on three tasks of the Social Media Mining for Health 2023 (SMM4H) competition, with first place in two of them, demonstrate the superior performance of the proposed ALEX method. Our code has been released at https://github.com/YanJiangJerry/ALEX.
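To make the explanation mechanism concrete, the following is a minimal sketch of how a BERT prediction might be wrapped into a prompt for an LLM and how the reply might be read back. The abstract does not give the actual prompt template, so the wording of the prompt, the `Verdict:` convention, and the function names `build_explanation_prompt` and `parse_verdict` are all assumptions made for illustration; the released repository should be consulted for the real implementation.

```python
from typing import Optional


def build_explanation_prompt(post: str, predicted_label: str, task: str) -> str:
    """Assemble a prompt asking an LLM to explain and check a classifier's label.

    The template is hypothetical; it is not taken from the ALEX paper.
    """
    return (
        f"Task: {task}\n"
        f"Post: {post}\n"
        f"Classifier prediction: {predicted_label}\n"
        "Explain whether this prediction is correct for the post, then answer "
        "on the final line with 'Verdict: True' or 'Verdict: False'."
    )


def parse_verdict(llm_reply: str) -> Optional[bool]:
    """Read the last 'Verdict:' line of a reply; return None if none is usable."""
    for line in reversed(llm_reply.strip().splitlines()):
        text = line.strip().lower()
        if text.startswith("verdict:"):
            answer = text[len("verdict:"):].strip()
            if answer.startswith("true"):
                return True
            if answer.startswith("false"):
                return False
    return None
```

In such a setup, the string returned by `build_explanation_prompt` would be sent to whichever LLM is available, and a `False` or `None` verdict could flag the BERT prediction for review. How ALEX itself acts on the LLM's explanation is not specified in this abstract.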