Social media classification tasks (e.g., tweet sentiment analysis and tweet stance detection) are challenging because posts are typically short, informal, and ambiguous. Training on tweets therefore demands large-scale human-annotated labels, which are time-consuming and costly to obtain. In this paper, we find that adding hashtags to tweets helps alleviate this issue, because hashtags enrich short and ambiguous tweets with information such as topic, sentiment, and stance. This motivates us to propose a novel Hashtag-guided Tweet Classification model (HashTation), which automatically generates meaningful hashtags for the input tweet to provide useful auxiliary signals for tweet classification. To generate high-quality and insightful hashtags, our hashtag generation model retrieves and encodes post-level and entity-level information from across the whole corpus. Experiments show that HashTation achieves significant improvements on seven low-resource tweet classification tasks, in which only a limited amount of training data is provided, showing that automatically enriching tweets with model-generated hashtags can significantly reduce the demand for large-scale human-labeled data. Further analysis demonstrates that HashTation generates high-quality hashtags that are consistent with the tweets and their labels. The code is available at https://github.com/shizhediao/HashTation.
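The enrichment step described above, attaching generated hashtags to a tweet before it reaches the classifier, can be illustrated with a minimal sketch. This is not the HashTation implementation: `enrich_tweet` and `toy_hashtag_generator` are hypothetical names, the keyword lookup merely stands in for the learned hashtag generation model, and appending hashtags to the end of the text is an assumption about how the auxiliary signal is supplied.

```python
from typing import Dict, List


def enrich_tweet(tweet: str, hashtags: List[str]) -> str:
    """Append hashtags to a tweet, skipping any the tweet already contains.

    Assumption: hashtags are attached as plain text at the end of the tweet.
    """
    # Hashtags already present in the tweet, compared case-insensitively.
    seen = {tok.lower() for tok in tweet.split() if tok.startswith("#")}
    extra = []
    for tag in hashtags:
        tag = tag if tag.startswith("#") else "#" + tag
        if tag.lower() not in seen:
            seen.add(tag.lower())
            extra.append(tag)
    return " ".join([tweet] + extra)


# Hypothetical keyword table; a real system would use a trained generator.
TOY_KEYWORDS: Dict[str, str] = {
    "vaccine": "#health",
    "election": "#politics",
    "love": "#positive",
}


def toy_hashtag_generator(tweet: str) -> List[str]:
    """Stand-in for a hashtag generation model (illustrative only)."""
    lowered = tweet.lower()
    return [tag for word, tag in TOY_KEYWORDS.items() if word in lowered]


tweet = "I love how fast the vaccine rollout went"
enriched = enrich_tweet(tweet, toy_hashtag_generator(tweet))
print(enriched)
```

The enriched string would then be passed to any tweet classifier in place of the raw tweet; the classifier itself is omitted here because the abstract does not specify it.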