Natural language understanding (NLU) is integral to various social media applications. However, existing NLU models rely heavily on context for semantic learning, so their performance degrades on short and noisy social media content. To address this issue, we leverage in-context learning (ICL), wherein language models learn to make inferences by conditioning on a handful of demonstrations that enrich the context, and propose a novel hashtag-driven in-context learning (HICL) framework. Concretely, we pre-train a model, #Encoder, which employs #hashtags (user-annotated topic labels) to drive BERT-based pre-training through contrastive learning. Our objective is to enable #Encoder to incorporate topic-related semantic information, which allows it to retrieve topic-related posts that enrich the context and enhance social media NLU with noisy contexts. To further integrate the retrieved context with the source text, we employ a gradient-based method to identify trigger terms useful in fusing information from both sources. For empirical studies, we collected 45M tweets to set up an in-context NLU benchmark, and the experimental results on seven downstream tasks show that HICL substantially advances the previous state-of-the-art results. Furthermore, we conducted extensive analyses and found that: (1) combining source input with a top-retrieved post from #Encoder is more effective than using semantically similar posts; (2) trigger terms substantially help in merging context from the source and retrieved posts.
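To make the hashtag-driven contrastive objective and the retrieval step more concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the loss, so a supervised-contrastive (InfoNCE-style) form is assumed here; the function names, the temperature value, and the toy 2-D vectors standing in for BERT embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def hashtag_contrastive_loss(embeddings, hashtags, temperature=0.1):
    """Assumed loss form: posts sharing a hashtag are positives, all others negatives.

    For each anchor post, the loss is the mean negative log-probability of its
    positives under a softmax over temperature-scaled cosine similarities.
    Anchors whose hashtag appears only once are skipped.
    """
    n = len(embeddings)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and hashtags[j] == hashtags[i]]
        if not positives:
            continue
        logits = {
            j: cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(n)
            if j != i
        }
        log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
        total += -sum(logits[j] - log_denominator for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


def retrieve_top1(query_embedding, corpus_embeddings):
    """Return the index of the corpus post most similar to the query."""
    return max(
        range(len(corpus_embeddings)),
        key=lambda j: cosine(query_embedding, corpus_embeddings[j]),
    )


# Toy stand-ins for encoder outputs: two topic clusters in 2-D.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
matching_tags = ["#nba", "#nba", "#oscars", "#oscars"]
mismatched_tags = ["#nba", "#oscars", "#nba", "#oscars"]

loss_matching = hashtag_contrastive_loss(toy_embeddings, matching_tags)
loss_mismatched = hashtag_contrastive_loss(toy_embeddings, mismatched_tags)
best_index = retrieve_top1([0.0, 1.0], toy_embeddings)
```

In this toy setup the loss is lower when posts with the same hashtag already lie close together, which is the behavior a contrastive pre-training objective would push an encoder toward. The retrieved post could then be concatenated with the source post as extra context; how HICL performs that fusion with trigger terms is not covered by this sketch.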