Transit Pulse: Utilizing Social Media as a Source for Customer Feedback and Information Extraction with Large Language Model

Transit riders flood social networks daily with messages that contain insights valuable for improving service quality. These posts help transit agencies quickly identify emerging issues. Parsing the topics and sentiments they express is key to building a comprehensive picture of service performance. However, the volume of messages makes manual analysis impractical, and standard NLP techniques such as Term Frequency-Inverse Document Frequency (TF-IDF) fall short in nuanced interpretation. Traditional sentiment analysis extracts topics and sentiments separately before integrating them, and so often misses the interaction between the two. This incremental approach complicates classification and reduces analytical productivity. To address these challenges, we propose a novel approach to extracting and analyzing transit-related information from social media, including sentiment and sarcasm detection, identification of unusual system problems, and location data. Our method employs a Large Language Model (LLM), specifically Llama 3, for a streamlined analysis that does not depend on pre-established topic labels. To enhance the model's domain-specific knowledge, we use Retrieval-Augmented Generation (RAG), integrating external knowledge sources into the information extraction pipeline. We validated our method through extensive experiments comparing its performance with traditional NLP approaches on user tweet data from a real-world transit system. Our results demonstrate the potential of LLMs to transform social media data analysis in the public transit domain, providing actionable insights and enhancing transit agencies' responsiveness by extracting a broader range of information.
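As a rough illustration of how retrieval-augmented extraction of this kind can be wired together, the following Python sketch retrieves domain snippets for a tweet, builds a prompt, and parses a structured reply. It is a minimal sketch under stated assumptions, not the pipeline used in this work: the knowledge snippets, the output field names, the word-overlap retriever (standing in for embedding search), and the `fake_llm` stub (standing in for a Llama 3 call) are all hypothetical.

```python
import json
import re
from typing import Callable, Dict, List, Set

# Hypothetical domain snippets; a real system would index agency documents.
KNOWLEDGE_BASE: List[str] = [
    "Route 5 serves Main Street Station and City Hall.",
    "The Green Line runs between Airport Terminal and Union Depot.",
    "Elevator outages are reported under the accessibility category.",
]


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, docs: List[str], k: int = 2) -> List[str]:
    """Return the k docs sharing the most tokens with the query.

    Word overlap is a simple stand-in for embedding-based retrieval.
    """
    query_tokens = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(query_tokens & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(tweet: str, context: List[str]) -> str:
    """Combine retrieved context and the tweet into one extraction prompt."""
    context_block = "\n".join(f"- {c}" for c in context)
    return (
        "Use the transit context to analyze the tweet.\n"
        f"Context:\n{context_block}\n"
        f"Tweet: {tweet}\n"
        "Reply with JSON containing the keys: sentiment, sarcasm, issue, location."
    )


def extract(tweet: str, llm: Callable[[str], str]) -> Dict[str, object]:
    """Retrieve context, query the model, and parse its JSON reply."""
    context = retrieve(tweet, KNOWLEDGE_BASE)
    return json.loads(llm(build_prompt(tweet, context)))


def fake_llm(prompt: str) -> str:
    """Stub that returns a fixed reply; replace with a real model call."""
    return json.dumps(
        {
            "sentiment": "negative",
            "sarcasm": True,
            "issue": "elevator outage",
            "location": "Union Depot",
        }
    )


if __name__ == "__main__":
    tweet = "Great, the Green Line elevator is broken again at Union Depot"
    print(extract(tweet, fake_llm))
```

Passing the model as a plain callable keeps the retrieval and prompt-building steps testable without loading a model; any function that maps a prompt string to a JSON string can be substituted for the stub.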
