Large language models (LLMs) have demonstrated exceptional proficiency in a wide range of NLP tasks. However, a notable gap remains in practical data analysis scenarios, particularly when LLMs must process long sequences of unstructured documents, such as news feeds or, as specifically addressed in this paper, social media posts. To empirically assess the effectiveness of LLMs in this setting, we introduce a question-based evaluation framework comprising 470 manually curated questions designed to evaluate LLMs' semantic understanding and reasoning abilities over aggregated text data. We apply our benchmark to diverse Twitter datasets covering various NLP tasks, including sentiment analysis, hate speech detection, and emotion recognition. Our results reveal that performance depends heavily on input scale and on the complexity of the data source, declining noticeably in multi-label or target-dependent scenarios. In addition, performance drops progressively as task complexity increases, from basic semantic existence identification to more demanding operations such as comparison, counting, and calculation. Furthermore, once the input size grows beyond 500 instances, we identify a common limitation across LLMs, particularly open-weight models: performance degrades substantially, especially on numerical tasks. These findings highlight critical architectural bottlenecks in current LLMs for performing rigorous quantitative analysis over large text collections.