Homelessness is a persistent social challenge affecting millions worldwide; over 876,000 people experienced homelessness in the U.S. in 2025. Social bias is a significant barrier to alleviating it, shaping public perception and influencing policymaking. Because online textual media and offline city council discourse both reflect and shape public opinion, they offer valuable signals for identifying and tracking social biases against people experiencing homelessness (PEH). We present a new, manually annotated multi-domain dataset compiled from Reddit, X (formerly Twitter), news articles, and city council meeting minutes across ten U.S. cities. Our 16-category multi-label taxonomy creates a challenging long-tail classification problem: some categories appear in fewer than 1% of samples, while others exceed 70%. We find that small human-annotated datasets (1,702 samples) are insufficient for training effective classifiers, whether used to fine-tune encoder models or as few-shot examples for LLMs. To address this, we use GPT-4.1 to generate pseudo-labels on a larger unlabeled corpus. Training on this expanded dataset enables even small encoder models (ModernBERT, 150M parameters) to achieve a macro-F1 of 35.23, approaching GPT-4.1's 41.57. This demonstrates that \textbf{data quantity matters more than model size}, enabling low-cost, privacy-preserving deployment without relying on commercial APIs. Our results reveal that negative bias against PEH is prevalent both offline and online (especially on Reddit), with ``not in my backyard'' narratives showing the highest engagement. These findings uncover a type of ostracism that directly impacts poverty-reduction policymaking and provide actionable insights for practitioners addressing homelessness.
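To make the pipeline concrete, below is a minimal Python sketch of the two-stage approach the abstract describes: GPT-4.1 pseudo-labels an unlabeled corpus, then a small encoder (ModernBERT) is fine-tuned on the expanded data for 16-way multi-label classification, evaluated with macro-F1. The prompt wording, placeholder category names, hyperparameters, and the \texttt{answerdotai/ModernBERT-base} model ID are illustrative assumptions, not the paper's exact configuration.

\begin{verbatim}
import json
import torch
from openai import OpenAI
from sklearn.metrics import f1_score
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder taxonomy labels; the paper's actual 16 categories differ.
CATEGORIES = [f"category_{i}" for i in range(16)]

def pseudo_label(texts: list[str]) -> list[list[int]]:
    """Ask GPT-4.1 for multi-label pseudo-annotations (assumed prompt format)."""
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    labels = []
    for text in texts:
        resp = client.chat.completions.create(
            model="gpt-4.1",
            messages=[{
                "role": "user",
                "content": (
                    "Which of these bias categories apply to the text? "
                    f"Categories: {CATEGORIES}. "
                    f"Reply with a JSON list of category names.\n\n{text}"
                ),
            }],
        )
        chosen = json.loads(resp.choices[0].message.content)  # sketch: no retry/parse guard
        labels.append([1 if c in chosen else 0 for c in CATEGORIES])
    return labels

def finetune(texts, labels, model_id="answerdotai/ModernBERT-base", epochs=3):
    """Fine-tune a small encoder on pseudo-labeled data (per-label BCE loss)."""
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        num_labels=len(CATEGORIES),
        problem_type="multi_label_classification",  # sigmoid + BCEWithLogitsLoss
    )
    opt = torch.optim.AdamW(model.parameters(), lr=2e-5)
    enc = tok(texts, truncation=True, padding=True, return_tensors="pt")
    y = torch.tensor(labels, dtype=torch.float)  # multi-hot targets
    model.train()
    for _ in range(epochs):  # full-batch updates; a real run would mini-batch
        out = model(**enc, labels=y)
        out.loss.backward()
        opt.step()
        opt.zero_grad()
    return tok, model

def macro_f1(model, tok, texts, gold):
    """Macro-F1 averages per-category F1, so rare long-tail labels count equally."""
    model.eval()
    with torch.no_grad():
        enc = tok(texts, truncation=True, padding=True, return_tensors="pt")
        logits = model(**enc).logits
    preds = (torch.sigmoid(logits) > 0.5).int().numpy()
    return f1_score(gold, preds, average="macro", zero_division=0)
\end{verbatim}

Macro-F1 is the natural metric here: because it averages F1 over all 16 categories rather than over samples, a classifier cannot score well by only predicting the head categories that appear in over 70% of the data.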