Homelessness is a persistent social challenge affecting millions worldwide; over 876,000 people experienced homelessness in the U.S. in 2025. Social bias is a significant barrier to alleviating it, shaping public perception and influencing policymaking. Because online textual media and offline city council discourse both reflect and shape public opinion, they offer valuable signals for identifying and tracking social biases against people experiencing homelessness (PEH). We present a new, manually annotated multi-domain dataset compiled from Reddit, X (formerly Twitter), news articles, and city council meeting minutes across ten U.S. cities. Our 16-category multi-label taxonomy creates a challenging long-tail classification problem: some categories appear in fewer than 1% of samples, while others exceed 70%. We find that small human-annotated datasets (1,702 samples) are insufficient for training effective classifiers, whether used to fine-tune encoder models or as few-shot examples for LLMs. To address this, we use GPT-4.1 to generate pseudo-labels on a larger unlabeled corpus. Training on this expanded dataset enables even small encoder models (ModernBERT, 150M parameters) to achieve a macro-F1 of 35.23, approaching GPT-4.1's 41.57. This demonstrates that \textbf{data quantity matters more than model size}, enabling low-cost, privacy-preserving deployment without relying on commercial APIs. Our results reveal that negative bias against PEH is prevalent both offline and online (especially on Reddit), with ``not in my backyard'' narratives showing the highest engagement. These findings uncover a type of ostracism that directly impacts poverty-reduction policymaking and provide actionable insights for practitioners addressing homelessness.
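To make the pipeline concrete, below is a minimal Python sketch of the two-stage approach the abstract describes: GPT-4.1 pseudo-labels an unlabeled corpus, then a small encoder (ModernBERT) is fine-tuned on the expanded data for 16-way multi-label classification, evaluated with macro-F1. The prompt wording, placeholder category names, hyperparameters, and the \texttt{answerdotai/ModernBERT-base} model ID are illustrative assumptions, not the paper's exact configuration.

\begin{verbatim}
import json
import torch
from openai import OpenAI
from sklearn.metrics import f1_score
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder taxonomy labels; the paper's actual 16 categories differ.
CATEGORIES = [f"category_{i}" for i in range(16)]

def pseudo_label(texts: list[str]) -> list[list[int]]:
    """Ask GPT-4.1 for multi-label pseudo-annotations (assumed prompt format)."""
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    labels = []
    for text in texts:
        resp = client.chat.completions.create(
            model="gpt-4.1",
            messages=[{
                "role": "user",
                "content": (
                    "Which of these bias categories apply to the text? "
                    f"Categories: {CATEGORIES}. "
                    f"Reply with a JSON list of category names.\n\n{text}"
                ),
            }],
        )
        chosen = json.loads(resp.choices[0].message.content)  # sketch: no retry/parse guard
        labels.append([1 if c in chosen else 0 for c in CATEGORIES])
    return labels

def finetune(texts, labels, model_id="answerdotai/ModernBERT-base", epochs=3):
    """Fine-tune a small encoder on pseudo-labeled data (per-label BCE loss)."""
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        num_labels=len(CATEGORIES),
        problem_type="multi_label_classification",  # sigmoid + BCEWithLogitsLoss
    )
    opt = torch.optim.AdamW(model.parameters(), lr=2e-5)
    enc = tok(texts, truncation=True, padding=True, return_tensors="pt")
    y = torch.tensor(labels, dtype=torch.float)  # multi-hot targets
    model.train()
    for _ in range(epochs):  # full-batch updates; a real run would mini-batch
        out = model(**enc, labels=y)
        out.loss.backward()
        opt.step()
        opt.zero_grad()
    return tok, model

def macro_f1(model, tok, texts, gold):
    """Macro-F1 averages per-category F1, so rare long-tail labels count equally."""
    model.eval()
    with torch.no_grad():
        enc = tok(texts, truncation=True, padding=True, return_tensors="pt")
        logits = model(**enc).logits
    preds = (torch.sigmoid(logits) > 0.5).int().numpy()
    return f1_score(gold, preds, average="macro", zero_division=0)
\end{verbatim}

Macro-F1 is the natural metric here: because it averages F1 over all 16 categories rather than over samples, a classifier cannot score well by only predicting the head categories that appear in over 70% of the data.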