Much of the research in social computing analyzes data from social media platforms, which may inherently carry biases. An overlooked source of such bias is the over-representation of WEIRD (Western, Educated, Industrialized, Rich, and Democratic) populations, which might not accurately mirror global demographic diversity. We evaluated the dependence on WEIRD populations in research presented at the AAAI ICWSM conference, the only venue whose proceedings are fully dedicated to social computing research. We did so by analyzing 494 papers published from 2018 to 2022, including full research papers, dataset papers, and posters. After filtering out papers that analyze synthetic datasets or lack a clear country of origin, we were left with 420 papers. From these, 188 participants in a crowdsourcing study extracted the data needed to compute WEIRD scores, and all extracted data was manually validated. We then used this data to adapt existing WEIRD metrics so that they apply to social media data. We found that 37% of these papers focused solely on data from Western countries. This percentage is significantly lower than the percentages observed in research from the CHI (76%) and FAccT (84%) conferences, suggesting a greater diversity of dataset origins within ICWSM. However, the studies at ICWSM still predominantly examine populations from countries that are more Educated, Industrialized, and Rich than those examined in FAccT studies. The 'Democratic' variable, which reflects political freedoms and rights, deserves a special note: it highlights the utility of social media data in shedding light on countries with restricted political freedoms. Based on these insights, we recommend extending current "paper checklists" to include considerations about WEIRD bias, and we call on the community to broaden research inclusivity by encouraging the use of diverse datasets from underrepresented regions.