For the Underrepresented in Gender Bias Research: Chinese Name Gender Prediction with Heterogeneous Graph Attention Network

Achieving gender equality is an important pillar of humankind's sustainable future. Pioneering data-driven gender bias research draws on large-scale public records such as scientific papers, patents, and company registrations, covering female researchers, inventors, entrepreneurs, and others. Because gender information is often missing from these datasets, studies rely on tools that infer gender from names. However, the available open-source Chinese gender-guessing tools are not yet suitable for scientific purposes, which may be partly responsible for Chinese women being underrepresented in mainstream gender bias research and may limit the universality of that research. Specifically, these tools focus on character-level information and overlook the fact that the combinations of Chinese characters in multi-character names, as well as the components and pronunciations of characters, carry important information. As a first effort, we design a Chinese Heterogeneous Graph Attention (CHGAT) model that captures the heterogeneity in component relationships and incorporates the pronunciations of characters. Our model substantially outperforms current tools and also outperforms the state-of-the-art algorithm. Finally, the most popular Chinese name-gender dataset is single-character based, has far lower female coverage, and comes from an unreliable source, which hinders relevant studies. We open-source a more balanced multi-character dataset from an official source together with our code, hoping to help future research that promotes gender equality.
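
To make the idea of a heterogeneous name graph more concrete, the following is a minimal sketch and not the authors' CHGAT implementation. It assumes three node types (characters, components, and pronunciations) joined by typed edges, plus a plain dot-product attention step for aggregating neighbor vectors. The lookup tables `COMPONENTS` and `PINYIN`, the relation names, and the functions `build_name_graph`, `softmax`, and `attend` are all hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict

# Hypothetical toy lookup tables (illustrative only, not the paper's data).
COMPONENTS = {"芳": ["艹", "方"], "伟": ["亻", "韦"]}
PINYIN = {"芳": "fang", "伟": "wei"}


def build_name_graph(given_name):
    """Return typed edges (relation -> list of (src, dst)) for one given name."""
    edges = defaultdict(list)
    chars = list(given_name)
    # Link adjacent characters so that character combinations are represented.
    for left, right in zip(chars, chars[1:]):
        edges["char-char"].append((left, right))
    for ch in chars:
        # Each known component and pronunciation points at its character.
        for comp in COMPONENTS.get(ch, []):
            edges["component-char"].append((comp, ch))
        if ch in PINYIN:
            edges["pinyin-char"].append((PINYIN[ch], ch))
    return dict(edges)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, neighbors):
    """Dot-product attention: average neighbor vectors, weighted by similarity."""
    scores = [sum(q * n for q, n in zip(query, vec)) for vec in neighbors]
    weights = softmax(scores)
    dim = len(query)
    return [sum(w * vec[i] for w, vec in zip(weights, neighbors)) for i in range(dim)]


graph = build_name_graph("芳伟")
for relation, pairs in graph.items():
    print(relation, pairs)
```

Keeping one edge list per relation type is what makes the graph heterogeneous in this sketch: a model could then learn separate attention parameters for component, pronunciation, and character-to-character edges. How CHGAT actually parameterizes these relations is not specified in the abstract.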

