Deep learning techniques have gained considerable traction in NLP research. This paper aims to predict an individual's age and gender from their written text. We propose a supervised BERT-based classification technique for predicting the age and gender of bloggers. The dataset contains 681,284 rows, each recording the blogger's age, gender, and the text of a blog they wrote. We compare our approach with previous work in the same domain and achieve better accuracy and F1 score. The reported accuracy is 84.2% for age-group prediction and 86.32% for gender prediction. The study relies on the raw capabilities of BERT to classify textual data efficiently. The results show promising capability in predicting author demographics with high accuracy, with potential applicability across multiple domains.
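Since the abstract reports results as accuracy and F1 score, the following minimal sketch shows how these two metrics could be computed for a classification task such as gender or age-group prediction. It is an illustration only, not the evaluation code used in the paper: the function names `accuracy` and `macro_f1`, the example labels, and the choice of macro-averaging are assumptions, since the abstract does not state how the F1 score was averaged.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro-averaging is an assumption)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN)
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels, for illustration only (not data from the paper).
true_labels = ["male", "female", "female", "male"]
predicted = ["male", "female", "male", "male"]
print(accuracy(true_labels, predicted))
print(macro_f1(true_labels, predicted))
```

In this toy example three of the four predictions are correct, so the accuracy is 0.75, while the macro F1 is lower because one "female" sample is misclassified. The same functions apply unchanged to multi-class age-group labels.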