In recent years, the extraction of opinions and information from user-generated text has attracted considerable interest, largely due to the unprecedented volume of content on social media. However, social researchers face obstacles in adopting cutting-edge tools for these tasks, as these tools are usually locked behind commercial APIs, unavailable for languages other than English, or too complex for non-experts to use. To address these issues, we present pysentimiento, a comprehensive multilingual Python toolkit designed for opinion mining and other Social NLP tasks. This open-source library provides state-of-the-art models for Spanish, English, Italian, and Portuguese through an easy-to-use interface, allowing researchers to leverage these techniques. We present a comprehensive assessment of the performance of several pre-trained language models across a variety of tasks, languages, and datasets, including an evaluation of fairness in the results.