Decoding Funded Research: Comparative Analysis of Topic Models and Uncovering the Effect of Gender and Geographic Location

Optimizing national scientific investment requires a clear understanding of evolving research trends and of the demographic and geographical forces shaping them, particularly in light of commitments to equity, diversity, and inclusion. This study addresses that need by analyzing 18 years (2005-2022) of research proposals funded by the Natural Sciences and Engineering Research Council of Canada (NSERC). We conducted a comprehensive comparative evaluation of three topic modelling approaches: Latent Dirichlet Allocation (LDA), Structural Topic Modelling (STM), and BERTopic. We also introduced a novel algorithm, named COFFEE, designed to enable robust covariate effect estimation for BERTopic. This addresses a significant gap: unlike the probabilistic STM, BERTopic has no native function for covariate analysis. Our findings show that while all three models effectively delineate core scientific domains, BERTopic outperformed the other two by consistently identifying more granular, coherent, and emergent themes, such as the rapid expansion of artificial intelligence. Additionally, the covariate analysis, powered by COFFEE, confirmed distinct provincial research specializations and revealed consistent gender-based thematic patterns across scientific disciplines. These insights offer a robust empirical foundation for funding organizations to formulate more equitable and impactful funding strategies, thereby enhancing the effectiveness of the scientific ecosystem.
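To make "covariate effect estimation" concrete, here is a minimal Python sketch. It is not COFFEE, whose procedure the abstract does not describe. It only assumes that each document carries a single hard topic label (BERTopic's default output) and asks how much more prevalent a topic is at one covariate level than at another, with a bootstrap percentile interval for uncertainty. The function names, the `group` covariate, and the toy data are all hypothetical.

```python
import random


def prevalence(docs, topic_id):
    """Fraction of documents whose assigned topic equals topic_id."""
    return sum(1 for d in docs if d["topic"] == topic_id) / len(docs)


def prevalence_gap(docs, topic_id, covariate, level_a, level_b):
    """Topic prevalence at level_a minus topic prevalence at level_b."""
    group_a = [d for d in docs if d[covariate] == level_a]
    group_b = [d for d in docs if d[covariate] == level_b]
    return prevalence(group_a, topic_id) - prevalence(group_b, topic_id)


def bootstrap_interval(docs, topic_id, covariate, level_a, level_b,
                       n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the prevalence gap.

    Documents are resampled with replacement within each covariate level,
    so neither group can end up empty.
    """
    rng = random.Random(seed)
    group_a = [d for d in docs if d[covariate] == level_a]
    group_b = [d for d in docs if d[covariate] == level_b]
    gaps = []
    for _ in range(n_boot):
        resampled_a = rng.choices(group_a, k=len(group_a))
        resampled_b = rng.choices(group_b, k=len(group_b))
        gaps.append(prevalence(resampled_a, topic_id)
                    - prevalence(resampled_b, topic_id))
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_boot)]
    upper = gaps[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical toy data: topic 0 is more common in group "A" than in group "B".
docs = (
    [{"topic": 0, "group": "A"}] * 30 + [{"topic": 1, "group": "A"}] * 10
    + [{"topic": 0, "group": "B"}] * 10 + [{"topic": 1, "group": "B"}] * 30
)

gap = prevalence_gap(docs, 0, "group", "A", "B")
lower, upper = bootstrap_interval(docs, 0, "group", "A", "B")
print(gap, (lower, upper))
```

An interval that excludes zero suggests the topic's prevalence really differs between the two levels. A regression over several covariates at once, as STM offers natively, would be the natural next step, but it is beyond this sketch.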

