Optimizing national scientific investment requires a clear understanding of evolving research trends and of the demographic and geographical forces that shape them, particularly in light of commitments to equity, diversity, and inclusion. This study addresses that need by analyzing 18 years (2005-2022) of research proposals funded by the Natural Sciences and Engineering Research Council of Canada (NSERC). We conducted a comprehensive comparative evaluation of three topic modelling approaches: Latent Dirichlet Allocation (LDA), Structural Topic Modelling (STM), and BERTopic. We also introduced a novel algorithm, COFFEE, designed to enable robust covariate effect estimation for BERTopic. This addresses a significant gap: unlike the probabilistic STM, BERTopic has no native function for covariate analysis. Our findings show that while all three models effectively delineate core scientific domains, BERTopic outperformed the other two by consistently identifying more granular, coherent, and emergent themes, such as the rapid expansion of artificial intelligence. In addition, the covariate analysis powered by COFFEE confirmed distinct provincial research specializations and revealed consistent gender-based thematic patterns across scientific disciplines. These insights give funding organizations a robust empirical foundation for formulating more equitable and impactful funding strategies, thereby enhancing the effectiveness of the scientific ecosystem.