We summarize an exploratory investigation into whether machine learning (ML) techniques applied to publicly available professional text can substantially augment strategic planning for astronomy. We find that an approach based on Latent Dirichlet Allocation (LDA), using content drawn from astronomy journal papers, can be used to infer high-priority research areas. Although the LDA models are challenging to interpret, we find that they may be strongly associated with meaningful keywords and scientific papers, which allows human interpretation of the topic models. We find significant correlation between the results of applying these models to the previous decade of astronomical research ("1998-2010" corpus) and the contents of the science frontier panel report, which contains the high-priority research areas identified by the 2010 National Academies' Astronomy and Astrophysics Decadal Survey ("DS2010" corpus). Significant correlations also exist between the model results for the 1998-2010 corpus and the whitepapers submitted to the Decadal Survey ("whitepapers" corpus). Importantly, we derive predictive metrics from these results that can provide leading indicators of which content captured by the topic models will become highly cited in the future. Using these metrics and the associations between papers and topic models, it is possible to identify important papers for planners to consider. A preliminary version of this work was presented by Thronson et al. (2021) and Thomas et al. (2022).