Summary: The National Health and Nutrition Examination Survey (NHANES) is a program of studies led by the Centers for Disease Control and Prevention (CDC) designed to assess the health and nutritional status of adults and children in the United States (U.S.). Biostatisticians and clinical scientists frequently use NHANES data to study health trends across the U.S., but every analysis requires extensive data management and cleaning before use. This repetitive data engineering costs valuable research time and reduces the reproducibility of analyses. Here, we introduce NHANES-GCP, a set of Cloud Development Kit for Terraform (CDKTF) Infrastructure-as-Code (IaC) and Data Build Tool (dbt) resources built on the Google Cloud Platform (GCP) that automates the data engineering and management aspects of working with NHANES data. At current GCP pricing, NHANES-GCP costs less than $2 to run and less than $15 per year to host the NHANES data, while providing researchers with clean data tables that can be readily integrated for large-scale analyses. We provide examples that use BigQuery ML to select data, integrate data, train machine learning and statistical models, and generate results, all from a single SQL-like query. NHANES-GCP is designed to enhance the reproducibility of analyses and to create a well-engineered NHANES data resource for statistics, machine learning, and fine-tuning Large Language Models (LLMs). Availability and implementation: NHANES-GCP is available at https://github.com/In-Vivo-Group/NHANES-GCP
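To make the single-query workflow described in the summary concrete, here is a minimal sketch of what such a BigQuery ML statement might look like, built as a string in Python. The dataset name (`nhanes`), the table names (`demographics`, `body_measures`, `blood_pressure`), and the model name (`sbp_model`) are hypothetical and are not taken from the NHANES-GCP repository; `SEQN`, `RIDAGEYR`, `BMXBMI`, and `BPXSY1` are standard NHANES variable names, but which cleaned tables expose them is an assumption here.

```python
def build_bqml_query(dataset: str = "nhanes") -> str:
    """Build one BigQuery ML statement that selects columns, joins tables,
    and trains a linear regression.

    All dataset, table, and model names are illustrative assumptions,
    not the names used by NHANES-GCP.
    """
    return f"""
CREATE OR REPLACE MODEL `{dataset}.sbp_model`
OPTIONS (model_type = 'LINEAR_REG', input_label_cols = ['systolic_bp']) AS
SELECT
  demo.RIDAGEYR AS age_years,   -- age in years at screening
  bmx.BMXBMI    AS bmi,         -- body mass index
  bpx.BPXSY1    AS systolic_bp  -- first systolic reading (model label)
FROM `{dataset}.demographics` AS demo
JOIN `{dataset}.body_measures` AS bmx USING (SEQN)
JOIN `{dataset}.blood_pressure` AS bpx USING (SEQN)
WHERE bpx.BPXSY1 IS NOT NULL
  AND bmx.BMXBMI IS NOT NULL
""".strip()


# Build the statement with the default (hypothetical) dataset name.
query = build_bqml_query()
print(query)
```

The resulting string could then be submitted with the `bq` command-line tool or the `google-cloud-bigquery` Python client; the sketch only constructs the SQL and never contacts GCP, so it runs without credentials.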