Bayesian Hierarchical Model for Synthesizing Registry and Survey Data on Female Breast Cancer Prevalence

In public health, it is critical for policymakers to assess the relationship between disease prevalence and associated risk factors or clinical characteristics, which facilitates effective resource allocation. However, for diseases like female breast cancer (FBC), reliable prevalence data at specific geographical levels, such as the county level, are limited because the gold-standard data typically come from long-term cancer registries, which do not necessarily collect the needed risk factors. In addition, it remains unclear whether fitting a model to each data source separately or to the sources jointly results in better estimation. In this paper, we identify two data sources to produce reliable county-level prevalence estimates in Missouri, USA: the population-based Missouri Cancer Registry (MCR) and the survey-based Missouri County-Level Study (CLS). We propose a two-stage Bayesian model to synthesize these sources, accounting for their differences in methodological design, case definitions, and collected information. The first stage estimates county-level FBC prevalence using the raking method for the CLS data and the counting method for the MCR data, calibrating the differences in methodological design and case definition. The second stage synthesizes the two sources, which have different sets of covariates, using a Bayesian generalized linear mixed model with the Zellner-Siow prior for the coefficients. Our data analyses demonstrate that using both data sources yields better results than using a single data source, and that including a data source membership indicator matters when there are systematic differences between the sources. Finally, we translate the results into policy implications and discuss methodological differences in the data synthesis of registry and survey data.
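
As a rough illustration of the raking step mentioned for the CLS data, the sketch below applies iterative proportional fitting to a small two-way table of weighted counts until its row and column totals match known population margins. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `rake`, the 2x2 layout, and all numbers are hypothetical, and the abstract does not say which margins were actually used.

```python
import numpy as np

def rake(table, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Iterative proportional fitting on a 2-D table of weighted counts.

    Assumes all cells are positive and that both sets of target
    margins sum to the same population total.
    """
    w = np.asarray(table, dtype=float).copy()
    row_targets = np.asarray(row_targets, dtype=float)
    col_targets = np.asarray(col_targets, dtype=float)
    for _ in range(max_iter):
        # Scale each row so the row totals match the row targets.
        w *= (row_targets / w.sum(axis=1))[:, None]
        # Scale each column so the column totals match the column targets.
        w *= (col_targets / w.sum(axis=0))[None, :]
        # Columns now match exactly; stop once rows also match.
        if np.allclose(w.sum(axis=1), row_targets, atol=tol, rtol=0):
            break
    return w

# Hypothetical survey counts cross-classified by two demographic factors.
sample = np.array([[30.0, 20.0],
                   [25.0, 25.0]])
raked = rake(sample, row_targets=[600.0, 400.0], col_targets=[550.0, 450.0])
print(raked.round(2))
```

Dividing each raked cell by the corresponding sample count would give a per-cell weight adjustment, which could then be used to form weighted prevalence estimates.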
