Distributed Multivariate Regression Modeling For Selecting Biomarkers Under Data Protection Constraints

The discovery of clinical biomarkers requires large patient cohorts and benefits from pooling data across institutions. In many countries, data protection constraints, especially in the clinical environment, forbid the exchange of individual-level data between research institutes, which impedes joint analyses. To circumvent this problem, only non-disclosive aggregated data are exchanged. This exchange is often done manually and requires explicit permission before each transfer, so both the number of data calls and the amount of data should be limited. Because typically only simple aggregated summary statistics are transferred, more complex tasks such as variable selection are not possible. Other proposed methods require more complex aggregated data or perturb the input data, but they either cannot handle a large number of biomarkers or lose information. Here, we propose a multivariable regression approach that identifies biomarkers by automatic variable selection based on aggregated data requested in iterative calls, and that can be implemented under data protection constraints. The approach can be used to jointly analyze data distributed across several locations. To minimize the amount of transferred data and the number of calls, we also provide a heuristic variant of the approach. With global data standardization, the proposed method yields the same results as an analysis of pooled individual-level data. In a simulation study, the information loss introduced by local standardization is seen to be minimal. In a typical scenario, the heuristic decreases the number of data calls from more than 10 to 3, rendering manual data releases feasible. To make our approach widely available for application, we provide an implementation of the heuristic version incorporated in the DataSHIELD framework.
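The claim that aggregated data with global standardization can reproduce a pooled analysis can be illustrated with a minimal Python sketch. This is not the authors' implementation and does not use the DataSHIELD API. All function names are hypothetical, the set of released aggregates (counts, sums, and cross-products) is an assumption, and the selection step is a generic componentwise stagewise regression chosen for illustration. The sketch also requests all aggregates in a single call and makes no attempt to limit the amount of transferred data.

```python
import numpy as np

def site_aggregates(X, y):
    """Summaries a single site might release (assumed set, no individual rows)."""
    return {
        "n": X.shape[0],
        "sum_x": X.sum(axis=0),
        "sum_y": y.sum(),
        "xtx": X.T @ X,
        "xty": X.T @ y,
    }

def pool(aggs):
    """Add up the per-site summaries; sums and cross-products are additive."""
    return {key: sum(a[key] for a in aggs) for key in aggs[0]}

def standardized_moments(p):
    """Globally standardized moments computed from pooled sums only.

    Returns R (correlation matrix of the covariates) and r (covariance of
    each standardized covariate with the centered response).
    """
    n = p["n"]
    mean_x = p["sum_x"] / n
    mean_y = p["sum_y"] / n
    cov_xx = p["xtx"] / n - np.outer(mean_x, mean_x)
    cov_xy = p["xty"] / n - mean_x * mean_y
    sd = np.sqrt(np.diag(cov_xx))
    return cov_xx / np.outer(sd, sd), cov_xy / sd

def componentwise_stagewise(R, r, steps=50, nu=0.1):
    """Greedy variable selection that touches aggregated moments only."""
    beta = np.zeros(len(r))
    for _ in range(steps):
        score = r - R @ beta              # covariance of each covariate with the residual
        j = int(np.argmax(np.abs(score)))  # covariate with the best univariable fit
        beta[j] += nu * score[j]           # R[j, j] is 1 after standardization
    return beta

# Simulated example: 300 patients, 20 candidate biomarkers, 3 sites.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=300)
sites = np.array_split(np.arange(300), 3)

aggs = [site_aggregates(X[idx], y[idx]) for idx in sites]
R, r = standardized_moments(pool(aggs))
beta = componentwise_stagewise(R, r)
selected = np.flatnonzero(beta)
```

Because counts, sums, and cross-products add up across sites, the standardized moments computed here match those obtained from the pooled individual-level data, so any procedure that depends on the data only through these moments gives the same result in both settings.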
