Multi-institutional electronic health record (multi-EHR) data have emerged as a powerful resource for developing predictive models to support clinical decisions and for generating reliable real-world evidence. By aggregating information from diverse patient populations and institutions, they enhance the robustness and generalizability of models and findings. However, analyzing multi-EHR data remains challenging because disparate institutions rarely map all data elements to common ontologies, and raw EHR codes are often overly granular and institution-specific, fragmenting representations of the same clinical concept. Hence, integrative analysis must overcome two key hurdles: harmonizing codes with the same clinical meaning (synonymy) and aligning institutional feature spaces. To address these challenges, we propose SMILE, a Spherical Mixture Integration for Latent Embedding alignment across multi-source feature spaces, where embeddings from heterogeneous sources serve as privacy-preserving summaries of clinical concepts and sparse relational pairs provide weak supervision. Synonymy is modeled via a mixture of von Mises-Fisher distributions, yielding unified representations of semantically equivalent raw codes. We develop a composite quasi-likelihood estimator and establish non-asymptotic error bounds for the latent representations and mixture mean directions, together with consistent synonym-cluster recovery; these results quantify the gains from integrating multiple sources and knowledge-graph information. Simulations and a multi-institutional EHR application demonstrate improved alignment and synonym clustering.