Multi-source Learning via Completion of Block-wise Overlapping Noisy Matrices

Doudou Zhou, Tianxi Cai, Junwei Lu; 24(221):1−43, 2023.

Abstract

Electronic health records (EHR) serve as a valuable resource for healthcare research. One challenge in effectively utilizing EHR data is representing its features, which comprise unstructured clinical narratives and structured codified data. Matrix factorization-based embeddings, trained using summary-level co-occurrence statistics of EHR data, offer a promising solution for feature representation while safeguarding patient privacy. However, these methods do not perform well with multi-source data that have overlapping but non-identical features. To address the issue of multi-source learning, we propose a novel word embedding generative model. To obtain multi-source embeddings, we introduce an efficient algorithm called Block-wise Overlapping Noisy Matrix Integration (BONMI) that optimally aggregates the pointwise mutual information matrices from multiple sources, with a theoretical guarantee. Our algorithm can also be applied to other multi-source data integration problems with a similar data structure. An additional contribution of BONMI is that it accounts for a missing mechanism beyond the assumption of entry-wise independent missingness that is prevalent in matrix completion. We demonstrate that the entry-wise missing assumption is not necessary for recovery, and we prove a statistical rate for our estimator that is comparable to the rate under independent missingness. Simulation studies show that BONMI performs well under various configurations. Furthermore, we showcase the utility of BONMI by integrating multi-lingual multi-source medical text and EHR data to perform two tasks: (i) co-training semantic embeddings for medical concepts in both English and Chinese, and (ii) translating medical concepts between English and Chinese. Our method exhibits an advantage over existing methods.
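To illustrate the block-wise overlapping structure the abstract refers to, the following is a minimal toy sketch (not the paper's implementation): two sources each observe a noisy principal submatrix of a shared low-rank symmetric matrix, with overlapping entities. Each block is embedded via a top-r eigendecomposition, the embeddings are aligned on the overlap with an orthogonal Procrustes rotation (spectral embeddings are only identified up to rotation), and the stitched embeddings reconstruct the full matrix. All dimensions, noise levels, and variable names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
p, r = 60, 3                       # number of entities, latent rank (assumed)
U = rng.normal(size=(p, r))
S = U @ U.T                        # ground-truth rank-r symmetric matrix

# Two sources observe overlapping blocks of entities with additive noise.
idx1 = np.arange(0, 40)            # source 1 covers entities 0..39
idx2 = np.arange(20, 60)           # source 2 covers entities 20..59 (overlap 20..39)
sigma = 0.1
S1 = S[np.ix_(idx1, idx1)] + sigma * rng.normal(size=(len(idx1), len(idx1)))
S2 = S[np.ix_(idx2, idx2)] + sigma * rng.normal(size=(len(idx2), len(idx2)))
S1, S2 = (S1 + S1.T) / 2, (S2 + S2.T) / 2   # symmetrize the noise

def embed(M, r):
    """Top-r eigen-embedding of a symmetric matrix: V * sqrt(lambda)."""
    w, V = np.linalg.eigh(M)
    top = np.argsort(w)[::-1][:r]
    return V[:, top] * np.sqrt(np.maximum(w[top], 0.0))

E1, E2 = embed(S1, r), embed(S2, r)

# Align source-2 embedding to source-1 on the shared entities via
# orthogonal Procrustes: O = argmin ||E2[ov2] O - E1[ov1]||_F.
ov1 = np.nonzero(np.isin(idx1, idx2))[0]   # overlap rows within E1
ov2 = np.nonzero(np.isin(idx2, idx1))[0]   # overlap rows within E2
A, _, Bt = np.linalg.svd(E2[ov2].T @ E1[ov1])
E2_aligned = E2 @ (A @ Bt)

# Stitch: average aligned embeddings on the overlap, then reconstruct.
E, cnt = np.zeros((p, r)), np.zeros((p, 1))
E[idx1] += E1;         cnt[idx1] += 1
E[idx2] += E2_aligned; cnt[idx2] += 1
E /= cnt
S_hat = E @ E.T
rel_err = np.linalg.norm(S_hat - S) / np.linalg.norm(S)
```

Neither source alone observes the cross-block entries between entities 0..19 and 40..59; the aligned, stitched embedding fills them in, which is the "completion" aspect of integrating block-wise overlapping noisy matrices.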
