Distributed Data and its Application in Least Squares Model Averaging

Least Squares Model Averaging for Distributed Data

Authors: Haili Zhang, Zhaobo Liu, Guohua Zou; Journal of Machine Learning Research 24(215):1–59, 2023.

Abstract

The divide-and-conquer algorithm is commonly used in big data analysis. However, the theory of model averaging in big data scenarios has not been fully developed. This paper aims to bridge this gap by proposing two divide-and-conquer-type model averaging estimators for linear models with distributed data. Under certain regularity conditions, we demonstrate that the weights obtained from the Mallows model averaging criterion converge in L2 to the theoretically optimal weights that minimize the risk of the model averaging estimator. We also provide bounds for the in-sample and out-of-sample mean squared errors and prove the asymptotic optimality of the proposed model averaging estimators. These findings remain valid even when the dimensions and the number of candidate models diverge. Simulation results and an analysis of real airline data illustrate that the proposed model averaging methods outperform commonly used model selection and model averaging methods in distributed data cases. Our approaches contribute to the theory of model averaging in distributed data and parallel computations, and can be applied in big data analysis to save time and reduce computational burden.
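To make the idea concrete, the following is a minimal sketch of one possible divide-and-conquer Mallows model averaging scheme. It is not claimed to reproduce either of the paper's two estimators. It assumes nested candidate models (the first k columns of the design matrix), an error-variance estimate taken from the largest candidate model, a Frank-Wolfe solver for the weights on the simplex, and a final estimate formed by simple averaging of the block-level coefficients. All function names (`solve`, `ols_nested`, `mallows_weights`, `dac_mma`) are hypothetical.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def ols_nested(X, y, k):
    """OLS on the first k columns; coefficients are zero-padded to full length."""
    p = len(X[0])
    G = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    g = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(G, g) + [0.0] * (p - k)

def mallows_weights(X, y, sizes, iters=2000):
    """Minimise w'Hw + 2*sigma2*k'w over the simplex, where H = E'E and E
    holds the candidate-model residuals. `sizes` must be ascending."""
    n, M = len(X), len(sizes)
    betas = [ols_nested(X, y, k) for k in sizes]
    resid = [[yi - sum(x * b for x, b in zip(row, beta)) for row, yi in zip(X, y)]
             for beta in betas]
    # Assumption: sigma^2 is estimated from the largest candidate model.
    sigma2 = sum(e * e for e in resid[-1]) / (n - sizes[-1])
    H = [[sum(a * b for a, b in zip(resid[i], resid[j])) for j in range(M)]
         for i in range(M)]
    c = [2.0 * sigma2 * k for k in sizes]
    w = [1.0 / M] * M
    for _ in range(iters):  # Frank-Wolfe with exact line search
        grad = [2.0 * sum(H[i][j] * w[j] for j in range(M)) + c[i] for i in range(M)]
        j = min(range(M), key=grad.__getitem__)
        d = [-wi for wi in w]
        d[j] += 1.0
        gd = sum(gi * di for gi, di in zip(grad, d))
        if gd > -1e-12:  # no descent direction left
            break
        dHd = sum(d[a] * H[a][b] * d[b] for a in range(M) for b in range(M))
        step = 1.0 if dHd <= 0.0 else min(1.0, -gd / (2.0 * dHd))
        w = [(1.0 - step) * wi for wi in w]
        w[j] += step
    return w, betas

def dac_mma(X, y, sizes, n_blocks):
    """Average the block-level model-averaged coefficient vectors."""
    p = len(X[0])
    total = [0.0] * p
    for b in range(n_blocks):
        Xb, yb = X[b::n_blocks], y[b::n_blocks]  # stand-in for one machine's data
        w, betas = mallows_weights(Xb, yb, sizes)
        for wm, beta in zip(w, betas):
            for j in range(p):
                total[j] += wm * beta[j] / n_blocks
    return total

# Toy data: the third regressor is irrelevant.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(400)]
y = [1.0 * r[0] + 0.5 * r[1] + rng.gauss(0, 1) for r in X]
print(dac_mma(X, y, sizes=[1, 2, 3], n_blocks=4))
```

In this sketch each block needs only its own data to compute its weights and coefficients, so the loop over blocks could run in parallel, with only the short coefficient vectors communicated back for averaging.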
