The University of Southampton
Southampton Statistical Sciences Research Institute

Cloning Models - Something for free? Seminar

Time:
14:30
Date:
2 June 2016
Venue:
Room 06 / 1077 (L/T A)

Event details

S3RI seminar

Abstract

Cloning in genetics means the creation of an exact genetic copy of an organism. In a similar sense, it is possible to generate new datasets that have exactly the same estimated linear model parameters as the original dataset, via what are called “equivalent” or cloned models.

When collecting data is expensive, small datasets are common. Examples include DNA data at particular sites or for genes, quality control, ecology studies, agricultural field trials, and animal-based studies (where there are also ethical issues). In other situations, government statistics agencies cannot release original data for confidentiality reasons, or require data encryption.

Although simulation is used routinely to test statistical models [12], or to provide alternative datasets which are confidential, such simulated datasets contain both potential model mis-specification and modelling errors. Simulated and original data do not have the same fitted model. Nevertheless, such approximate methods are used routinely in statistics (e.g. in simulation studies) to supplement and test models, and to provide alternative datasets that can be publicly released (e.g. CURFs, Confidentialised Unit Record Files; see [13]). In contrast, datasets generated using cloned models have zero modelling error. Cloning in its various forms thus offers potential improvement to standard bootstrap and jackknife methods, which generate simulated data that inevitably contain model error. Bootstrapping and jackknifing have created enormous interest and an extensive literature since publication of [1]. Cloning may provide better methods where model error is not negligible, for example for testing of saturated models in which there are as many model parameters as observations. Cloning can also be used to remove random variation via smoothing to elucidate underlying phenomena, to better visualise an underlying fitted model, and to detect model aberrations.

Such supplementary data might be called cloned data, but the term already has multiple meanings (cf. [3] with [11], where cloning for maximum likelihood estimation using Bayesian software is achieved by the simple device of replicating the original data many times).

To date, model cloning has been studied only for certain types of statistical model. For example, for any p-dimensional dataset, via model cloning we can generate 2p-1 further datasets that have identical multiple linear regression parameter estimates and hence model fit. See [3] and [4], which utilise orthogonal subspaces within the model-design matrix. Despite its novelty, model cloning already has known applications in data confidentiality, encryption, smoothing, and data visualisation, and relatively unexplored potential to improve hypothesis testing. Even the current cloning methods for regression and general linear models form a wide class, including the types of experimental designs used routinely in agriculture and industry, and the mixed linear models used in genetics, epidemiology and small area estimation. Cloning can also be used for database encryption, even if there is no interest at all in the underlying regression model - see [3].

Cloning for linear models can be achieved in a number of ways. See for example [3] and [6]. Non-full rank methods (via generalized inverses and matrix column spaces), as discussed in [5]-[10] and in the appendix to [2], are also relevant. Using subspace arguments, even for non-full rank models it is possible to use model cloning to construct different datasets where not only does a full fixed parameter linear model have identical estimates and estimated covariances, but so do all its submodels. By using a given error covariance structure, or with a relatively mild restriction on estimation of error covariance matrices, these results can be extended to linear mixed models.
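As a rough illustration of the subspace idea, the sketch below builds a second response vector with the same least-squares fit as the first. It is only one simple construction, not necessarily the method of [3] or [4]: the fitted values are kept, and the residuals are replaced by a different vector that is orthogonal to the column space of the design matrix and has the same length. The helper names (orthonormal_basis, project, clone_response) and the toy data are assumptions made for this example, and the design matrix is assumed to have full column rank.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(columns):
    """Gram-Schmidt on the design-matrix columns (assumed linearly independent)."""
    basis = []
    for col in columns:
        v = list(col)
        for q in basis:
            c = dot(q, v)
            v = [vi - c * qi for vi, qi in zip(v, q)]
        norm = math.sqrt(dot(v, v))
        basis.append([vi / norm for vi in v])
    return basis


def project(basis, v):
    """Orthogonal projection of v onto the span of an orthonormal basis."""
    out = [0.0] * len(v)
    for q in basis:
        c = dot(q, v)
        out = [o + c * qi for o, qi in zip(out, q)]
    return out


def clone_response(columns, y, rng):
    """Return a response with the same fitted values and residual sum of squares as y."""
    basis = orthonormal_basis(columns)
    fitted = project(basis, y)                      # least-squares fitted values
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    # Draw an arbitrary vector and remove its component in the column space.
    z = [rng.gauss(0.0, 1.0) for _ in y]
    pz = project(basis, z)
    e = [zi - pi for zi, pi in zip(z, pz)]
    # Rescale so that the new residuals have the same sum of squares.
    scale = math.sqrt(dot(resid, resid) / dot(e, e))
    return [fi + scale * ei for fi, ei in zip(fitted, e)]


# Hypothetical toy data: intercept plus two regressors, ten observations.
rng = random.Random(0)
n = 10
ones = [1.0] * n
x1 = [float(i) for i in range(n)]
x2 = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [1.0 + 2.0 * a - 0.5 * b + rng.gauss(0.0, 1.0) for a, b in zip(x1, x2)]
columns = [ones, x1, x2]

y_clone = clone_response(columns, y, rng)
```

Because the fitted values agree and the design matrix has full column rank, the regression coefficients agree as well. The residual sum of squares is also unchanged, so the usual estimated covariance matrix of the coefficients is the same for both responses.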

Cloning via residuals, mentioned in the initial sections of [3], raises one of several further possibilities for extending model cloning methods beyond linear models. However, model cloning can already provide a straightforward but secure method of data encryption and, for linear models, has the potential to underpin better practical methods of dealing with the all too common situation in which there is too little data, or the original data cannot be publicly released.

 

References:

[1] Efron, B. & Tibshirani, R. (1993) An Introduction to the Bootstrap, Chapman & Hall/CRC Monographs on Statistics & Applied Probability.

[2] Haslett, J. & Haslett, S. J. (2007) The three basic types of residuals for a linear model, International Statistical Review, 75, 1–24.

[3] Haslett, S. J. & Govindaraju, K. (2012) Data cloning: Data visualisation, smoothing, confidentiality, and encryption, Journal of Statistical Planning and Inference, 142, 410-422.

[4] Haslett, S. J. & Govindaraju, K. (2009) Cloning data: generating datasets with exactly the same multiple linear regression fit. Australian & New Zealand Journal of Statistics, 51, 499-503.

[5] Haslett, S. J., Isotalo, J., Liu, Y. & Puntanen, S. (2013) Equalities between OLSE, BLUE and BLUP in the linear model, Statistical Papers, 55, 543-561, Springer. DOI: 10.1007/s00362-013-0500-7.

[6] Haslett, S. J. & Puntanen, S. (2011) On the equality of the BLUPs under two linear mixed models. Metrika, 74, 381-395.

[7] Haslett, S. J. & Puntanen, S. (2010a) Effect of adding regressors on the equality of the BLUEs under two linear models. Journal of Statistical Planning and Inference, 140, 104-110.

[8] Haslett, S. J. & Puntanen, S. (2010b) Equality of BLUEs or BLUPs under two linear models using stochastic restrictions. Statistical Papers, 51, 465-475.

[9] Haslett, S. J. & Puntanen, S. (2010c) A note on the equality of the BLUPs for new observations under two linear models. Acta et Commentationes Universitatis Tartuensis de Mathematica, 14, 27-33.

[10] Haslett, S. J., Puntanen, S. & Arendacká, B. (2015) The link between the fixed and mixed linear models revisited, Statistical Papers, 56, 3, 849-861. DOI: 10.1007/s00362-014-0611-9.

[11] Lele, S.R., Nadeem, K. & Schmuland, B. (2010) Estimability and likelihood inference for generalized linear mixed models using data cloning, Journal of the American Statistical Association, 105, 1617-1625.

[12] McCullagh, P. (2002) What is a statistical model? The Annals of Statistics, 30: 1225-1267.

[13] United Nations Economic Commission for Europe (2014) Managing Statistical Confidentiality and Microdata Access – Principles and Guidelines of Good Practice: Interim Guidelines, United Nations Economic Commission for Europe, http://www1.unece.org/stat/platform/display/confid/Managing+Statistical+Confidentiality+and+Microdata+Access

 

Speaker information

Dr. Stephen Haslett, The Australian National University. Professor and Director of the Statistical Consulting Unit
