Southampton Statistical Sciences Research Institute

Research project: Disclosure risk control using synthetic data

Currently Active: Yes

The synthetic data approach models the joint distribution of the variables in the dataset.

Project Overview

Synthetic values are generated from their predictive distributions to replace actual sensitive values in the data. In this way, simulated values replace part or even all of the dataset, which limits the disclosure risk. This procedure is repeated several times to generate multiple synthetic datasets, which the agency then releases to the public. Researchers perform their analysis on each of the synthetic datasets separately and then, using simple combining rules, obtain appropriate interval estimates for a variety of estimands.
An important goal is to reflect the statistical properties of the original data in the synthetic datasets, since this increases the utility of the released data to researchers.
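The generate-and-combine workflow described above can be illustrated with a minimal Python sketch. It is not the project's own implementation: the variable names (`age`, `income`), the normal linear regression used as the synthesis model, and the toy data are all assumptions made for illustration. The combining step assumes the rules commonly used for partially synthetic data, in which the point estimate is the average of the per-dataset estimates and the variance estimate is the average within-dataset variance plus the between-dataset variance divided by the number of datasets.

```python
import numpy as np


def synthesize_income(age, income, m, rng):
    """Draw m synthetic copies of a sensitive variable (hypothetical 'income').

    Assumed synthesis model: normal linear regression of income on age.
    Parameters are drawn from their posterior under a noninformative prior
    before each synthetic copy is generated.
    """
    n = len(income)
    X = np.column_stack([np.ones(n), age])
    p = X.shape[1]
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ income
    resid = income - X @ beta_hat
    rss = resid @ resid

    copies = []
    for _ in range(m):
        sigma2 = rss / rng.chisquare(n - p)                      # draw error variance
        beta = rng.multivariate_normal(beta_hat, sigma2 * XtX_inv)  # draw coefficients
        copies.append(X @ beta + rng.normal(0.0, np.sqrt(sigma2), size=n))
    return copies


def combine_partially_synthetic(q, u):
    """Combine per-dataset estimates q and their variance estimates u.

    Assumes the combining rules for partially synthetic data:
    variance T = mean(u) + b / m, with b the between-dataset variance.
    Returns the point estimate, its variance, and t degrees of freedom.
    """
    q = np.asarray(q, dtype=float)
    u = np.asarray(u, dtype=float)
    m = len(q)
    q_bar = q.mean()
    b = q.var(ddof=1)
    u_bar = u.mean()
    T = u_bar + b / m
    df = float("inf") if b == 0 else (m - 1) * (1 + u_bar / (b / m)) ** 2
    return q_bar, T, df


# Toy data standing in for a confidential file (purely illustrative).
rng = np.random.default_rng(0)
age = rng.uniform(20, 65, size=500)
income = 10_000 + 800 * age + rng.normal(0, 5_000, size=500)

synthetic_copies = synthesize_income(age, income, m=5, rng=rng)

# The analyst estimates mean income separately on each released dataset.
q = [y.mean() for y in synthetic_copies]
u = [y.var(ddof=1) / len(y) for y in synthetic_copies]
estimate, variance, df = combine_partially_synthetic(q, u)
```

An interval estimate then follows from a t distribution with `df` degrees of freedom centred at `estimate` with standard error `variance ** 0.5`. If the whole dataset is replaced rather than only the sensitive values, a different variance formula applies, so the combining function above should not be reused unchanged in that case.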
Robin Mitra (S3RI) and Jerry Reiter (Duke University) have investigated strategies for adjusting survey weights in partially synthetic data. Details are available in:

Mitra R and Reiter JP. Adjusting survey weights when altering identifying design variables via synthetic data. In Privacy in Statistical Databases. Eds. J Domingo-Ferrer and L Franconi, Lecture Notes in Computer Science, Springer 2006: 177-188.

It is also important to estimate the risks of identification disclosure with the partially synthetic data approach. Research in this area is reported in the following S3RI working paper:

