The University of Southampton
Southampton Statistical Sciences Research Institute

Data integration

Data integration involves combining data residing in different sources to enable statistical inference, or to generate new statistical data, for purposes that cannot be served by the data in their primary state or by any single source on its own. The process can yield significant gains for scientific as well as commercial investigations, and is increasingly used to cope with both the growing volume of existing data and the need to share them.

Statistical theories for data integration broadly fall into three subject domains:

  • Record linkage refers to the situation where data from two (or more) sources are matched or linked to create a joint record for each common unit. Particular attention is given to the case where a unique identifier of the units does not exist, so that the outcome of the matching process acquires a probabilistic nature.
  • Data fusion (or statistical matching) refers to the situation where the units of the distinct datasets do not overlap, or the proportion of common units is too low for the problem to be treated meaningfully as one of record linkage. The problem is then framed as one of inference about the joint statistical distribution, based on knowledge or observations of the relevant marginal distributions.
  • Micro integration is needed when the linked dataset, whether produced by record linkage or data fusion, contains conflicting information on related, similar or even supposedly identical measurements. The aim of micro integration is to achieve cohesive statistical data at the unit level.
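
To make the probabilistic nature of record linkage concrete, the following is a minimal sketch in Python, in the spirit of the Fellegi-Sunter framework (a standard approach that this page does not itself name). Each pair of records receives a match weight: the sum, over comparison fields, of the log-likelihood ratio of agreement or disagreement. The field names, the m- and u-probabilities and the two thresholds are all hypothetical values chosen for illustration; in practice they would be estimated from the data.

```python
import math

# Hypothetical parameters per comparison field (illustrative only):
#   m = P(field agrees | the two records refer to the same unit)
#   u = P(field agrees | the two records refer to different units)
FIELD_PARAMS = {
    "surname": (0.95, 0.01),
    "birth_year": (0.98, 0.05),
    "postcode": (0.90, 0.02),
}


def match_weight(rec_a, rec_b, params=FIELD_PARAMS):
    """Return the sum of log2 likelihood ratios over the comparison fields."""
    total = 0.0
    for field, (m, u) in params.items():
        if rec_a.get(field) == rec_b.get(field):
            # Agreement is evidence for a match when m > u.
            total += math.log2(m / u)
        else:
            # Disagreement is evidence against a match.
            total += math.log2((1 - m) / (1 - u))
    return total


def classify(weight, lower=0.0, upper=8.0):
    """Assign a pair to one of three classes using two assumed thresholds."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


# Toy records with no shared unique identifier.
survey = {"surname": "Patel", "birth_year": 1980, "postcode": "SO17 1BJ"}
register_same = {"surname": "Patel", "birth_year": 1980, "postcode": "SO17 1BJ"}
register_moved = {"surname": "Patel", "birth_year": 1980, "postcode": "SO15 2AA"}
register_other = {"surname": "Jones", "birth_year": 1975, "postcode": "SO14 3XY"}

for candidate in (register_same, register_moved, register_other):
    w = match_weight(survey, candidate)
    print(f"{w:7.2f}  {classify(w)}")
```

Pairs that fall between the two thresholds are neither accepted nor rejected automatically; they would typically be sent for clerical review, which is where the linkage outcome stops being deterministic.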
