The University of Southampton
The Alan Turing Institute

About Data Science

Data science addresses both the technical and non-technical issues that arise from the explosive growth in data. It provides methods for analysing this data and visualising the results, ultimately giving data scientists a set of tools for telling stories with data. With these skills, organisations can deepen their understanding and, in turn, generate revenue from new, innovative sources. The TED talk below highlights some of the difficulties and issues that arise with big data.

The Data Science Pipeline 

The Data Science process covers the entire story of analysing data: from planning the experiment design to reporting the results in a meaningful and insightful way. It begins with data collection, where data is gathered from one or more locations. Data is crawled to extract and join datasets that will provide insights into the research question. Exploratory data analysis is then carried out to identify any wrong or missing data and to determine where data cleaning is required to ensure that the dataset is accurate.
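The collection, joining, exploration and cleaning steps described above can be sketched in a few lines of Python. This is a minimal illustration only: the station readings, the city lookup, the field names and the plausible temperature range are all hypothetical and are not taken from any real dataset.

```python
# Hypothetical records "collected" from two sources; all values are illustrative.
readings = [
    {"station_id": "A", "temperature_c": "14.2"},
    {"station_id": "B", "temperature_c": ""},      # missing value
    {"station_id": "C", "temperature_c": "999"},   # implausible value
    {"station_id": "D", "temperature_c": "11.8"},  # no matching station below
]
stations = {"A": "Southampton", "B": "Portsmouth", "C": "Winchester"}


def join_datasets(readings, stations):
    """Attach a city to each reading (a left join on station_id)."""
    return [{**r, "city": stations.get(r["station_id"])} for r in readings]


def find_problems(rows, low=-50.0, high=60.0):
    """Exploratory check: return (row_index, reason) for wrong or missing data."""
    problems = []
    for i, row in enumerate(rows):
        if row["city"] is None:
            problems.append((i, "no matching station"))
        raw = row["temperature_c"]
        if raw == "":
            problems.append((i, "missing temperature"))
        elif not (low <= float(raw) <= high):
            problems.append((i, "temperature out of range"))
    return problems


def clean(rows, problems):
    """Drop flagged rows and convert the remaining temperatures to floats."""
    bad = {i for i, _ in problems}
    return [
        {**row, "temperature_c": float(row["temperature_c"])}
        for i, row in enumerate(rows)
        if i not in bad
    ]


joined = join_datasets(readings, stations)
problems = find_problems(joined)
cleaned = clean(joined, problems)
```

Dropping flagged rows is only one possible cleaning strategy; depending on the research question, imputing or correcting values may be more appropriate.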

Once the data has been cleaned, analysis can begin. Here, a model is often designed to represent the data, and algorithms and statistical methods are applied to mine the data and produce outputs. It is at this stage that machine learning may be used to automate processes based on previous experience. Once results have been obtained, they must be reported to various audiences, which often requires visualisation techniques that help to highlight critical aspects of the data. If the insights are communicated well, others can adopt the model or the processes used to acquire them and go on to build a data science application.
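As a rough sketch of the analysis stage, the example below fits a very simple model (a straight line, by ordinary least squares) to past observations and then uses it to predict a new value. The choice of a linear model, the function names and the numbers are assumptions made purely for illustration; real projects typically use richer models and dedicated libraries.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical "previous experience": weeks of activity vs. observed outcome.
weeks = [1, 2, 3, 4, 5]
outcome = [2.1, 3.9, 6.2, 7.8, 10.1]

model = fit_line(weeks, outcome)
forecast = predict(model, 6)  # estimate for an unseen week
```

The fitted slope and intercept are themselves a result that can be reported, for example by drawing the line over a scatter plot of the observations.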

The foundation of data science is the phenomenon of ‘Big Data’: the massive amounts of data rapidly being placed on the Web, which are so large that they require novel approaches to management and organisation compared with traditional database solutions. Data scientists can help to manage and process this data, producing insights that would otherwise be lost or drowned out, for organisations ranging from large corporations to government agencies.

When talking about Big Data, four "V's" are often used to help describe its key characteristics: volume, velocity, variety and veracity. These refer to the amount of data (volume), the fact that it arrives quickly and often in real time (velocity), the different forms of data it contains (variety), and the degree of uncertainty it carries (veracity).

Smart Data is not necessarily 'big data'; instead, it focuses on the valuable parts of the data without the noise commonly associated with big data. It therefore concentrates on the data and information that make sense, allowing actionable insights to be obtained. Semantic technologies are often adopted to link it with other key datasets, which allows more insights to be gained from the data.

Open Data has become a popular idea among governments that wish to increase transparency by making their data available to the public. The Open Data paradigm gives potential users access to vast amounts of information, and it becomes a particularly promising concept when the data are released according to Linked Data principles, as Linked Open Data.

Data scientists use their skills to produce analyses and visualisations of these large amounts of data that tell a story to different stakeholders (digital storytelling). This opens up their findings so that they can be interpreted and understood by the many interested parties, and it highlights the importance of the data scientist to an organisation. Data scientists act as a vital bridge between the complexity of the data and the policy- or decision-makers who can instigate real change with the knowledge extracted from it.
