COMP6235 Foundations of Data Science
Module Overview
Welcome to Foundations of Data Science! 'Data scientist' has been described as the sexiest job of the 21st century, and demand for highly skilled practitioners is rising quickly: as the amount of data available for study grows, so does the need for people who can extract meaningful insights from it. This course introduces a range of topics and concepts related to the data science process. It covers the technical pipeline from data collection through processing and analysis to visualisation. You will study topics such as statistics, data crawling, data visualisation, advanced databases and cloud computing, and gain a toolkit for working with data (including R, D3, Google Refine and Hadoop). The course includes a mix of lectures, tutorials, hands-on exercises and invited talks from expert data science practitioners. Coursework will give you experience using the theory and techniques delivered in the lectures, while the group project will give you the chance to apply your knowledge of the data science process and toolkit by developing a data science application.
Aims and Objectives
Learning Outcomes
Knowledge and Understanding
Having successfully completed this module, you will be able to demonstrate knowledge and understanding of:
- Key concepts in data science, including tools, approaches, and application scenarios
- Topics in data collection, sampling, quality assessment and repair
- Topics in statistical analysis and machine learning
- Topics in data processing at scale
- State-of-the-art tools to build data-science applications for different types of data, including text and CSV data
Subject Specific Intellectual and Research Skills
Having successfully completed this module, you will be able to:
- Understand and apply the fundamental concepts and techniques in data science
Subject Specific Practical Skills
Having successfully completed this module, you will be able to:
- Solve real-world data-science problems and build applications in this space
Syllabus
The course will introduce students to the data scientist toolkit and the underlying core concepts. It will cover the full technical pipeline from data collection (sampling methods, crawling) to processing, and basic notions of statistical analysis and visualisation. The module will also include advanced topics in high-performance computing, including non-relational databases and MapReduce. Students taking this course will gain a basic toolkit for working with data (CSV, R, MongoDB). To support these learning objectives, the coursework will include exercises and a group project in which students will use existing open data sets and build their own application.
The course will cover the following concepts:
- Fundamentals and core terminology
- Technology pipeline and methods
- Application scenarios and state of the art
- Data collection (sampling, crawling)
- Data analytics (statistical modelling, basic concepts, experiment design, pitfalls, R)
- Data interpretation and use (visualisation techniques, pitfalls, D3)
- High-performance computing (parallel databases, MapReduce, Hadoop, NoSQL)
- Cloud computing (principles, architectures, existing technologies)
Learning and Teaching
Teaching and learning methods
Lectures and tutorials, as well as coursework (group project, exercises).
Type | Hours |
---|---|
Follow-up work | 6 |
Lecture | 12 |
Tutorial | 24 |
Wider reading or practice | 31 |
Preparation for scheduled sessions | 6 |
Completion of assessment task | 71 |
Total study time | 150 |
Assessment
Summative
Method | Percentage contribution |
---|---|
Continuous Assessment | 100% |
Repeat
Method | Percentage contribution |
---|---|
Set Task | 100% |
Referral
Method | Percentage contribution |
---|---|
Set Task | 100% |
Repeat Information
Repeat type: Internal & External