Module overview
The challenge of data mining is to transform raw data into useful information and actionable knowledge. Data mining is the computational process of discovering patterns in data sets, using methods at the intersection of artificial intelligence, machine learning, statistics, and data management.
This course will introduce key concepts in data mining, information extraction and information indexing, including specific algorithms and techniques for feature extraction, clustering, outlier detection, topic modelling and prediction on complex unstructured data sets. By taking this course you will gain a broad view of the general issues surrounding unstructured and semi-structured data and the application of algorithms to such data. At a practical level you will have the chance to explore an assortment of data mining techniques, which you will apply to problems involving real-world data.
Linked modules
Prerequisite: COMP3206 or COMP3222 or COMP3223 or COMP6229 or COMP6245 or COMP6246
Aims and Objectives
Learning Outcomes
Knowledge and Understanding
Having successfully completed this module, you will be able to demonstrate knowledge and understanding of:
- Key concepts, tools and approaches for data mining on complex unstructured data sets (including multimedia mining, Twitter analysis, etc.)
- State-of-the-art data-mining techniques including topic modelling approaches such as LDA, clustering techniques and applications of matrix factorisations
- Natural language processing techniques for extracting features from text
- Techniques for modelling and extracting features from non-textual data
- The theory behind modern data indexing systems
- Theoretical concepts and the motivations behind different data-mining approaches
Subject Specific Intellectual and Research Skills
Having successfully completed this module you will be able to:
- Conceptually understand the role of data-mining, together with the mathematical techniques this requires
Subject Specific Practical Skills
Having successfully completed this module you will be able to:
- Solve real-world data-mining, data-indexing and information extraction tasks
Syllabus
Key concepts:
- The importance of data-mining
- Real-world applications of data-mining (cyber-security, financial forecasting, trend prediction, etc.)
- What is unstructured data
  - Modalities of data
- Underlying techniques
  - Inverted indexes
  - Matrix factorisation
  - Dimensionality reduction
Modelling data:
- Understanding Text
  - Bags of Words
  - TF-IDF
- Dealing with non-textual data
  - Feature extraction techniques
  - Bags of features
  - Encoding and embedding
Modern data indexing at scale
- Information retrieval models
- Ranking models
Unimodal data mining:
- Topic modelling (techniques such as LSA, pLSA, LDA, NNMF)
- Clustering (hierarchical agglomerative, spectral)
- Multi-dimensional scaling
- Mining graphs and networks (hubs and authorities [PageRank/HITS], spectral methods, etc.)
- Finding outliers
Multimodal data mining:
- Finding independent features (e.g. ICA, NNMF)
- Finding correlations and making predictions (CL-LSI, classifiers, etc.)
- Collaborative filtering and recommender systems
Learning and Teaching
Type | Hours |
---|---|
Completion of assessment task | 20 |
Tutorial | 16 |
Wider reading or practice | 46 |
Follow-up work | 12 |
Lecture | 24 |
Revision | 20 |
Preparation for scheduled sessions | 12 |
Total study time | 150 |
Resources & Reading list
Textbooks
Toby Segaran (2007). Programming Collective Intelligence: Building Smart Web 2.0 Applications. O'Reilly.
Assessment
Summative
Summative assessment description
Method | Percentage contribution |
---|---|
Final Assessment | 70% |
Continuous Assessment | 30% |
Referral
Referral assessment description
Method | Percentage contribution |
---|---|
Set Task | 100% |
Repeat
Repeat assessment description
Method | Percentage contribution |
---|---|
Set Task | 100% |
Repeat Information
Repeat type: Internal & External