SWE404 Big Data Analytics

Undergraduate course in software engineering, Xiamen University Malaysia, 2020-04

This course covers the big data tools Hadoop and Spark in detail, along with related tools that provide SQL-like access to unstructured data. More advanced components such as Spark Streaming and MLlib are also introduced. We use PySpark, the Python API for Spark, as the main programming tool for implementing big data applications. We also introduce machine learning techniques such as classification, regression, clustering and collaborative filtering, and show how to apply them in real applications using PySpark and the MLlib API.

Lecture Notes

Lecture 1: Introduction

Lecture 2: Hadoop and HDFS

Lecture 3: HBase

Lecture 4: MapReduce and YARN

Lecture 5: Spark I

Lecture 6: Spark II

Lecture 7: Machine Learning and MLlib

Lecture 8: Classification and Regression Algorithms I

Lecture 9: Classification and Regression Algorithms II

Lecture 10: Unsupervised Learning Algorithms

Lecture 11: Recommender Systems & Collaborative Filtering

Lecture 12: Spark Streaming

Lecture 13: Data Visualization

Assignments and Project

Assignment 1 Dataset

Assignment 2 Dataset

Assignment 3 Dataset

Assignment 4

Assignment 5 Dataset

Project Dataset