Big Data Platforms is a 5 ECTS advanced Master's-level course. It focuses on big data platforms and on the key algorithmic ideas and methods used to implement them. After completing the course, you will be able to list many of the key technologies used in big data processing and to select suitable methods for solving challenging big data processing tasks using cloud computing technologies. You will also be able to compare the scalability and fault tolerance implications of the selected methods.
Main topics are:
- distributed computing,
- Warehouse-Scale Computers,
- fault tolerance in distributed systems,
- distributed file systems,
- distributed batch processing with the MapReduce and the Apache Spark (PySpark) computing frameworks, and
- distributed cloud-based databases.
The course material will consist of lecture materials and exercises provided by the lecturer.
Course Target Audience
The course is suitable for those who are interested in big data platforms used in cloud computing and who have previous knowledge of programming, database systems, and command line tools. It is an optional course in the Data Science Master's Program and is also suitable for Computer Science Master's Program students and for University of Helsinki exchange students.
To attend this course, you must have:
- basic programming skills (Python),
- skills to work with command line tools in Linux, and
- basic knowledge in database systems (SQL).
The lectures of the course will be held on Zoom. The Zoom links to the lecture sessions and the slides will be made available below. A video recording of each lecture will be made available a couple of hours after the live session. The link to the Zoom lectures is:
|Lecture||Lecture date||Lecture time|
|Lecture 1||Tue 1.9.2020||10:00-11:30|
|Lecture 2||Thu 3.9.2020||12:00-13:30|
|Lecture 3||Tue 8.9.2020||10:00-11:30|
|Lecture 4||Thu 10.9.2020||12:00-13:30|
|Lecture 5||Tue 15.9.2020||10:00-11:30|
|Lecture 6||Thu 17.9.2020||12:00-13:30|
|Lecture 7||Tue 22.9.2020||10:00-11:30|
|Lecture 8||Thu 24.9.2020||12:00-13:30|
|Lecture 9||Tue 29.9.2020||10:00-11:30|
|Lecture 10||Thu 1.10.2020||12:00-13:30|
|Lecture 11||Tue 6.10.2020||10:00-11:30|
|Lecture 12 (recap lecture)||Thu 8.10.2020||12:00-13:30|
Lecture Slides and Videos
The lecture slides contain all the material needed to pass the course. The videos go through this material and contain no additional information needed for the exam.
|Lecture||Lecture Slides||Lecture Videos|
|Lecture 1||Lecture 1 slides||Lecture 1 video|
|Lecture 2||Lecture 2 slides||Lecture 2 video|
|Lecture 3||Lecture 3 slides||Lecture 3 video|
|Lecture 4||Lecture 4 slides||Lecture 4 video|
|Lecture 5||Lecture 5 slides||Lecture 5 video|
|Lecture 6||Lecture 6 slides||Lecture 6 video|
|Lecture 7||Lecture 7 slides||Lecture 7 video|
|Lecture 8||Lecture 8 slides||Lecture 8 video|
|Lecture 9||Lecture 9 slides||Lecture 9 video|
|Lecture 10||Lecture 10 slides||Lecture 10 video|
|Lecture 11||Lecture 11 slides||Lecture 11 video|
|Lecture 12 (recap)||Combined Lecture 1-10 slides||Lecture 12 video|
Home Exercise Schedule
The course contains programming exercises in which you will use the Spark framework to solve big data processing tasks. We will use PySpark, the Python interface to Spark, and the exercises involve several database-style analytics queries. Basic Python programming skills and knowledge of database programming, especially the SQL query language, will therefore be very helpful for completing the home exercises. In addition to passing the exam, you need at least 50% (17 points) of the home exercise points to pass the course. You also get +1 to the exam grade if you get more than 80% of the points (28 points or above); this does not apply to grades 0 and 5. Students can also obtain six extra points beyond 100% (34 points) by doing the tasks in an optional Extra assignment. These points are useful for students who would like a better overall assignment grade.
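The point thresholds above can be written out as a small calculation. The following is only a sketch of the rule as stated here, not the official grading script. It assumes an integer grade scale from 0 to 5 where 0 means fail, that the course grade otherwise equals the exam grade, and that points from the optional Extra assignment count towards the thresholds. The function name `final_grade` and the constant names are made up for illustration.

```python
# Thresholds taken from the course text; names are hypothetical.
TOTAL_POINTS = 34   # 100% of the regular home exercise points
PASS_POINTS = 17    # 50% of 34
BONUS_POINTS = 28   # more than 80%: 0.8 * 34 = 27.2, so 28 or above


def final_grade(exam_grade: int, exercise_points: int) -> int:
    """Return the course grade on an assumed 0-5 scale (0 = fail)."""
    if not 0 <= exam_grade <= 5:
        raise ValueError("exam_grade must be between 0 and 5")
    if exercise_points < 0:
        raise ValueError("exercise_points must not be negative")

    # Both the exam and the home exercises must be passed.
    if exam_grade == 0 or exercise_points < PASS_POINTS:
        return 0

    # The +1 bonus does not apply to grade 5 (grade 0 was handled above).
    if exercise_points >= BONUS_POINTS and exam_grade < 5:
        return exam_grade + 1

    return exam_grade


if __name__ == "__main__":
    # Exam grade 3 with 28 exercise points gets the +1 bonus.
    print(final_grade(3, 28))
```

For example, under these assumptions an exam grade of 3 with 27 exercise points stays at 3, while the same exam grade with 16 exercise points does not pass the course.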
The schedule for the home exercises is as follows:
|Exercise||Release Date||Due Date (23.59 UTC)|
|Introduction to Spark + RDD Programming||8.9, 10.9||25.9|
|Machine Learning (MLlib)||24.9||9.10|
|Extras (Optional for extra points)||15.10||30.10|
To complete the exercises you will have to use the JupyterHub notebook platform found at https://bigdata.cs.helsinki.fi. A short introduction video on how to use the platform to complete and submit the exercises can be found at https://youtu.be/F0mjmycxWUg.
Assignment grades are not published instantly; it takes a day or two for them to be published or updated. The current plan is to run the assignment grading scripts late at night, when there is less load on our servers. Please allow some time before checking back on your assignment grades. To gain more confidence in the correctness of your submission and your final grade, use the assignment submission validation feature before submitting.
For general course-related questions and group discussions, use our Telegram group. Try to use the group to help each other out and to discuss exercise and course-related issues. We will also check the group periodically to address the more important unanswered issues. For specific personal course-related questions, please use the course email address (email@example.com).
Course Telegram Group
The course has a Telegram group where students can help each other. The lecturer and course assistants will also periodically join the conversation. You can join the group through the link:
In addition to the programming exercises, you need to pass the exam, which will be a web-based exam scheduled at the end of the course. The next exam is scheduled for Wednesday, 14 April 2021, at 16:00-19:30 Finnish time. The exam will be available from the Exam Website:
Passing the Course
To pass the course, you must pass both home exercises with at least 50% of the points and pass the exam.