02807 Computational Tools for Big Data

THIS IS FOR THE COURSE IN 2016. DON’T USE THIS SITE AS INFORMATION FOR THE COURSE IN 2017.

When and where
This course will run entirely online this year. Lectures will be available on video, the teaching assistants and I will be available online (more info to come), and assignments will be handed in using Peergrade.io.

What
This course gives a short, intensive introduction to a broad set of computational tools and techniques for dealing with large data. Topics include the UNIX terminal, version control, Python, MapReduce, Apache Spark and graph databases.

Why
There are multiple reasons for teaching this course. Many of the people who work on machine learning and scientific computing come from a mathematical background, where the focus has been on the mathematical theory rather than on the practical tools and implementations of that theory. We are teaching this course to help students get to grips with the different tools and technologies available for working with large-scale data.

How
Instead of giving long lectures, we will use a more hands-on approach to teaching. Each week there will be a short video lecture on the subject – and then you will be solving exercises.

During the course, there will be a number of homework assignments covering the material you have worked on in class. At the end, you will work on a partly self-defined project. There is no exam in the course.

Who
This course was developed by David Kofoed Wind from the Section for Cognitive Systems at DTU Compute.

Prerequisites
For this course, it is assumed that you are familiar with programming in some language and that you have taken courses in an area that needs large-scale computing, such as machine learning, image analysis or scientific computing. We will be using Python throughout the course, and weeks 2, 3 and 4 will introduce Python and relevant packages.

Evaluation
The course is graded on the 7-step scale. Grades are based on the homework assignments and the quality of the peer feedback you give.

Syllabus

Week 1 – UNIX, Git and EC2
Week 2 – Python
Week 3 – Python Libraries
Week 4 – DBSCAN
Week 5 – SQL and NoSQL
Week 6 – Graph Databases
Week 7 – MapReduce
Week 8 – Apache Spark
Week 9 – Feature Hashing and LSH
Week 10 – Project work
Week 11 – Project work
Week 12 – Project work
Week 13 – Project work

Assignments and evaluation

There are six assignments in this course, covering most of the curriculum. The assignments consist of the exercises plus a project at the end. You can work on the assignments alone or in groups of at most three people.

The assignments are peer-graded by other students in the class. After handing in an assignment, you will receive a number (likely three) of assignments from other groups, which you have to grade.

Your grade will depend both on your own assignments and on your grading of the assignments from other groups. When evaluating your peer-grading of other assignments, we will look at whether your assessment is correct and well argued.

Note that your grade is not directly dependent on the grades of other students. We (the teachers) will decide which grade each student receives, using the peer-grades as additional data.

Information about assignments and deadlines will be available on Peergrade.io.