Content:
This week will be an introduction to MapReduce and the Python library mrjob.
Watch my lecture here: https://youtu.be/nbQADQs57x8
Learning objectives:
After this week, you are supposed to know:
- What MapReduce is
- How to write a MapReduce procedure
- How to implement and run an mrjob script locally
Resources:
You should start by learning about the yield-command in Python. This is a command which you might not have seen before and it takes a few minutes to understand it. It is heavily used in the mrjob library that we will use this week. Read about yield here: http://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do
Finally go and learn about mrjob, a library for working with MapReduce in Python on your own computer. You can watch this: http://youtu.be/DrCKkQeU4kg?t=3m and check the full documentation here: https://pythonhosted.org/mrjob/guides.html
Exercises:
Exercise 1:
Define and implement a MapReduce job to count the occurrences of each word in a text file. Document that it works with a small example.
Exercise 2:
Define and implement a MapReduce job that determines if a graph has an Euler tour (all vertices have even degree) where you can assume that the graph you get is connected. This file https://www.dropbox.com/s/usdi0wpsqm3jb7f/eulerGraphs.txt?dl=0 has 5 graphs – for each graph, the first line tells the number of nodes and the number of edges
. The next
lines tells which two nodes are connected by an edge. Two nodes can be connected by multiple edges.
It is fine if you split the file into 5 different files. You do not need to keep the node and edge counts in the top of the file.
Document that it works using a small example.
Exercise 3:
Make a MapReduce job which counts the number of triangles in a graph. Use the following graph http://www.cise.ufl.edu/research/sparse/matrices/SNAP/roadNet-CA.html. It is formatted such that each line contains an link between two nodes (remember to remove the header in the beginning of file)
Document that it works using a small example and run it on the large file to see that you get the right results.