This week will be an introduction to Apache Spark and the Python library pyspark.
Watch my lecture here: https://www.youtube.com/watch?v=KCoLQ7BZosA
After this week, you should know:
- What Apache Spark is
- How to write a Spark program
- How to implement and run a pyspark script locally
- Spark programming guide: http://spark.apache.org/docs/latest/programming-guide.html
- Documentation for pyspark: http://spark.apache.org/docs/latest/api/python/
Exercise 0 (don’t include in assignment):
You should install Spark and pyspark. Follow these steps:
- Download and install Vagrant from here: https://www.vagrantup.com/
- Download the following Vagrantfile: https://www.dropbox.com/s/q8yj7qd5hifkhfj/Vagrantfile?dl=0
- Go to the folder of the Vagrantfile
- Run “vagrant up” to boot the virtual machine (this will probably take some time)
- Access the Jupyter web UI for running IPython notebooks by navigating your web browser to “http://localhost:8001” (or “http://127.0.0.1:8001/”)
- Download the following IPython notebook with some examples: https://www.dropbox.com/s/hbz98qnhbspc1mf/Spark%20test.ipynb?dl=0
- Upload the notebook using the Upload button in the IPython notebook interface in the browser. Note: The Upload button is also a good way to copy data files to the Vagrant virtual machine.
- Open the notebook in the browser and run the code!
Write a Spark job to count the occurrences of each word in a text file. Document that it works with a small example.
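The following is a minimal sketch of one possible approach, not a reference solution. It assumes pyspark is installed and the input is a plain-text file. The names tokenize, word_count and main and the file name example.txt are hypothetical, and treating a word as a lowercase run of letters, digits and apostrophes is an assumption that the exercise does not specify.

```python
import re
from operator import add


def tokenize(line):
    """Split a line into lowercase words, dropping punctuation (assumed word definition)."""
    return re.findall(r"[a-z0-9']+", line.lower())


def word_count(lines_rdd):
    """Return an RDD of (word, count) pairs from an RDD of text lines."""
    return (lines_rdd
            .flatMap(tokenize)            # one record per word
            .map(lambda word: (word, 1))  # pair each word with a count of 1
            .reduceByKey(add))            # sum the counts per word


def main(path="example.txt"):  # hypothetical file name
    # Imported here so the helper functions above can be loaded without Spark.
    from pyspark import SparkContext
    sc = SparkContext("local[*]", "WordCount")
    try:
        counts = word_count(sc.textFile(path)).collect()
        # Print the most frequent words first, ties in alphabetical order.
        for word, n in sorted(counts, key=lambda kv: (-kv[1], kv[0])):
            print(word, n)
    finally:
        sc.stop()
```

Calling main() on a file of two or three short lines, where the counts can be checked by hand, is one way to document that the job works.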
Write a Spark job that determines whether a graph has an Euler tour (all vertices have even degree). You can assume that the graph you get is connected. This file https://www.dropbox.com/s/usdi0wpsqm3jb7f/eulerGraphs.txt?dl=0 has 5 graphs. For each graph, the first line gives the number of nodes and the number of edges. Each of the following lines gives two nodes that are connected by an edge. Two nodes can be connected by multiple edges.
It is fine if you split the file into 5 different files. You do not need to keep the node and edge counts at the top of the file.
Document that it works using a small example.
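The even-degree check can be sketched as below. This is an illustration under stated assumptions, not a reference solution: it assumes one graph per input, with the line holding the node and edge counts already removed, and each remaining line holding two whitespace-separated node labels. The names edge_endpoints, has_euler_tour and main are hypothetical.

```python
from operator import add


def edge_endpoints(line):
    """Turn an edge line 'u v' into [(u, 1), (v, 1)]; skip blank or malformed lines."""
    parts = line.split()
    if len(parts) != 2:
        return []
    u, v = parts
    return [(u, 1), (v, 1)]


def has_euler_tour(edges_rdd):
    """Return True if every vertex in the edge list has even degree.

    Connectivity is not checked; the exercise allows assuming it.
    """
    degrees = edges_rdd.flatMap(edge_endpoints).reduceByKey(add)
    odd = degrees.filter(lambda kv: kv[1] % 2 == 1)
    return odd.isEmpty()


def main():
    # Imported here so the helper functions above can be loaded without Spark.
    from pyspark import SparkContext
    sc = SparkContext("local[*]", "EulerTour")
    try:
        triangle = sc.parallelize(["1 2", "2 3", "3 1"])  # every vertex has degree 2
        path = sc.parallelize(["1 2", "2 3"])             # vertices 1 and 3 have degree 1
        print(has_euler_tour(triangle))
        print(has_euler_tour(path))
    finally:
        sc.stop()
```

Multiple edges between the same two nodes are counted once per line, so they each add to the degree of both endpoints. To read a graph from disk instead, sc.textFile could replace sc.parallelize.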
You are given a couple of hours of raw WiFi data from my phone: https://www.dropbox.com/s/964gq5o5bkzg7q3/wifi.data?dl=0
Compute the following things using Spark:
1. What are the 10 networks I observed the most, and how many times were they observed? Note: the bssid is unique for every network, the name (ssid) of the network is not necessarily unique.
2. What are the 10 most common wifi names? (
3. What are the 10 longest wifi names? (again,