This week is an introduction to the Python programming language. Many languages are used for data science, for example Matlab, R, Julia, C/C++ and Java. We spend some time on Python in this course for two reasons. First, Python is becoming more and more popular for data science, scientific computing and general programming. Second, all the subjects we teach can be tried out using cool Python packages, for example mrjob for map-reduce. Sticking with a single language for the course will hopefully make things easier. It is not possible to learn all of Python in a single week, but we hope that everyone will get a taste of the language and then develop their skills throughout the course.
After this week, you should know:
- How to program in Python.
- How to solve the exercises given below.
Important: You should solve the exercises using native Python functionality. Do not take the effortless route of using functions that magically turn base-10 numbers into binary strings, create a bag-of-words matrix from a corpus, etc.
- Reading: Read http://stephensugden.com/crash_into_python/
Write a script with two functions. The first function should take a file name as a parameter, read in a matrix like the one here and return it as a list of lists. The second function should do the inverse: take two arguments, a list of lists and the name of the file to save to, and write the list of lists to that file in the same format as the initial file.
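As a starting point, here is a minimal sketch of the two functions. The format of the matrix file is not described here, so the sketch assumes one row per line with values separated by whitespace; adjust the separator and the number type to match the actual file. The names `read_matrix` and `write_matrix` are made up for illustration.

```python
def read_matrix(filename):
    """Read a matrix from a text file and return it as a list of lists."""
    matrix = []
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Assumption: values on a line are separated by whitespace.
            matrix.append([float(value) for value in line.split()])
    return matrix


def write_matrix(matrix, filename):
    """Write a list of lists to a text file, one row per line."""
    with open(filename, "w") as f:
        for row in matrix:
            # Assumption: same whitespace-separated format as read_matrix expects.
            f.write(" ".join(str(value) for value in row) + "\n")
```

Writing a matrix with `write_matrix` and reading it back with `read_matrix` should give you the same list of lists, which is a handy check that the two functions really are inverses.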
Write a script that takes an integer N, and outputs all bit-strings of length N as lists. For example: 3 -> [0,0,0], [0,0,1], [0,1,0], [0,1,1], [1,0,0], [1,0,1], [1,1,0], [1,1,1]. As a sanity check, remember that there are 2^N such lists.
Do not use the bin function in Python. Do not use strings at all. Do not import anything. Try to solve this using only lists, integers, if-statements, loops and functions.
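One possible approach, sketched below, builds the bit-strings one position at a time: start with the single empty list, then repeatedly extend every list so far with a 0 and with a 1. It uses only lists, integers, loops and a function. The name `bit_strings` is made up for illustration, and this is just one of several ways to solve the exercise.

```python
def bit_strings(n):
    """Return all bit-strings of length n as a list of lists of 0s and 1s."""
    result = [[]]  # there is exactly one bit-string of length 0: the empty one
    for _ in range(n):
        extended = []
        for prefix in result:
            # Each existing prefix gives two longer bit-strings.
            extended.append(prefix + [0])
            extended.append(prefix + [1])
        result = extended
    return result


print(bit_strings(2))  # [[0, 0], [0, 1], [1, 0], [1, 1]]
```

Since every pass through the outer loop doubles the number of lists, the result has 2^N entries, which matches the sanity check above.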
Write a script that takes this file (from this Kaggle competition), extracts the request_text field from each dictionary in the list, and constructs a bag-of-words representation of each text (string to count-list).
There should be one row per text. The matrix should be N x M, where N is the number of texts and M is the number of distinct words in all the texts.
The result should be a list of lists ([[0,1,0],[1,0,0]] is a matrix with two rows and three columns).
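The sketch below shows one way the counting could be done with plain dictionaries and lists. It rests on a few assumptions: the file is JSON (so the standard `json` module is used only to load it), a "word" is whatever you get by lowercasing the text and splitting on whitespace, and columns are ordered by first appearance of each word. The function names are made up for illustration.

```python
import json


def load_request_texts(filename):
    """Load the file and return the request_text of every entry.

    Assumption: the file holds a JSON list of dictionaries.
    """
    with open(filename) as f:
        records = json.load(f)
    return [record["request_text"] for record in records]


def bag_of_words(texts):
    """Return (vocabulary, matrix) for a list of strings.

    matrix[i][j] counts how often vocabulary[j] occurs in texts[i].
    """
    # Assumption: words are lowercased, whitespace-separated tokens.
    tokenized = [text.lower().split() for text in texts]

    # Give every distinct word a column index, in order of first appearance.
    column = {}
    for words in tokenized:
        for word in words:
            if word not in column:
                column[word] = len(column)

    # Build one count row per text.
    matrix = []
    for words in tokenized:
        row = [0] * len(column)
        for word in words:
            row[column[word]] += 1
        matrix.append(row)

    vocabulary = list(column)  # dicts keep insertion order (Python 3.7+)
    return vocabulary, matrix


vocabulary, matrix = bag_of_words(["a b a", "b c"])
print(vocabulary)  # ['a', 'b', 'c']
print(matrix)      # [[2, 1, 0], [0, 1, 1]]
```

You will probably want to refine the tokenization, for example by stripping punctuation, since "pizza" and "pizza." are counted as different words here.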