This week we will introduce some of the cool libraries for processing data with Python. The libraries we will be looking at are:
- NumPy/SciPy for handling matrices and for scientific computing
- Scikit-learn for machine learning
- Cython for writing C extensions for Python code
- Pandas for managing and working with data structures and data analysis workflows
Together, these libraries make Python a powerful tool for data analysis, but each one has far more functionality than we can cover. This week we focus on outlining what each library can do, so that you can pick the right tool the next time you face a problem.
After this week, you should know:
- How to install and load each of the libraries listed above.
- What each of these libraries is capable of.
- How to use the basic features of each library.
- Reading: Read about Numpy and Scipy here http://www.engr.ucsb.edu/~shell/che210d/numpy.pdf
- Reading: Read about scikit-learn here http://scikit-learn.org/stable/tutorial/basic/tutorial.html
- Reading: Learn about Cython by watching http://www.youtube.com/watch?v=JKCjsRDffXo and, if you want the hard-core details, go for http://nbviewer.ipython.org/github/iminuit/iminuit/blob/master/tutorial/hard-core-tutorial.ipynb
- Reading: Learn about Pandas by reading http://byumcl.bitbucket.org/bootcamp2013/labs/pandas.html or watching http://www.youtube.com/watch?v=MxRMXhjXZos
Exercise 3.1 (numpy):
Write a script that reads a matrix from a file like this one and solves the linear matrix equation Ax=b, where b is the last column of the input matrix and A consists of the remaining columns. It is okay to use the solve() function from numpy.linalg. Does the result make sense?
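As a starting point, here is a minimal sketch of the split-and-solve step. The input format is an assumption (whitespace-separated numbers, one matrix row per line), the 2x3 matrix is made-up stand-in data, and solve_augmented is a hypothetical helper name, not part of NumPy.

```python
import io

import numpy as np


def solve_augmented(source):
    """Read an augmented matrix [A | b] and solve Ax = b.

    `source` may be a filename or a file-like object. The layout
    (whitespace-separated values) is an assumption about the input file.
    """
    M = np.loadtxt(source)
    A = M[:, :-1]   # every column except the last
    b = M[:, -1]    # the last column
    x = np.linalg.solve(A, b)  # requires A to be square and non-singular
    return A, b, x


# Made-up stand-in for the exercise file: 2x + y = 5 and x + 3y = 10.
demo = io.StringIO("2 1 5\n1 3 10\n")
A, b, x = solve_augmented(demo)

# One way to check that the result makes sense: plug x back in.
residual_ok = np.allclose(A @ x, b)
print(x, residual_ok)
```

Checking that A @ x reproduces b is a cheap sanity test. Note that solve() raises LinAlgError when A is singular or not square, so the real input file may need that case handled.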
Exercise 3.2 (scipy):
Write a script that reads in this list of points (x,y) and fits them with a polynomial of degree 3. Solve for the (real) roots of the polynomial numerically using SciPy’s optimization functions (not the roots function in NumPy). Does the result make sense? Plot something to check.
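One possible approach, sketched under assumptions: fit with numpy.polyfit, scan a grid for sign changes, and refine each bracket with scipy.optimize.brentq. The points below are made up (sampled from a cubic with known roots) and stand in for the exercise data; brentq is just one of several SciPy root finders you could pick.

```python
import numpy as np
from scipy.optimize import brentq

# Made-up stand-in data sampled from (x + 1)(x - 1)(x - 2).
xs = np.linspace(-2.0, 3.0, 11)
ys = xs**3 - 2 * xs**2 - xs + 2

# Least-squares fit of a degree-3 polynomial.
coeffs = np.polyfit(xs, ys, 3)
poly = np.poly1d(coeffs)

# Scan a grid over the data range and look for sign changes.
# Limitation: this misses roots where the curve only touches zero,
# and roots that fall exactly on a grid point.
grid = np.linspace(xs.min(), xs.max(), 200)
vals = poly(grid)

roots = []
for lo, hi, f_lo, f_hi in zip(grid[:-1], grid[1:], vals[:-1], vals[1:]):
    if f_lo * f_hi < 0:
        # brentq needs a bracket [lo, hi] with opposite signs at the ends.
        roots.append(brentq(poly, lo, hi))

print(roots)
```

To check the result visually, plot the points, the fitted curve on the grid, and the roots on the x-axis with matplotlib.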
Exercise 3.3 (pandas):
Do the first two exercises (Todo’s) at the bottom of http://byumcl.bitbucket.org/bootcamp2013/labs/pandas.html
Exercise 3.4 (scikit-learn):
Last week you read in a dataset for this Kaggle competition and created a bag-of-words representation on the review strings. Train a logistic regression classifier for the competition using your bag-of-words features (and possibly some of the others) to predict the variable “requester_received_pizza”. For this exercise, you might want to work a little bit more on your code from last week. Use 90% of the data as training data and 10% as test data.
If you don’t know anything about machine learning, try to Google a bit and figure out what training and test data are, and how you train a classifier.
How good is your classifier? Discuss the performance of the classifier.
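The 90/10 split and the training step could look roughly like the sketch below. The texts and labels are made-up toy data, not the Kaggle dataset, and the parameter choices (random_state, max_iter, stratified split) are assumptions for illustration. The vocabulary is fitted on the training texts only, so nothing from the test set leaks into the features.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up toy data standing in for the request texts and the
# "requester_received_pizza" labels.
texts = [f"please help hungry student number {i}" for i in range(10)]
texts += [f"just want free pizza tonight {i}" for i in range(10)]
labels = [1] * 10 + [0] * 10

# 90% training data, 10% test data; stratify keeps both classes in each part.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.1, random_state=0, stratify=labels
)

# Build the bag-of-words vocabulary from the training texts only.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Fraction of correctly classified test examples.
accuracy = clf.score(X_test, y_test)
print(accuracy)
```

Accuracy alone can be misleading when one class is much more common than the other, so when discussing performance it is worth also comparing against a baseline that always predicts the majority class.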
Exercise 3.5 (cython):
Write a simple Python function for computing the sum with 10,000 terms (this should be around 1.644), and call it 500 times in a row (to make the execution time measurable). Now compile the code with Cython and see how much speedup you can achieve. Remember to declare your variable types.
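A pure-Python baseline to time before moving to Cython might look like this. The formula for the sum is not shown in the text above, so this sketch assumes it is the sum of 1/k² for k = 1 to 10,000, which matches the stated value of about 1.644; swap in the actual formula if it differs. The function names are hypothetical.

```python
import timeit


def partial_sum(n_terms):
    """Sum 1/k^2 for k = 1..n_terms (assumed formula, see lead-in)."""
    total = 0.0
    # An explicit loop, since this is the part a typed Cython
    # version (e.g. with cdef int k, cdef double total) can speed up.
    for k in range(1, n_terms + 1):
        total += 1.0 / (k * k)
    return total


def repeat_sum(n_terms=10_000, repeats=500):
    """Recompute the sum `repeats` times so the runtime is measurable."""
    result = 0.0
    for _ in range(repeats):
        result = partial_sum(n_terms)
    return result


value = repeat_sum()
elapsed = timeit.timeit(repeat_sum, number=1)
print(f"sum = {value:.4f}, time = {elapsed:.3f} s")
```

Timing the same two functions after compiling them with Cython, first without and then with type declarations, gives the speedup figures to compare.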
One thought on “Week 03: Python Libraries”
In Exercise 3.5 you should not use Python’s sum()-function since Cython is not able to optimize this. Use regular loops instead.
I was able to get ~25 times speedup.