Week 09: Feature Hashing and LSH

Content:

This week is an introduction to feature hashing and locality sensitive hashing (LSH).

Watch my lecture here: https://youtu.be/_1XQFIa15lM

Learning objectives:

After this week, you should know:

  • What feature hashing and locality sensitive hashing are
  • How to implement and use feature hashing and locality sensitive hashing
  • Use cases for feature hashing and locality sensitive hashing

Exercises:

Exercise 1:

Download all the following files: https://github.com/fergiemcdowall/reuters-21578-json/tree/master/data/full

Load them into Python and remove all articles that do not have at least one topic and a body. This should leave you with 10377 articles.
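A minimal loading-and-filtering sketch, assuming the files have been downloaded into a local `data/` directory and that each file contains a JSON array of article objects with `topics` and `body` fields (the directory name and field layout are assumptions; adjust them to match the actual files):

```python
# Minimal sketch: load all downloaded JSON files and keep only articles
# that have both at least one topic and a body. Assumes the files sit in
# a local "data/" directory and each holds a JSON array of article dicts.
import glob
import json

articles = []
for path in glob.glob("data/*.json"):
    with open(path, encoding="utf-8") as f:
        articles.extend(json.load(f))

articles = [a for a in articles if a.get("topics") and a.get("body")]
print(len(articles))  # should print 10377 if the filtering matches
```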

Make a bag-of-words encoding of the body of the articles. Remember to lower-case all words. This should result in a matrix, with one row for each article (which has a topic) and one column for each unique word in those articles.

How many features/columns do you get in your term/document matrix from your bag-of-words encoding? One solution (mine) gives a matrix of size 10377 x 70794 (this is without removing punctuation).
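One way to build the matrix is scikit-learn's `CountVectorizer`, which lower-cases by default. Note that its default tokenizer drops punctuation and single-character tokens, so your column count will likely differ from the 70794 quoted above:

```python
# Bag-of-words encoding of the article bodies. CountVectorizer lower-cases
# by default; the vocabulary size depends on the tokenizer used.
from sklearn.feature_extraction.text import CountVectorizer

bodies = [a["body"] for a in articles]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(bodies)  # sparse matrix: articles x unique words
print(X.shape)
```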

Train a random forest classifier to predict if an article has the topic ‘earn’ or not from the body-text (encoded using bag-of-words). Use 80% of the data for training data and 20% for test data. Use 50 trees (n_estimators) in your classifier.

How does the classifier perform (how large a fraction of the documents in the test set are classified correctly)?
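A sketch of the training step, continuing from the `X` and `articles` variables above (the `random_state` values are arbitrary, chosen here only for reproducibility):

```python
# Predict whether 'earn' appears among an article's topics, using an
# 80/20 train/test split and a 50-tree random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

y = np.array(["earn" in a["topics"] for a in articles])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # fraction of test articles classified correctly
```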

Now implement feature hashing and use 1000 buckets instead of the raw bag-of-words encoding.

How does this affect your classifier performance?
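A hand-rolled hasher could look like the sketch below: each token increments one of 1000 columns, chosen by Python's built-in `hash()`. Keep in mind that `hash()` is salted per process unless `PYTHONHASHSEED` is fixed, so use a stable hash such as one from `hashlib` if you need reproducible buckets; scikit-learn's `HashingVectorizer(n_features=1000)` is a ready-made alternative:

```python
# Simple feature hashing: every token increments one of n_buckets columns,
# chosen by hashing the token. Collisions are expected and accepted.
import re
import numpy as np

def hash_features(texts, n_buckets=1000):
    Xh = np.zeros((len(texts), n_buckets))
    for i, text in enumerate(texts):
        for token in re.findall(r"\w+", text.lower()):
            Xh[i, hash(token) % n_buckets] += 1
    return Xh

X_hashed = hash_features(bodies, n_buckets=1000)
# Rerun the same 80/20 split and 50-tree forest on X_hashed to compare
# accuracy against the raw bag-of-words encoding.
```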

Exercise 2:

Implement your own MinHash algorithm.

Using the same dataset as before, hash the body of some of the articles (encoded using bag of words) using MinHash – to get the code to run faster, work with just 100 articles to begin with.
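A minimal MinHash sketch, assuming each document is represented as its set of lower-cased tokens and each "permutation" is simulated with a random linear hash function of the form (a·x + b) mod p:

```python
# MinHash: for each of k random hash functions, record the minimum hash
# value over a document's token set. Similar sets get similar signatures.
import random
import re

PRIME = 2_147_483_647  # a large prime used as the hash modulus

def make_hash_funcs(k, seed=42):
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(k)]
    return [lambda x, a=a, b=b: (a * x + b) % PRIME for a, b in params]

def minhash_signature(tokens, hash_funcs):
    ids = {hash(t) % PRIME for t in tokens}
    return tuple(min(h(x) for x in ids) for h in hash_funcs)

docs = [set(re.findall(r"\w+", a["body"].lower())) for a in articles[:100]]
hash_funcs = make_hash_funcs(k=5)  # try k = 3, 5, 10
signatures = [minhash_signature(d, hash_funcs) for d in docs]
```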

Try different numbers of hash functions/permutations (for example 3, 5, 10).

Look at which documents end up in the same buckets. Do they look similar? Do they share the same topics?
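One simple way to inspect this is to treat the whole signature as a single LSH band, so that only documents with identical signatures share a bucket (full LSH would split the signature into several bands):

```python
# Bucket documents by their full MinHash signature and print the topics
# of any bucket containing more than one document.
from collections import defaultdict

buckets = defaultdict(list)
for i, sig in enumerate(signatures):
    buckets[sig].append(i)

for sig, members in buckets.items():
    if len(members) > 1:
        print(members, [articles[i]["topics"] for i in members])
```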
