Assignments for intensive NLP course, University of Helsinki
Carry out all the exercises below and submit your answers on Moodle. Also submit a single Python file containing your full implementation.
For this exercise session, we will mostly be using these five ‘documents’.
documents = ['Wage conflict in retail business grows',
'Higher wages for cafeteria employees',
'Retailing Wage Dispute Expands',
'Train Crash Near Petershausen',
'Five Deaths in Crash of Police Helicopter']
In this exercise, we will build a simple document-term matrix for the documents in Exercise 0.
For each document, convert to lowercase, tokenize, remove stopwords and then lemmatize.
You can use your code from day 2’s assignment to do all of these
things: the filter_text()
function. (You will need to remove the
sentence tokenizer, but otherwise you can reuse the code exactly.)
Construct a vocabulary for your corpus which is all the words that appear in the documents minus stopwords.
Construct a document-term matrix by going through each document and checking if the vocabulary word is present or not.
The shape of the matrix will be the no. of documents by the vocabulary size (n_docs x vocab_size)
.
Try importing Scitkit-learn:
import sklearn
If you do not have it installed, install it in your virtual environment:
pip install scikit-learn
Scikit-learn actually has a class called the CountVectorizer to build document-term matrices easily and includes a number of options such as removing stopwords, tokenizing, indicating encoding (important for documents in other languages), and others. For more information, see the documentation. At the bottom of the page is a code snippet to build count vectors for each document. You can easily convert these to a binary doc-term matrix.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
counts = X.toarray() # Get the doc-term count matrix
dt = counts > 0 # Convert to a binary matrix
doc_term_mat = dt * 1 # If you prefer, represent as 1s and 0s
How does your doc-term matrix from 1.1 compare to your doc-term matrix from 1.2? Are they exactly the same or are their differences? If they are different, what could account for such difference?
For the next exercises, we will make use the doc-term matrix with count vectors produced by the CountVectorizer
.
Suppose you have the query ‘retail wages’. Rank the documents by relevance to this query by getting the dot product of the query by the doc-term matrix.
To convert the query string into a vector, use the transform()
method of the vectorizer you created in the previous exercise. Remember that the vectorizer expects a list of strings.
Use numpy
’s dot()
to compute dot products.
If necessary, remind yourself of what happens when you take a dot product of a matrix and vector. Looking at the diagrams on the lecture slides might also help.
Normalize the count vectors of the doc-term matrix by the document length and perform the same relevance ranking.
In the previous exercise, our doc-term matrix is composed of count vectors where each element in the vector is the number of times a word appeared in the document.
In this exercise, we will convert our doc-term matrix which is composed of count vectors to TF-IDF vectors. Construct a TF-IDF doc-term matrix using the TfidfVectorizer from Scikit-learn. This implements the TF-IDF calculations seen in lectures.
Perform the same relevance ranking that we did in Exercise 2.1 by getting the dot product of the same query with your new TF-IDF doc-term matrix. Don’t forget to convert the query string to a vector using the transform()
method of the TfidfVectorizer
this time.
Using the doc-term matrix from Exercise 2.2, use cosine similarity for each document pair to find which two documents are most similar to each other.
You can use the cosine_similarity()
method from Scikit-learn for this.
Tip: The itertools
package can produce the document pairs so you don’t have to construct them yourselves.
import itertools
doc_count = len(documents)
doc_list = [i for i in range(doc_count)]
doc_pairs = list(itertools.combinations(doc_list, 2))
# doc_pairs contain tuples where each tuple is a pair of document index numbers
Suppose you are given two new documents that you have not seen so far:
new_docs = [
'Plane crash in Baden-Wuerttemberg', # Doc 3a
'The weather' # Doc 3b
]
Construct the TF-IDF matrix for these unseen documents (use transform()
again, not fit_transform()
) and find the documents from our original corpus that are most similar to each
using cosine similarity.
We will use the Gensim package to perform topic modelling on a corpus of news articles.
Check whether Gensim is installed and importable:
import gensim
If not, install it in your virtual environment:
pip install gensim
Topic modelling works better if we have more data. We have provided de-news.txt, which consists of 9 short news stories separated with tags. Use the following method to separate the articles (you can also preprocess the articles to remove stopwords, punctuations, etc. if you want).
def prepare_dataset(filename):
articles = []
text = open(filename,'r').read().split()
index_start = list(np.where(np.array(text)=="<DOC")[0])
for i in range(len(index_start)-1):
start_art = index_start[i]+2
end_art = index_start[i+1]
article = text[start_art:end_art]
articles.append(article)
return articles
The most popular topic modelling method is LDA. The following lines will train a topic model with two topics using Gensim:
from gensim.models import LdaModel
from gensim import corpora
common_dictionary = corpora.Dictionary(articles)
# Transform each doc into a bag of words
common_corpus = [common_dictionary.doc2bow(a) for a in articles]
# This line is the actual training part and might take a few seconds
n_topics = 2
lda = LdaModel(common_corpus, id2word=common_dictionary, num_topics=n_topics, passes=200)
# After training is done, we can check the top words of each topic
for k in range(n_topics):
top_words = lda.show_topic(k, topn=5)