Assignments for intensive NLP course, University of Helsinki
Carry out all the exercises below and submit your answers on Moodle. Also submit a single Python file containing your full implementation.
Consider an information retrieval system that returns a retrieval set of 15 documents (retrieved). Each document in retrieved is labelled as relevant ('R') or non-relevant ('N'):
total_docs = 100
total_relevant = 10
retrieved = ['R', 'N', 'N', 'R', 'R', 'N', 'N', 'N',
             'R', 'N', 'R', 'N', 'N', 'R', 'R']
Continuing from the snippet above, compute the number of true positives, false positives, true negatives, and false negatives. Then compute the values of the following metrics (round each value to two decimal places):
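If you are unsure where to start, one way to get the counts is sketched below; it treats each retrieved document as a positive prediction and each non-retrieved document as a negative one.

tp = retrieved.count('R')                 # relevant documents that were retrieved
fp = retrieved.count('N')                 # non-relevant documents that were retrieved
fn = total_relevant - tp                  # relevant documents that were missed
tn = (total_docs - total_relevant) - fp   # non-relevant documents correctly left out
print(tp, fp, tn, fn)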
Consider the following scenario: a database consists of 10,000 documents in total, of which 10 are relevant.
In exercises 2.1-3, we evaluate a POS tagger based on a hidden Markov model (HMM), which you implemented on Day 3.
Today, we will again use the Penn Treebank corpus that you used yesterday, which you will already have downloaded using:
import nltk
nltk.download('treebank')
We use 80% of the sentences for training and the remaining 20% for testing. The following code splits the corpus into training and test sentences, and collects the test tokens and their correct tags into separate lists. Train the HMM with training_sents, as in exercise 2 of Day 3. Download ass5utils.py into the same directory as your source code.
from nltk.corpus import treebank
from nltk.tag.hmm import HiddenMarkovModelTagger
from ass5utils import split_corpus

# 80/20 split of the tagged sentences into training and test sets
training_sents, test_sents = split_corpus(treebank, 0.8)
# Flatten the test set into parallel lists of tokens and gold-standard tags
test_tokens = [t[0] for s in test_sents for t in s]
correct_tags = [t[1] for s in test_sents for t in s]
hmm_tagger = HiddenMarkovModelTagger.train(training_sents)
Use the HMM to predict the tags for test_tokens. (If you've forgotten how to do this, refer back to your code from Day 3.) Then, compute the confusion matrix between the predicted tags and correct_tags. You can use the nltk.metrics.ConfusionMatrix class for this exercise. (In the confusion matrix, rows are the correct tags and columns are the predicted tags. That is, an entry cm[correct_tag, predicted_tag] is the number of times a token with true tag correct_tag was tagged with predicted_tag.)
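A minimal sketch of both steps, using the variables defined in the snippet above (note that the tagger returns (token, tag) pairs, so the tags need to be extracted):

from nltk.metrics import ConfusionMatrix

# Predict tags for the test tokens; tag() returns (token, tag) pairs.
predicted_tags = [tag for _, tag in hmm_tagger.tag(test_tokens)]
# Reference (correct) tags come first, predicted tags second.
cm = ConfusionMatrix(correct_tags, predicted_tags)
print(cm['NN', 'JJ'])  # e.g. how often a true 'NN' was tagged 'JJ'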
Which (correct_tag, predicted_tag) pair was the most common error? How many times did it occur?
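One way to find this is to scan the off-diagonal cells of the confusion matrix; a sketch, assuming cm, correct_tags and predicted_tags from above:

# Collect every (count, correct, predicted) triple with correct != predicted,
# then take the cell with the largest count.
labels = set(correct_tags) | set(predicted_tags)
errors = [(cm[c, p], c, p) for c in labels for p in labels if c != p]
count, correct_tag, predicted_tag = max(errors)
print(correct_tag, predicted_tag, count)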
… 'NN'. (Round the value to two decimal places.)

We would like to know whether the HMM tagger is any good compared to naive baselines.
Now, implement the following functions:

random_tagger(tagset, tokens): given a list of tokens, assigns a POS tag randomly to each token. (The tagset is defined in ass5utils.py.)

majority_tagger(training_sents, tokens): finds the tag that is most common in the training sentences, and tags each token with this tag.
Compute the overall accuracy of both baselines, and compare the values with the HMM.
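A sketch of both baselines and their evaluation; the name tagset is assumed to be a list of tag strings exported by ass5utils.py (check the file for the actual name):

import random
from collections import Counter
from ass5utils import tagset  # assumed export; see ass5utils.py

def random_tagger(tagset, tokens):
    # Assign a uniformly random tag from the tagset to every token.
    return [random.choice(tagset) for _ in tokens]

def majority_tagger(training_sents, tokens):
    # Find the single most frequent tag in the training data,
    # then assign it to every token.
    counts = Counter(tag for sent in training_sents for _, tag in sent)
    majority_tag = counts.most_common(1)[0][0]
    return [majority_tag for _ in tokens]

def accuracy(predicted, correct):
    # Fraction of tokens whose predicted tag matches the correct tag.
    return sum(p == c for p, c in zip(predicted, correct)) / len(correct)

print(round(accuracy(random_tagger(tagset, test_tokens), correct_tags), 2))
print(round(accuracy(majority_tagger(training_sents, test_tokens), correct_tags), 2))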
Recall exercise 5 on Day 3, where you used the HMM as a language model. Again, use the log_probability() method of the HMM to compute the total log-probability of the test tokens. (The input should be given as (token, None) pairs.)
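A sketch of the call, pairing each test token with None to mark its tag as unknown:

# Unlabelled input: log_probability then sums over all possible tag sequences.
unlabelled = [(token, None) for token in test_tokens]
print(hmm_tagger.log_probability(unlabelled))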
Consider the following sentences from the Penn Treebank corpus:
s1 = ['So', 'far', 'Mr.', 'Hahn', 'is', 'trying', 'to', 'entice', 'Nekoosa', 'into', 'negotiating', 'a', 'friendly',
      'surrender', 'while', 'talking', 'tough']
s2 = ['Despite', 'the', 'economic', 'slowdown', 'there', 'are', 'few', 'clear', 'signs', 'that', 'growth', 'is',
      'coming', 'to', 'a', 'halt']
s3 = ['The', 'real', 'battle', 'is', 'over', 'who', 'will', 'control', 'that', 'market', 'and', 'reap',
      'its', 'huge', 'rewards']
Annotate the sentences with appropriate POS tags. The tags are described in the Penn Treebank POS tagging guidelines.
(It is not the aim of the exercise to annotate exactly according to guidelines, so simply make your best guess of the correct tag.)
The corresponding gold-standard tags of the sentences are below:
tags1 = ['IN', 'RB', 'NNP', 'NNP', 'VBZ', 'VBG', 'TO', 'VB', 'NNP', 'IN', 'VBG', 'DT', 'JJ', 'NN', 'IN', 'VBG', 'JJ']
tags2 = ['IN', 'DT', 'JJ', 'NN', 'EX', 'VBP', 'JJ', 'JJ', 'NNS', 'IN', 'NN', 'VBZ', 'VBG', 'TO', 'DT', 'NN']
tags3 = ['DT', 'JJ', 'NN', 'VBZ', 'IN', 'WP', 'MD', 'VB', 'DT', 'NN', 'CC', 'VB', 'PRP$', 'JJ', 'NNS']
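To check your guesses against the gold standard, a simple per-token agreement score is enough; my_tags1 below is a hypothetical name for your own annotation of s1:

def agreement(guessed, gold):
    # Fraction of positions where the guessed tag equals the gold tag.
    return sum(g == c for g, c in zip(guessed, gold)) / len(gold)

# my_tags1: your own annotation of s1 (hypothetical variable name)
# print(round(agreement(my_tags1, tags1), 2))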