Assignments for NLP course, University of Helsinki
Carry out all the exercises below and submit your answers on Moodle. Also submit a single Python file containing your full implementation.
In this exercise, we will load a corpus annotated manually with POS tags. NLTK provides easy access to the MASC corpus, a part of the Open American National Corpus (OANC) that has been annotated with various linguistic analyses.
A POS-tagged corpus like this can be used to estimate the parameters of an HMM, as seen in the lectures. The trained model can then be used to POS tag new sentences.
We will begin by computing some of the parameters of an HMM using Maximum Likelihood Estimation (MLE) based on counts from the corpus. (See lecture slides for more explanation of this.)
Start by downloading the MASC corpus:
import nltk
nltk.download("masc_tagged")
You can load data from the corpus like this:
from nltk.corpus import masc_tagged
print(masc_tagged.tagged_sents()[0])
print(masc_tagged.tagged_sents()[1])
Each sentence consists of (word, tag) pairs:
[('Good', 'JJ'), ('evening', 'NN'), ... ]
Recall that the HMM contains two distributions: the transition distribution between tags and the emission distribution from tags to words.
By iterating over the data in MASC, collect the counts needed to estimate the transition distribution from the verb tag (VB) to all other tags (i.e. p(t[i+1] | t[i] = VB)).
Also collect counts needed to estimate the emission distribution for the VB tag (i.e. p(w[i] | t[i] = VB)).
Compute both of these distributions.
Collect these counts and estimate the distributions for VB using MLE.
- Submit the computed p(t[i+1] = DT | t[i] = VB), the probability of a verb being followed by a determiner.
- Submit the computed p(w[i] = ‘feel’ | t[i] = VB).
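As a rough sketch of the counting involved, the snippet below estimates both distributions for a single tag from a tiny hand-made corpus in the same (word, tag) format. The function name mle_distributions and the toy sentences are made up for illustration. It also assumes that a VB at the very end of a sentence contributes no transition count and that words are compared case-sensitively; you may prefer different choices.

```python
from collections import Counter


def mle_distributions(tagged_sents, tag="VB"):
    """Return (transition, emission) MLE estimates for one tag.

    transition maps next_tag -> p(next_tag | tag)
    emission   maps word     -> p(word | tag)
    """
    next_tag_counts = Counter()
    word_counts = Counter()
    for sent in tagged_sents:
        for i, (word, t) in enumerate(sent):
            if t != tag:
                continue
            word_counts[word] += 1
            # Assumption: only count a transition if a next token exists.
            if i + 1 < len(sent):
                next_tag_counts[sent[i + 1][1]] += 1

    # MLE: relative frequency, i.e. each count divided by the total.
    trans_total = sum(next_tag_counts.values())
    emit_total = sum(word_counts.values())
    transition = {t: c / trans_total for t, c in next_tag_counts.items()}
    emission = {w: c / emit_total for w, c in word_counts.items()}
    return transition, emission


# Toy data standing in for masc_tagged.tagged_sents()
toy = [
    [("I", "PRP"), ("will", "MD"), ("feel", "VB"), ("the", "DT"), ("wind", "NN")],
    [("to", "TO"), ("see", "VB"), ("a", "DT"), ("dog", "NN")],
    [("they", "PRP"), ("can", "MD"), ("feel", "VB"), ("better", "JJR")],
]
transition, emission = mle_distributions(toy)
print(transition)
print(emission)
```

On the real corpus you would pass masc_tagged.tagged_sents() instead of the toy list.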
NLTK contains a function to collect all the necessary counts to train an HMM using MLE, exactly as you have done above. The class nltk.tag.hmm.HiddenMarkovModelTagger implements HMM training and tagging.
Use the function HiddenMarkovModelTagger.train() to estimate all probabilities for an HMM POS tagger from the MASC corpus.
Your HMM can now be used to tag new sentences:
tagged_sent = hmm.tag(
["The", "answer", "is", "blowing", "in", "the", "wind", "."]
)
The POS tag set used by this corpus is the same as the Penn Treebank. You can find a description of each tag here.
Try tagging the following sentences:
Once we have finished , we will go out .
There is always room for more understanding between warring peoples .
Evidently , this was one of Jud 's choicest tapestries , for the noble emitted a howl of grief and rage and leaped from his divan .
The second example contains the ambiguous word understanding, which could take several different tags. Look at how your tagger handled it.
Take a look at how the tagger behaves when it sees previously unseen words. Jud (in the third example) is not in the training corpus, but the tagger makes a guess anyway. Although it has no tags with a non-zero p(w=’Jud’|t), it is able to choose a tag that fits well in the context, thanks to the transition distribution.
Try some more examples with unseen words and observe how the tagger manages. Here are a couple to get you started:
Misjoggle in a gripty hifnipork .
One fretigy kriptog is always better than several intersplicks .
- What tag did the tagger assign to understanding? Is this correct?
- Give one example of a sentence with unseen words where the tagger, in your opinion, picks the correct tags for the unseen words.
- Give one example where the tagger is thrown off by an unseen word. What information might help it make a better guess? (Other than seeing the word in its training set!)
- Submit your answers as text
There are many well studied algorithms for working with HMMs: performing efficient inference (tagging), training in ways that deal better with sparse or unseen data and unsupervised training.
Here we will use an unsupervised training method to let our HMM benefit from having seen more data. NLTK provides an implementation of the Baum-Welch algorithm, an instance of the Expectation-Maximization (EM) algorithm.
Baum-Welch is able to infer a tagging model with no labelled data at all. A downside of this is that the tags do not necessarily have any correspondence to real POS tags. Here we will train on the labelled data we used above, which will be used to initialize a model, and a further unlabelled set. The algorithm iteratively makes guesses as to the POS tags for words and updates the model’s probability distributions from these guesses.
Download the text file radio_planet_tokens.txt.
This contains tokenized text from the book Radio Planet, by Ralph Milne Farley. This is quite a different domain to the original (labelled) training data. Load text from the file: one sentence per line, with tokens separated by spaces.
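One possible way to read a file in this format is sketched below. The function name load_tokenized_sentences is hypothetical, and the UTF-8 encoding is an assumption about the file; adjust it if the text does not load cleanly.

```python
def load_tokenized_sentences(path):
    """Read one sentence per line; tokens are separated by whitespace."""
    sentences = []
    # Assumption: the file is UTF-8 encoded.
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            if tokens:  # skip blank lines
                sentences.append(tokens)
    return sentences
```

You could then call load_tokenized_sentences("radio_planet_tokens.txt") to get a list of token lists, one per sentence.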
Download the Python file w3utils.py and put it in the same directory as your Python source.
The function train_unsupervised() provides a small wrapper around NLTK’s training to ensure that the unsupervised model is expanded to cover words unseen in the labelled data. (This is a form of semi-supervised learning.)
Train an HMM using the previous labelled data as well as the new raw data. This could take a bit of time to run (for me it took 1-2 minutes per iteration). The default number of iterations is 3, but, if you have time, you might like to try increasing this (using the max_iterations kwarg to train_unsupervised()).
Training this model could take up to 15 minutes, depending on the parameters you set and the computer you’re running on. You will reuse it in the following exercises.
It’s a good idea to store your trained model (returned by train_unsupervised()) to a file using Python’s pickle library and then load it again on subsequent runs, so you don’t have to re-train it every time you run the later exercises.
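A minimal sketch of this load-or-train pattern follows. The helper load_or_train and its train_fn argument are hypothetical names, and the sketch assumes the object returned by training can be pickled, as suggested above.

```python
import os
import pickle


def load_or_train(path, train_fn):
    """Load a pickled model from `path` if it exists; otherwise train and save it.

    `train_fn` is any zero-argument callable that returns the trained model.
    """
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    model = train_fn()
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return model
```

For example, train_fn could be a small function of your own that calls train_unsupervised() with your data. Remember to delete the file if you change the training parameters, otherwise the old model will be loaded.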
Try tagging the earlier example sentences with the new model.
Also try tagging some new sentences from later in Radio Planet:
Yesterday these fiends operated upon Doggo .
For a time , his own soul and this brain - maggot struggled for supremacy .
- Has the unlabelled data improved the tagger?
- What else could you do to improve the tagger’s performance on different domains?
- Submit your answers
Try feeding some other sentences into the POS taggers, comparing the output from the supervised and semi-supervised models. Try out sentences from a number of different domains: e.g. fiction, news, legal documents, …
Submit a few sentences reporting your observations
The HMM class has a method log_probability() that uses the full generative model to assign a probability to a given input sentence. This allows you to use your trained model as a language model. The model is similar to the Markov language model seen in the lectures, except that word probabilities are conditioned on (unknown, inferred) POS tags.
The log_probability() method expects a list of (word, pos_tag) pairs as input. You don’t have to specify the POS tag (it will sum over all possible tags if not given), but you still have to give it (word, None) pairs.
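The input format described above can be built with a small helper like the one below. The name sentence_log_probability is made up for illustration, and model stands for whichever trained HMM you pass in.

```python
def sentence_log_probability(model, tokens):
    """Score a list of word tokens, leaving every POS tag unspecified."""
    # Pair each word with None so the model sums over all possible tags.
    untagged = [(word, None) for word in tokens]
    return model.log_probability(untagged)
```

For instance, sentence_log_probability(hmm, ["The", "answer", "is", "blowing", "in", "the", "wind", "."]) would score the earlier example sentence.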
Try using your models as LMs. Measure the log probability of some short sentences (perhaps including some of the examples above) using both of your HMMs. Also estimate the probability of some nonsense sentences you make up: use real words that are likely to be covered by the model.
- Do the nonsense sentences tend to receive lower probabilities than the real ones?
- Is the semi-supervised model better at distinguishing between real and nonsense sentences?
- Submit your answers
Compute the perplexities of your language models using the same test set as in last week’s exercise 3, from the Penn Treebank corpus:
from nltk.corpus import treebank
from w3utils import split_corpus
_, test_sents = split_corpus(treebank, 0.8)
test_set = [(t[0], None) for s in test_sents for t in s]
(Check that w3utils.py is in the same directory as your Python source.)
- What are the perplexities of your models (round the values to two decimal places)?
- What does the perplexity of a language model describe? Explain briefly.
- How could you find out whether a language model is ‘good’? Explain briefly.
- Submit the perplexities and your explanations
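As a reminder of the arithmetic, perplexity can be derived from a total log probability and the number of tokens. The sketch below assumes that log_probability() returns a base-2 logarithm, which is my understanding of NLTK's convention; check this for your version and change the base if needed. The function name perplexity is hypothetical.

```python
def perplexity(log2_prob, num_tokens):
    """Perplexity from the total base-2 log probability of num_tokens tokens."""
    # Average negative log probability per token, turned back into a probability scale.
    return 2 ** (-log2_prob / num_tokens)


# A sequence of 10 tokens with total log2 probability -20
# has an average of 2 bits per token, so the perplexity is 2**2.
print(perplexity(-20.0, 10))  # 4.0
```

With the test set above, num_tokens would be len(test_set).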
The HMM provides a simple method to sample random sentences from the model, random_sample(rng, length). It requires a random number generator (you can use Python’s random module) and a length for the sentence.
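A small sketch of how the sampling call might be wrapped is given below. It assumes that random_sample() returns a list of (word, tag) pairs and accepts a random.Random instance as its generator; both match my reading of NLTK but are worth checking. The helper name sample_words is made up.

```python
import random


def sample_words(model, length, seed=0):
    """Sample `length` tokens from the model and return only the words."""
    # A seeded generator makes the sample reproducible between runs.
    rng = random.Random(seed)
    sample = model.random_sample(rng, length)
    # Assumption: each sampled item is a (word, tag) pair.
    return [word for word, _tag in sample]
```

Calling " ".join(sample_words(hmm, 10)) would then give a readable sentence. Vary the seed to see different samples.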
Try generating some sentences from your HMMs.
- Do they look like real sentences?
- Why are they (usually) incoherent?
- Why don’t they look like the sentences in the training corpus?
- Is the unsupervised model better?
- Submit short answers to these questions
How might you improve on the sentence generator to give the generated sentences more coherence?
Don’t implement anything: just discuss (briefly).
Submit your discussion