Assignments for intensive NLP course, University of Helsinki
Carry out all the exercises below and submit your answers on Moodle. Also submit a single Python file containing your full implementation.
NLTK contains a method for loading a CFG from a string. Here, for example, is the small CFG given in the lecture, specified in the format NLTK can load.
cfg_rules = """
S -> NP VP
NP -> Det N | PropN
Det -> PosPro | Art
VP -> Vt NP
Art -> 'the' | 'a'
PropN -> 'Alice'
N -> 'duck' | 'telescope' | 'park'
Vt -> 'saw'
PosPro -> 'my' | 'her'
"""
cfg = nltk.CFG.fromstring(cfg_rules)
This grammar is almost in Chomsky Normal Form.
The only respect in which it diverges is that it
contains ‘unary’ rules, like ‘NP -> PropN’.
The version of CKY shown in the lecture permits
these and NLTK’s is_flexible_chomsky_normal_form()
method does too.
print(small_cfg.is_flexible_chomsky_normal_form())
The NLTK CFG type has a method to check that all the words of the input sentence are covered by lexical rules in the grammar. Check now that you’ve got lexical rules in your grammar for the two example sentences.
# Check that all the words of the input sentence are covered
sentences = [
"the purchase price includes two ancillary companies .".split(),
"the guild began a strike against the TV and movie industry in March 1988 .".split(),
]
for s in sentences:
grammar.check_coverage(s)
Add some more rules to your grammar so that it can also generate the sentence:
“the guild bought one ancillary company .”
So far, we’ve only run sanity checks that the words of sentences are covered by the grammar. We haven’t yet used the grammar to parse the sentences. NLTK includes implementations of a number of different parsing algorithms, including the bottom-up chart parsing algorithm introduced in the lecture (CKY).
The example trees include nodes with more than two children (e.g. the NP covering “the TV and movie industry”). This causes problems for the parsing algorithm, but any CFG can be converted to Chomsky normal form (CNF) without changing the sentences it generates.
For example, a rule
A -> B C D
can be replaced by
A -> B2 D
B2 -> B C
where B2
is a new non-terminal.
cfg.is_flexible_chomsky_normal_form()
.Load a bottom-up chart parser and initialize it with your CNF grammar:
from nltk.parse.chart import BottomUpChartParser
parser = BottomUpChartParser(cnf_grammar)
parses = list(parser.parse(sentence))
The results are tree structures.
If your grammar is correct, you should get at least one full parse for each of the example sentences in exercises 1 and 2, repeated here:
the purchase price includes two ancillary companies .
the guild began a strike against the TV and movie industry in March 1988 .
the guild bought one ancillary company .
Confirm that the trees produced by the parser match the two example trees (with the exception of the additional nodes added in normalization of the grammar).
(You can use the method parse.draw()
to display the
parse result graphically.)
NLTK provides easy access to a 10% sample of the Penn Treebank. The full treebank is not available without a license, but this sample is enough for us to build a treebank grammar from.
Start by loading the treebank as follows and taking a look at a couple of its parse trees, which are instances of NLTK’s Tree class:
from nltk.corpus import treebank
print(treebank.parsed_sents()[0])
print(treebank.parsed_sents()[1])
Or, graphically:
treebank.parsed_sents()[0].draw()
The trees in the corpus are represented using NLTK’s own data structures, including:
The same data structures (classes) are used the represents NTs and
productions in the grammars you created above. When you called
CFG.fromstring()
, the result was a
CFG object,
which contained Nonterminal
s, Production
s and strings defining
the CFG (see lectures).
The complete set of productions used in a parse tree is directly
available through its tree.productions()
method.
Given a set of productions using these NLTK data structures, you can directly build a CFG as follows:
cfg = nltk.CFG(nltk.Nonterminal("S"), productions)
This defines the S
non-terminal as the start symbol of
the grammar. The set of non-terminals and terminals will be all
of those used in the list of productions.
Build a treebank grammar from all the trees in this sample of the corpus.
Use your grammar as you did in exercise 4 to parse the following sentences.
Mr. Vinken is chairman .
Stocks rose .
Alan introduced a plan .
You will have noticed that your treebank parser produced a huge number of parse trees for even very short sentences. Most of these are highly implausible, resulting either from overgeneration by the grammar or from a high level of local ambiguity that could be reasonably well ruled out once the rest of the sentence is taken into account.
In practice, exhaustive parsing of long sentences becomes completely impractical.
We will now use the treebank to create a PCFG, learning the grammar from the corpus, as above, and estimating the probabilities associated with productions from the same data.
Begin by collecting counts of the many expansions of an S
non-terminal and using these to estimate a probability
distribution for S -> ?
rules.
NLTK provides a tool to estimate all the probabilities of a PCFG from the productions in a treebank.
from nltk import induce_pcfg
pcfg = induce_pcfg(Nonterminal("S"), productions)
The probabilities are computed in the same way you did in the previous exercise.
NLTK’s InsideChartParser
provides a probabilistic version
of the chart parsing algorithm you used above. It has
a beam_size
parameter, allowing you to perform
beam search to speed up parsing.
Try parsing the example sentences above, this time with
the PCFG. Experiment with the beam_size
parameter.
The first parse is now that favoured by the statistical model and should therefore look much more reasonable than a randomly chosen example from the exhaustive parse. Take a look at the top parse(s) and see what you think.
Now we’ll try parsing some data that wasn’t in the training corpus. Your parser can only process sentences made up of words it has seen before, since it has no mechanism for guessing what rules to use for unseen words. It will for the same reason struggle to handle grammatical constructions that differ from those in the training corpus.
Try feeding some sentences into the PCFG parser, as in ex 5, to see if you can find any full parses. You could try coming up with sentences yourself, or taking them from some other source, like news articles.
They will need to be tokenized
in the same style as the Penn Treebank. You can either do that
manually or use NLTK’s TreebankWordTokenizer
, which
produces PTB-style tokenization.