Assignments for intensive NLP course, University of Helsinki
Carry out all the exercises below and submit your answers on Moodle.
Also submit a zip archive containing your code for exercises 1 and 2, and scorer outputs. Give the files meaningful names.
Temporal Information Extraction is finding temporal expressions in a natural language text. “Monday”, “last summer”, “next year”, “May 29, 2019”, “today” are all examples of temporal expressions.
This is an example code that uses a regular expression to find temporal expressions in the given sentences.
import re
sentences = [
"Waxman Industries Inc. said holders of $6,542,000 face amount of its 6 1/4% convertible subordinated debentures, due March 15, 2007, have elected to convert the debt into about 683,000 common shares.",
"Seventy-five million copies of the rifle have been built since it entered production in February 1947, making it history's most widely distributed weapon.",
"Many of the local religious leaders who led the 1992 protests have moved."
]
months = '(January|February|March|April|May|June|July|August|September|October|November|December)'
timex = r'((%s\s+)?(\d{1,2},?\s+)?\d{4})' % months
for s in sentences:
print (re.sub(timex, r'<TIMEX>\1</TIMEX>', s))
The output is the same sentences with temporal expressions marked up with <TIMEX>
tag:
Waxman Industries Inc. said holders of $6,542,000 face amount of its 6 1/4% convertible subordinated debentures, due <TIMEX>March 15, 2007</TIMEX>, have elected to convert the debt into about 683,000 common shares.
Seventy-five million copies of the rifle have been built since it entered production in <TIMEX>February 1947</TIMEX>, making it history's most widely distributed weapon.
Many of the local religious leaders who led the <TIMEX>1992</TIMEX> protests have moved.
The company said it expects to release third-quarter results in mid-November.
The thrift announced the plan Aug. 21.
The split and quarterly dividend will be payable Jan. 3 to stock of record Nov. 16, the company said.
Ogden Projects, whose shares began trading on the New York Stock Exchange in August, closed yesterday at $26.875, down 75 cents.
A spokeswoman for Crum amp Forster said employees were told early this week that numerous staff functions for the personal insurance lines were going to be centralized as a cost-cutting move.
For the quarter ended Sept. 30, Delta posted net income of $133.1 million, or $2.53 a share, up from $100 million, or $2.03 a share, a year earlier.
This is an output that should be produced:
The company said it expects to release third-quarter results in <TIMEX>mid-November</TIMEX>.
The thrift announced the plan <TIMEX>Aug. 21</TIMEX>.
The split and quarterly dividend will be payable <TIMEX>Jan. 3</TIMEX> to stock of record <TIMEX>Nov. 16</TIMEX>, the company said.
Ogden Projects, whose shares began trading on the New York Stock Exchange in <TIMEX>August</TIMEX>, closed <TIMEX>yesterday</TIMEX> at $26.875, down 75 cents.
A spokeswoman for Crum amp Forster said employees were told <TIMEX>early this week</TIMEX> that numerous staff functions for the personal insurance lines were going to be centralized as a cost-cutting move.
For <TIMEX>the quarter</TIMEX> ended <TIMEX>Sept. 30</TIMEX>, Delta posted net income of $133.1 million, or $2.53 a share, up from $100 million, or $2.03 a share, <TIMEX>a year earlier</TIMEX>.
Try to make your regular expressions as general as possible, so that they would capture not only given examples but also some other possible temporal expressions.
(Do not submit anything yet…)
The folder consists of two sub-folders: raw
that contains plain-text
documents, and ann
that contains the same documents manually
annotated with temporal expressions.
This is a script, that processes raw
documents one by one, annotates
temporal expressions in each of them using regular expressions and
output the result into a new folder called sub
.
Extend process_folder.py
script so that it capture more time
expressions. Use regular expressions made in Exercise 1.1.
Run your script on the training data like this:
python process_folder.py train/raw
Now you have an output folder train/sub
. The documents in the new
folder are named <no>_sub.txt
, where <no>
is the same document
number as in raw
folder.
The scorer takes three parameters: path to the gold annotations folder, path to the system output folder and (optional) name of the output. E.g.:
python scorer.py train/ann/ train/sub/ train.txt
The scorer outputs the evaluation measures—recall, precision and F1-score—which you should include into your report, and a detailed evaluations in a separate file.
Repeat that process as many times as you like until you are satisfied with the scores.
In this exercise we will solve the same problem using a pipeline of NLP tools.
First, the document is proccessed by spaCy to tokenize text and to assign each token with additional information, such as part of speech tags. Then, a matcher is used to find patterns in text. Patterns are similar to regular expressions but operate on tokens instead of characters.
python -m venv .env
source .env/bin/activate
pip install -U spacy
python -m spacy download en
See spaCy documentation for more details regarding installation.
Download example code spacy_timex.py.
This programme processes three sentences that there used in Exercise 1 and outputs the same result using spaCy pattern matcher.
You do not need to understand everything that is going on in the code. Focus on the main function and try to understand how spaCy patterns work.
For more details on the pattern syntax consult spaCy documentation.
This is a script, that process raw
documents one by one, annotates them with temporal expressions using spaCy pattern matcher and prints output to another folder called sub2
.
Extend process_folder_spacy.py
script so that it captures more
time expressions. Use patterns made in Exercise 2.1.
Run your script on the training data like this:
python process_folder_spacy.py train/raw
Now you have an output folder train/sub2
. The documents in the new
folder are named <no>_sub.txt
, where <no>
is the same document
number as in raw
folder.
Repeat that process as many times as you like, until you are satisfied with the scores.
Note that this process is a form of model training – you are updating the model’s parameters as you refine your matcher. Therefore, you use only the training set to do it, so that you are not influenced by the data in the development set.
Run your code and the scorer on the development set.
Now you have two time-expression annotators: regex-based and spaCy-based, and two scorer outputs for each annotator.
Compare scorer outputs obtained on development set. Try to find cases correctly processed by one annotator but missed by another.
What is the main strength of the spaCy annotator? What is its main weakness? Add some examples to justify your response.
The data used in this assignment are taken from TempEval-3 Temporal Annotation Shared Task. The shared task used much mpre elaborated annotation schema and consisted of several sub-tasks. More details on the tasks and the results can be found in the organizers’ paper.