NLP 2020

Assignments for NLP course, University of Helsinki

Week 7: Temporal Information Extraction

Carry out all the exercises below and submit your answers on Moodle.

Also submit a zip archive containing your code for exercises 1 and 2, and scorer outputs. Give the files meaningful names.


Temporal Information Extraction is finding temporal expressions in a natural language text. “Monday”, “last summer”, “next year”, “May 29, 2019”, “today” are all examples of temporal expressions.

Exercise 1: Temporal IE with regural expressions

This is an example code that uses a regular expression to find temporal expressions in the given sentences.

import re

sentences = [
"Waxman Industries Inc. said holders of $6,542,000 face amount of its 6 1/4% convertible subordinated debentures, due March 15, 2007, have elected to convert the debt into about 683,000 common shares.",
"Seventy-five million copies of the rifle have been built since it entered production in February 1947, making it history's most widely distributed weapon.",
"Many of the local religious leaders who led the 1992 protests have moved."
]

months = '(January|February|March|April|May|June|July|August|September|October|November|December)'
timex = r'((%s\s+)?(\d{1,2},?\s+)?\d{4})' % months

for s in sentences:
    print (re.sub(timex, r'<TIMEX>\1</TIMEX>', s))

The output is the same sentences with temporal expressions marked up with <TIMEX> tag:

Waxman Industries Inc. said holders of $6,542,000 face amount of its 6 1/4% convertible subordinated debentures, due <TIMEX>March 15, 2007</TIMEX>, have elected to convert the debt into about 683,000 common shares.
Seventy-five million copies of the rifle have been built since it entered production in <TIMEX>February 1947</TIMEX>, making it history's most widely distributed weapon.
Many of the local religious leaders who led the <TIMEX>1992</TIMEX> protests have moved.

Exercise 1.1

The company said it expects to release third-quarter results in mid-November.

The thrift announced the plan Aug. 21.

The split and quarterly dividend will be payable Jan. 3 to stock of record Nov. 16, the company said.

Ogden Projects, whose shares began trading on the New York Stock Exchange in August, closed yesterday at $26.875, down 75 cents.

A spokeswoman for Crum amp Forster said employees were told early this week that numerous staff functions for the personal insurance lines were going to be centralized as a cost-cutting move.

For the quarter ended Sept. 30, Delta posted net income of $133.1 million, or $2.53 a share, up from $100 million, or $2.03 a share, a year earlier.

This is an output that should be produced:

The company said it expects to release third-quarter results in <TIMEX>mid-November</TIMEX>.

The thrift announced the plan <TIMEX>Aug. 21</TIMEX>.

The split and quarterly dividend will be payable <TIMEX>Jan. 3</TIMEX> to stock of record <TIMEX>Nov. 16</TIMEX>, the company said.

Ogden Projects, whose shares began trading on the New York Stock Exchange in <TIMEX>August</TIMEX>, closed <TIMEX>yesterday</TIMEX> at $26.875, down 75 cents.

A spokeswoman for Crum amp Forster said employees were told <TIMEX>early this week</TIMEX> that numerous staff functions for the personal insurance lines were going to be centralized as a cost-cutting move.

For <TIMEX>the quarter</TIMEX> ended <TIMEX>Sept. 30</TIMEX>, Delta posted net income of $133.1 million, or $2.53 a share, up from $100 million, or $2.03 a share, <TIMEX>a year earlier</TIMEX>.

Try to make your regular expressions as general as possible, so that they would capture not only given examples but also some other possible temporal expressions.

(Do not submit anything yet…)

Exercise 1.2

The folder consists of two sub-folders: raw that contains plain-text documents, and ann that contains the same documents manually annotated with temporal expressions.

This is a script, that processes raw documents one by one, annotates temporal expressions in each of them using regular expressions and output the result into a new folder called sub.

python process_folder.py train/raw

Now you have an output folder train/sub. The documents in the new folder are named <no>_sub.txt, where <no> is the same document number as in raw folder.

The scorer takes three parameters: path to the gold annotations folder, path to the system output folder and (optional) name of the output. E.g.:

python scorer.py train/ann/ train/sub/ train.txt

The scorer outputs the evaluation measures—recall, precision and F1-score—which you should include into your report, and a detailed evaluations in a separate file (if you used the example above the name of this file should be train.txt).

Repeat that process as many times as you like until you are satisfied with the scores.

Exercise 2: Temporal IE with spaCy

In this exercise we will solve the same problem using a pipeline of NLP tools.

First, the document is proccessed by spaCy to tokenize text and to assign each token with additional information, such as part of speech tags. Then, a matcher is used to find patterns in text. Patterns are similar to regular expressions but operate on tokens instead of characters.

python -m venv .env
source .env/bin/activate
pip install -U spacy
python -m spacy download en

See spaCy documentation for more details regarding installation.

Exercise 2.1

Download example code spacy_timex.py.
This programme processes three sentences that there used in Exercise 1 and outputs the same result using spaCy pattern matcher.

You do not need to understand everything that is going on in the code. Focus on the main function and try to understand how spaCy patterns work.

For more details on the pattern syntax consult spaCy documentation.

Exercise 2.2

This is a script, that process raw documents one by one, annotates them with temporal expressions using spaCy pattern matcher and prints output to another folder called sub2.

python process_folder_spacy.py train/raw

Now you have an output folder train/sub2. The documents in the new folder are named <no>_sub.txt, where <no> is the same document number as in raw folder.

Repeat that process as many times as you like, until you are satisfied with the scores.

Note that this process is a form of model training – you are updating the model’s parameters as you refine your matcher. Therefore, you use only the training set to do it, so that you are not influenced by the data in the development set.

Exercise 3: Model comparison

Now you have two time-expression annotators: regex-based and spaCy-based, and two scorer outputs for each annotator.

Compare scorer outputs obtained on development set. Try to find cases correctly processed by one annotator but missed by another.

Add some examples to justify your response.

Acknowledgements

The data used in this assignment are taken from TempEval-3 Temporal Annotation Shared Task. The shared task used much mpre elaborated annotation schema and consisted of several sub-tasks. More details on the tasks and the results can be found in the organizers’ paper.