Assigned: Friday, 24 January 2014
Due: 11:59pm, Friday, 14 February 2014 (extended from 7 Feb.)
Writing Regular Expressions [10 points]
Download some example books from Project Gutenberg that are included with the Natural Language Toolkit: http://nltk.googlecode.com/svn/trunk/nltk_data/packages/corpora/gutenberg.zip.
When you unzip the file (with, e.g., unzip on Linux and Mac OS X), you should get a directory with several plain text .txt files.
This exercise is open ended. Use your favorite programming language, or the Unix grep utility, or the following two utility programs, regexs.py and regexcount.py, to explore this corpus. If you use the python code, the first argument is a regular expression and the rest are files. For instance, you could run
./regexs.py ' [A-Za-z]{20,} ' gutenberg/*.txt
to search all the Project Gutenberg texts for words over 20 characters long. Note the use of quotes to ensure that the spaces are interpreted as part of the regular expression.
Find interesting patterns, and write up the regexes that describe those patterns, some example results, and an explanation for why they're interesting. Some examples include: common morphological suffixes, patterns of verbs that introduce dialogue in novels, patterns that indicate proper names, patterns that indicate verbs.
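To get started, here are a few hypothetical patterns in the spirit of those examples; they follow the regexs.py usage shown above, and the patterns themselves are only illustrations, not required answers:
./regexs.py '[a-z]+ness ' gutenberg/*.txt
./regexs.py '" said [A-Z][a-z]+' gutenberg/*.txt
./regexs.py ' Mr\. [A-Z][a-z]+' gutenberg/*.txt
The first looks for words with the suffix -ness, the second for dialogue attributed with "said" followed by a capitalized name, and the third for proper names introduced by the title Mr.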
Compressed English [30 points]
In this problem and the following ones, you will be using the open-source OpenFst package from Google. This package does not have as much of the nice regular expression syntactic sugar as the Xerox xfst toolkit we read about, but it does support weighted finite-state machines. The weights, in this case probabilities on the arcs, will be important in deciding between competing paths.
You can either follow instructions for downloading and compiling the tools on your own machine, or you can use the pre-compiled versions on login.ccs.neu.edu. To do that, modify your PATH environment as follows:
export PATH=/home/dasmith/bin:$PATH
You will find it very helpful to read through the examples on the OpenFst site.
The OpenFst toolkit, in addition to C++ libraries, provides command-line tools for manipulating finite-state acceptors and transducers. Each tool performs a basic operation, like taking the union of two machines, inverting a transducer, or composing two transducers.
Two noticeable, and perhaps noticeably annoying, features of the toolkit are that you need to compile a text specification of a machine into a binary file format and that, for speed, all symbols in the input and output alphabets must be interned. Interning is a common trick for speeding up software that deals with language data. To avoid constantly comparing strings, which might be of arbitrary length, systems hash all unique strings to integers and then just compare integers in constant time. During composition, for instance, where we need to check whether the output symbol on an arc in the first transducer matches the input symbol on an arc in the second, this trick speeds things up a lot. What this means for you is that you need to specify up front what the input and output alphabets are. If you don't specify them, you may print out a machine only to see integers for symbols instead of nice human-readable strings.
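Concretely, an OpenFst symbol table is just a plain text file pairing each symbol with a unique integer, with <epsilon> conventionally assigned 0. The provided ascii.syms presumably follows this layout; a generic excerpt (the particular integers here are illustrative, not the contents of the actual file) looks like:
<epsilon> 0
<space> 1
a 2
b 3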
For this problem, you will be working with n-gram models of characters rather than of words. This makes the models smaller and your machines easier to write by hand. It also means we can just use the symbol table consisting of the fixed ASCII character set for most of the exercises.
Were the Greeks right to invent vowels? What if English, like Hebrew, Aramaic, Arabic, and some other languages, left vowels to be supplied by the reader?
Write a transducer that removes the vowels aeiou (not y) from any string of the 26 lowercase letters and space. (Note: By convention, OpenFst uses <space>, including those angle brackets, for the space character and uses <epsilon> for the empty string.)
Specify your transducer in text form in a file called unvowel.txt. Compile it into binary form with the command:
fstcompile --isymbols=ascii.syms --osymbols=ascii.syms < unvowel.txt > unvowel.fst
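As a reminder of the text format, each line of unvowel.txt gives a source state, destination state, input symbol, and output symbol, and final states are listed on lines of their own. A partial sketch of one possible single-state design, showing only a few of the arcs you would need, might look like:
0 0 b b
0 0 c c
0 0 <space> <space>
0 0 a <epsilon>
0 0 e <epsilon>
0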
I've compiled a finite-state acceptor for the first 100 or so characters of the Declaration of Independence into declaration.fst. You can print it out like so:
$ fstprint --isymbols=ascii.syms --osymbols=ascii.syms declaration.fst
0 1 w w
1 2 h h
2 3 e e
3 4 n n
4 5 <space> <space>
5 6 i i
6 7 n n
7 8 <space> <space>
8 9 t t
9 10 h h
10 11 e e
11 12 <space> <space>
...
If you compose unvowel.fst with this string, project to the lower language, and remove unnecessary epsilon transitions, you should see this:
$ fstcompose declaration.fst unvowel.fst | fstproject --project_output | fstrmepsilon | fstprint --isymbols=ascii.syms --osymbols=ascii.syms
0 1 w w
1 2 h h
2 3 n n
3 4 <space> <space>
4 5 n n
5 6 <space> <space>
6 7 t t
7 8 h h
8 9 <space> <space>
...
The unvowel transducer is a deterministic function and could easily have been written in your favorite programming language. Where finite-state methods really shine is in the ability to automatically manipulate these machines. You can trivially invert your transducer, for instance, turning the output into the input, e.g. fstinvert unvowel.fst. This inverted transducer would take unvowelled text and nondeterministically insert vowels into it. Since there are an infinite number of ways to insert vowels, we need some criterion for deciding which re-vowelings are best.
This is where weighted (character) n-gram language models come in. We've provided models for n = 1, 2, 3, 5, and 7 as c1.fst, c2.fst, and so on. Compose your inverted unvoweler with each of these language models to produce revowel1.fst, revowel2.fst, and so on.
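One possible way to build these is sketched below; the exact invocation is up to you, and depending on how the provided cN.fst models were built you may or may not need the fstarcsort step before composing:
fstinvert unvowel.fst | fstarcsort --sort_type=olabel | fstcompose - c7.fst > revowel7.fst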
The final pipeline for unvoweling and revoweling with a 7-gram model will look like:
fstcompose declaration.fst unvowel.fst | fstproject --project_output | fstrmepsilon | fstcompose - revowel7.fst | \
  fstshortestpath | fstproject --project_output | fstrmepsilon | fsttopsort | \
  fstprint --isymbols=ascii.syms --osymbols=ascii.syms
For each of the n-gram levels 1, 2, 3, 5, and 7, how many words have been correctly revoweled? What's wrong with n=1?
Perhaps removing all vowels was a bit drastic. Many writing systems will also indicate initial vowels, final vowels, or perhaps all long vowels. (Note also the psycholinguistic research showing that text is still quite legible if the first and last characters of words are intact and other characters are removed.) You should write another transducer, called squash.txt, to leave the first and last characters of each word alone and remove only internal vowels. The first three words would thus be Whn in the.
Construct similar unsquashing transducers for the squash transformation at all n-gram levels as above. How many words have been correctly unsquashed?
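The evaluation pipeline is the same as before, substituting your squashing and unsquashing machines, e.g. (assuming you name the compiled machines squash.fst and unsquash7.fst):
fstcompose declaration.fst squash.fst | fstproject --project_output | fstrmepsilon | fstcompose - unsquash7.fst | \
  fstshortestpath | fstproject --project_output | fstrmepsilon | fsttopsort | \
  fstprint --isymbols=ascii.syms --osymbols=ascii.syms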
Extra credit: [10 points] The OpenFst example page explains how to build a transducer to compute edit distance between two strings. In addition to the word-level accuracies we ask for above, compute character level accuracies.
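In outline, the standard Levenshtein construction (a sketch, not necessarily identical to the one on the OpenFst page) is a single-state transducer with cost-0 identity arcs for every symbol and cost-1 arcs for substitutions, insertions (<epsilon>:x), and deletions (x:<epsilon>); composing an acceptor for the first string, this edit transducer, and an acceptor for the second string, then taking the shortest path, gives the edit distance. A fragment of the text form over just the symbols a and b might look like:
0 0 a a 0
0 0 b b 0
0 0 a b 1
0 0 b a 1
0 0 a <epsilon> 1
0 0 b <epsilon> 1
0 0 <epsilon> a 1
0 0 <epsilon> b 1
0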
Pronouncing Numbers [30 points]
Create a transducer that maps numbers in the range 0 - 999,999,999, represented as digit strings, to their English read form, e.g., 911 maps to nine hundred eleven and 1000000 maps to one million.
Note: The input symbols can be the same ASCII character set as above, since single digits are ASCII characters, but the output symbols should be whole English words like "hundred" and "eleven".
Hint: Nondeterminism and epsilon outputs are helpful here, though this is still a one-to-one function.
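To make the hint concrete, here is a hypothetical fragment of such a transducer in text form (the state numbering and the word symbols one, eleven, and twelve are illustrative, and you would compile it with a separate output symbol table containing the English words). On reading an initial 1, the machine guesses nondeterministically whether the digit stands alone or starts a teen, emitting <epsilon> in the latter case until the next digit determines the word:
0 1 1 one
0 2 1 <epsilon>
2 3 1 eleven
2 3 2 twelve
1
3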