Assigned: Friday, 24 January 2014
Due: 11:59pm, Friday, 14 February 2014 (extended from 7 Feb.)
Writing Regular Expressions [10 points]
Download some example books from Project Gutenberg that are included with the Natural Language Toolkit: http://nltk.googlecode.com/svn/trunk/nltk_data/packages/corpora/gutenberg.zip.
When you unzip the file (with, e.g., unzip on Linux and Mac OS X), you should get a directory with several plain text .txt files.
This exercise is open ended. Use your favorite programming language, or the Unix grep utility, or the following two utility programs, regexs.py and regexcount.py, to explore this corpus. If you use the python code, the first argument is a regular expression and the rest are files. For instance, you could run
./regexs.py ' [A-Za-z]{20,} ' gutenberg/*.txt
to search all the Project Gutenberg texts for words over 20 characters long. Note the use of quotes to ensure that the spaces are interpreted as part of the regular expression.
Find interesting patterns, and write up the regexes that describe those patterns, some example results, and an explanation for why they're interesting. Some examples include: common morphological suffixes, patterns of verbs that introduce dialogue in novels, patterns that indicate proper names, patterns that indicate verbs.
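To get started, here are a few hypothetical patterns in the spirit of those examples; they follow the regexs.py usage shown above, and the patterns themselves are only illustrations, not required answers:
./regexs.py '[a-z]+ness ' gutenberg/*.txt
./regexs.py '" said [A-Z][a-z]+' gutenberg/*.txt
./regexs.py ' Mr\. [A-Z][a-z]+' gutenberg/*.txt
The first looks for words with the suffix -ness, the second for dialogue attributed with "said" followed by a capitalized name, and the third for proper names introduced by the title Mr.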
Compressed English [30 points]
In this problem and the following ones, you will be using the open-source OpenFst package from Google. This package does not have as much of the nice regular expression syntactic sugar as the Xerox xfst toolkit we read about, but it does support weighted finite-state machines. The weights, in this case probabilities on the arcs, will be important in deciding between competing paths.
You can either follow instructions for downloading and compiling the tools on your own machine, or you can use the pre-compiled versions on login.ccs.neu.edu. To do that, modify your PATH environment as follows:
export PATH=/home/dasmith/bin:$PATH
You will find it very helpful to read through the examples on the OpenFst site.
The OpenFst toolkit, in addition to C++ libraries, provides command-line tools for manipulating finite-state acceptors and transducers. Each tool performs a basic operation, like taking the union of two machines, inverting a transducer, or composing two transducers.
Two noticeable, and perhaps noticeably annoying, features of the toolkit are that you need to compile a text specification of a machine into a binary file format and that, for speed, all symbols in the input and output alphabets must be interned. Interning is a common trick for speeding up software that deals with language data. To avoid constantly comparing strings, which might be of arbitrary length, systems hash all unique strings to integers and then just compare integers in constant time. During composition, for instance, where we need to check whether the output symbol on an arc in the first transducer matches the input symbol on an arc in the second, this trick speeds things up a lot. What this means for you is that you need to specify up front what the input and output alphabets are. If you don't specify them, you may print out a machine only to see integers for symbols instead of nice human-readable strings.
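Concretely, an OpenFst symbol table is just a plain text file pairing each symbol with a unique integer, with <epsilon> conventionally assigned 0. The provided ascii.syms presumably follows this layout; a generic excerpt (the particular integers here are illustrative, not the contents of the actual file) looks like:
<epsilon> 0
<space> 1
a 2
b 3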
For this problem, you will be working with n-gram models of characters rather than of words. This makes the models smaller and your machines easier to write by hand. It also means we can just use the symbol table consisting of the fixed ASCII character set for most of the exercises.
Were the Greeks right to invent vowels? What if English, like Hebrew, Aramaic, Arabic, and some other languages, left vowels to be supplied by the reader?
Write a transducer that removes the vowels aeiou (not y) from any string of the 26 lowercase letters and space. (Note: By convention, OpenFst uses <space>, including those angle brackets, for the space character and uses <epsilon> for the empty string.)
Specify your transducer in text form in a file called unvowel.txt. Compile it into binary form with the command:
fstcompile --isymbols=ascii.syms --osymbols=ascii.syms < unvowel.txt > unvowel.fst
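As a reminder of the text format, each line of unvowel.txt gives a source state, destination state, input symbol, and output symbol, and final states are listed on lines of their own. A partial sketch of one possible single-state design, showing only a few of the arcs you would need, might look like:
0 0 b b
0 0 c c
0 0 <space> <space>
0 0 a <epsilon>
0 0 e <epsilon>
0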
I've compiled a finite-state acceptor for the first 100 or so characters of the Declaration of Independence into declaration.fst. You can print it out like so:
$ fstprint --isymbols=ascii.syms --osymbols=ascii.syms declaration.fst
0 1 w w
1 2 h h
2 3 e e
3 4 n n
4 5 <space> <space>
5 6 i i
6 7 n n
7 8 <space> <space>
8 9 t t
9 10 h h
10 11 e e
11 12 <space> <space>
...
If you compose unvowel.fst with this string, project to the lower language, and remove unnecessary epsilon transitions, you should see this:
$ fstcompose declaration.fst unvowel.fst | fstproject --project_output | fstrmepsilon | fstprint --isymbols=ascii.syms --osymbols=ascii.syms
0 1 w w
1 2 h h
2 3 n n
3 4 <space> <space>
4 5 n n
5 6 <space> <space>
6 7 t t
7 8 h h
8 9 <space> <space>
...
The unvowel transducer is a deterministic function and could easily have been written in your favorite programming language. Where finite-state methods really shine is in the ability to automatically manipulate these machines. You can trivially invert your transducer, for instance, turning the output into the input, e.g. fstinvert unvowel.fst. This inverted transducer would take unvowelled text and nondeterministically insert vowels into it. Since there are an infinite number of ways to insert vowels, we need some criterion for deciding which re-vowelings are best.
This is where weighted (character) n-gram language models come in. We've provided models for n = 1, 2, 3, 5, and 7 as c1.fst, c2.fst, and so on. Compose your inverted unvoweler with each of these language models to produce revowel1.fst, revowel2.fst, and so on.
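One possible way to build these is sketched below; the exact invocation is up to you, and depending on how the provided cN.fst models were built you may or may not need the fstarcsort step before composing:
fstinvert unvowel.fst | fstarcsort --sort_type=olabel | fstcompose - c7.fst > revowel7.fst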
The final pipeline for unvoweling and revoweling with a 7-gram model will look like:
fstcompose declaration.fst unvowel.fst | fstproject --project_output | fstrmepsilon | fstcompose - revowel7.fst | \
  fstshortestpath | fstproject --project_output | fstrmepsilon | fsttopsort | \
  fstprint --isymbols=ascii.syms --osymbols=ascii.syms
For each of the n-gram levels 1, 2, 3, 5, and 7, how many words have been correctly revoweled? What's wrong with n=1?
Perhaps removing all vowels was a bit drastic. Many writing systems will also indicate initial vowels, final vowels, or perhaps all long vowels. (Note also the psycholinguistic research showing that text is still quite legible if the first and last characters of words are intact and other characters are removed.) You should write another transducer, called squash.txt, to leave the first and last characters of each word alone and remove only internal vowels. The first three words would thus be Whn in the.
Construct similar unsquashing transducers for the squash transformation at all n-gram levels as above. How many words have been correctly unsquashed?
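The evaluation pipeline is the same as before, substituting your squashing and unsquashing machines, e.g. (assuming you name the compiled machines squash.fst and unsquash7.fst):
fstcompose declaration.fst squash.fst | fstproject --project_output | fstrmepsilon | fstcompose - unsquash7.fst | \
  fstshortestpath | fstproject --project_output | fstrmepsilon | fsttopsort | \
  fstprint --isymbols=ascii.syms --osymbols=ascii.syms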
Extra credit: [10 points] The OpenFst example page explains how to build a transducer to compute edit distance between two strings. In addition to the word-level accuracies we ask for above, compute character level accuracies.
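In outline, the standard Levenshtein construction (a sketch, not necessarily identical to the one on the OpenFst page) is a single-state transducer with cost-0 identity arcs for every symbol and cost-1 arcs for substitutions, insertions (<epsilon>:x), and deletions (x:<epsilon>); composing an acceptor for the first string, this edit transducer, and an acceptor for the second string, then taking the shortest path, gives the edit distance. A fragment of the text form over just the symbols a and b might look like:
0 0 a a 0
0 0 b b 0
0 0 a b 1
0 0 b a 1
0 0 a <epsilon> 1
0 0 b <epsilon> 1
0 0 <epsilon> a 1
0 0 <epsilon> b 1
0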
Pronouncing Numbers [30 points]
Create a transducer that maps numbers in the range 0 - 999,999,999, represented as digit strings, to their English read form, e.g., 911 maps to nine hundred eleven and 1000000 maps to one million.
Note: The input symbols can be the same ASCII character set as above, since single digits are ASCII characters, but the output symbols should be whole English words like "hundred" and "eleven".
Hint: Nondeterminism and epsilon outputs are helpful here, though this is still a one-to-one function.
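To make the hint concrete, here is a hypothetical fragment of such a transducer in text form (the state numbering and the word symbols one, eleven, and twelve are illustrative, and you would compile it with a separate output symbol table containing the English words). On reading an initial 1, the machine guesses nondeterministically whether the digit stands alone or starts a teen, emitting <epsilon> in the latter case until the next digit determines the word:
0 1 1 one
0 2 1 <epsilon>
2 3 1 eleven
2 3 2 twelve
1
3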