DISCO: Corpus preprocessing
This page describes the preprocessing steps that have been applied to the
DISCO corpora.
There are three steps that are always applied:
- Tokenization
- Removal of stop words. The stop word lists used (provided only for some language data packets) are given on the packet description page.
- Removal of low-frequency words. Depending on corpus size, "low" means a frequency of fewer than 20 to 100 occurrences. See the packet description for the frequency threshold used in each packet. A small sketch of these two filtering steps is shown below.
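For illustration only, here is a minimal Java sketch of the two filtering steps (stop word removal and low-frequency filtering) applied to an already tokenized corpus. The stop word list, the threshold MIN_FREQ and all identifiers are hypothetical; this is not DISCO's actual preprocessing code, and the real thresholds and lists are those given on the packet description pages:

import java.util.*;
import java.util.stream.Collectors;

public class FilterSketch {
    public static void main(String[] args) {
        // Already tokenized corpus (step 1, tokenization, is assumed to be done).
        List<String> tokens = Arrays.asList("the", "city", "of", "new", "york", "the", "city");

        // Hypothetical stop word list; the real lists are linked on the packet description page.
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "of", "and", "a"));

        // Hypothetical threshold; the real value lies between 20 and 100, depending on corpus size.
        final int MIN_FREQ = 2;

        // Count how often each token occurs in the whole corpus.
        Map<String, Long> freq = tokens.stream()
                .collect(Collectors.groupingBy(t -> t, Collectors.counting()));

        // Remove stop words and words below the frequency threshold.
        List<String> filtered = tokens.stream()
                .filter(t -> !stopWords.contains(t))
                .filter(t -> freq.get(t) >= MIN_FREQ)
                .collect(Collectors.toList());

        System.out.println(filtered);   // prints [city, city]
    }
}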
The other steps are indicated in the name of the language data packets. These steps are:
- lemma: Lemmatization
- mwl: Identification of multi-word lexemes
- lc: Conversion of all words to lower case
These steps are described in more detail in the following sections.
Lemmatization
'lemma' indicates that the word space has been built on a corpus in which all word forms have been replaced by their base forms (lemmata). You can only search for base forms, and you will only find base forms as similar words. However, since the lemmatizer does not know all words, you will occasionally also find inflected forms.
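Conceptually (this is only an illustrative sketch, not the lemmatizer that was actually used), the substitution can be pictured as a dictionary lookup in which unknown word forms are kept unchanged, which is why some inflected forms survive:

import java.util.*;

public class LemmaSketch {
    public static void main(String[] args) {
        // Hypothetical lemma dictionary: word form -> base form.
        Map<String, String> lemmaOf = new HashMap<>();
        lemmaOf.put("went", "go");
        lemmaOf.put("cities", "city");

        String[] tokens = {"she", "went", "to", "several", "cities"};
        StringBuilder out = new StringBuilder();
        for (String t : tokens) {
            // Word forms unknown to the lemmatizer are left as they are.
            out.append(lemmaOf.getOrDefault(t, t)).append(' ');
        }
        System.out.println(out.toString().trim());   // she go to several city
    }
}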
Identification of multi-word lexemes
'mwl' stands for multi-word lexemes (MWL). These are multi-token words in which the white space between the tokens has been replaced by an underscore, e.g. New_York_City. Word spaces marked with 'mwl' contain a number of multi-word lexemes that were annotated in the corpus before word space construction. For the English Wikipedia language data packet we combined all multi-word lexemes listed in the SPECIALIST lexicon with all article names from the English Wikipedia. We filtered the resulting list by removing all MWLs that occur less than 50 times in the Wikipedia corpus. Additionally, we filtered by part-of-speech patterns, i.e. we only kept entries that match a phrasal verb pattern or a noun phrase pattern like one of the following:
ADJ N, N N, ADJ ADJ N, ADJ N N, N N N, N PREP N, N POS N, N THE N, ...
In the end, we had some 50,000 multi-word lexemes, which we annotated in the corpus by converting each of them into a single token, i.e. by replacing the spaces between its parts with underscores.
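The annotation step itself can be sketched as follows (the three MWLs and the sample sentence are made up; the real list contained about 50,000 entries). Longer MWLs have to be replaced before shorter ones, so that e.g. New York City does not end up as New_York City:

import java.util.*;

public class MwlSketch {
    public static void main(String[] args) {
        // Hypothetical MWL list; the real one was derived from the SPECIALIST lexicon
        // and the Wikipedia article names, filtered by frequency and POS pattern.
        List<String> mwls = new ArrayList<>(Arrays.asList("New York", "New York City", "phrasal verb"));

        // Replace longer MWLs first so that shorter ones cannot break them apart.
        mwls.sort(Comparator.comparingInt(String::length).reversed());

        String text = "He moved to New York City last year .";
        for (String mwl : mwls) {
            // Convert the MWL into a single token by joining its parts with underscores.
            text = text.replace(mwl, mwl.replace(' ', '_'));
        }
        System.out.println(text);   // He moved to New_York_City last year .
    }
}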
Since all MWLs contain at least one underscore character ('_'), you can output a
list of all MWLs stored in a given language data packet by typing the following
commands (on Linux/Unix; on Windows, install Cygwin first):
java -jar disco-1.4.jar LANGUAGE_DATA_PACKET_DIR -wl WORDLIST_FILE
grep "_" WORDLIST_FILE
The first command writes a list of all words in the language data packet
LANGUAGE_DATA_PACKET_DIR into the file WORDLIST_FILE. The second command scans
WORDLIST_FILE and prints all words that contain an underscore.
Lowercase
'lc' indicates that all words in the corpus were converted to lower case before word
space computation. You can only search for lower case words, and you will only get lower
case words as similar words. For example:
java -jar disco-1.4.jar enwiki-20130403-sim-lemma-mwl-lc -bn new_york 4
new_york_city 0.9175
n.y. 0.7447
boston 0.6592
chicago 0.6276
java -jar disco-1.4.jar enwiki-20130403-sim-lemma-mwl-lc -bn New_York 4
The word "New_York" was not found.
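When you query an 'lc' packet from your own Java code rather than from the command line, the query string therefore has to be lowercased first. The following sketch assumes the DISCO class and its similarWords() method from the DISCO Java API; the exact constructor and return types are an assumption here, so please check the javadoc of your DISCO version:

import de.linguatools.disco.DISCO;
import de.linguatools.disco.ReturnDataBN;

public class LowercaseQuery {
    public static void main(String[] args) throws Exception {
        // Open the language data packet (second argument: do not load the index into RAM).
        DISCO disco = new DISCO("enwiki-20130403-sim-lemma-mwl-lc", false);

        // The packet is 'lc', so lowercase the query; "New_York" itself would not be found.
        String query = "New_York".toLowerCase();

        ReturnDataBN result = disco.similarWords(query);
        if (result != null) {
            // Print the most similar words together with their similarity values.
            for (int i = 0; i < result.words.length && i < 4; i++) {
                System.out.println(result.words[i] + "\t" + result.values[i]);
            }
        }
    }
}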