DISCO - Description and download of the language data packets
Arabic
Packet name: ar-general-20120124
Packet size: 518 megabytes
Corpus size: 188 million tokens
Number of queryable words: 134,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words, deletion of all words with a frequency lower than 50.
- Stopword list: stopword-list_ar_utf8.txt
Parameters used for word space computation: Context window of ±3 words, taking the exact position into account; the 50,000 most frequent words as features; significance measure from Kolb 2009 with threshold 0.1; similarity measure from Lin 1998 (see the formula below).
Corpus:
- Arabic Wikipedia (XML dump 2012-01-14)
- Ajdir Corpora (online newspapers)
Download and installation:
- Download the archive ar-general-20120124.tar and unpack it (the password is disco).
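All packets on this page use the similarity measure from Lin 1998 to compare two words by their significant co-occurrence features. For reference, a sketch of the measure in the notation of Lin's paper, where T(w) is the set of co-occurrence features (r, w') with positive information value I(w, r, w'); in DISCO the relation r corresponds to a position in the ±3 word context window, and only features above the Kolb 2009 significance threshold are kept, so the implemented variant may differ in detail:

    sim(w_1, w_2) = \frac{\sum_{(r,w') \in T(w_1) \cap T(w_2)} \left( I(w_1,r,w') + I(w_2,r,w') \right)}
                         {\sum_{(r,w') \in T(w_1)} I(w_1,r,w') + \sum_{(r,w') \in T(w_2)} I(w_2,r,w')}

In words: two words are the more similar, the larger the share of their significant co-occurrence features that they have in common.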
German
Packet name: de-general-20131219-sim
Packet size: 2.2 gigabytes
Corpus size: 977,330,652 tokens
Number of queryable words: 246,119
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, deletion of stop words, deletion of all words with a frequency lower than 100.
- Stopword list: stopword-list_de_utf8.txt
Parameters used for word space computation: Context window of ±3 words, taking the exact position into account; the 30,000 most frequent words as features; significance measure from Kolb 2009 with threshold 0.1; similarity measure from Lin 1998.
Corpus:
- the German Wikipedia, dump from 12 December 2011 (569 million tokens)
- newspapers and magazines (297 million tokens)
- parliamentary debates (64 million tokens)
- fiction (31 million tokens)
- movie and TV subtitles (14 million tokens)
Download and installation:
- Download the archive de-general-20131219-sim.tar.bz2 and unpack it (the password is disco).
- After you have unpacked the archive, there should be a directory named de-general-20131219-sim. Do not rename or edit any of the files in the directory! (You may change the name of the directory.)
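Once unpacked, the directory can be queried with the DISCO Java API. The following is a minimal sketch assuming the DISCO 1.x API; the class de.linguatools.disco.DISCO, its constructor (word space directory plus a load-into-RAM flag), and the methods secondOrderSimilarity and similarWords are taken from that version's documentation and may differ in other releases, so check the API documentation shipped with your DISCO version:

    import de.linguatools.disco.DISCO;
    import de.linguatools.disco.ReturnDataBN;

    public class DiscoDemo {
        public static void main(String[] args) throws Exception {
            // Path to the unpacked word space directory
            // (remember: do not rename the files inside it).
            DISCO disco = new DISCO("/path/to/de-general-20131219-sim", false);

            // Distributional similarity between two words.
            float sim = disco.secondOrderSimilarity("Haus", "Gebäude");
            System.out.println("similarity(Haus, Gebäude) = " + sim);

            // The distributionally most similar words for a query word.
            ReturnDataBN result = disco.similarWords("Haus");
            for (int i = 0; i < result.words.length; i++) {
                System.out.println(result.words[i] + "\t" + result.values[i]);
            }
        }
    }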
Packet name: de-general-20080727
Packet size: 3.6 gigabytes
Corpus size: 400 million tokens
Number of queryable words: 200,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words.
Corpus:
- Encyclopedia (273 million tokens)
- Newspaper (48 million tokens)
- Periodicals (32 million tokens)
- Parliamentary debates (27 million tokens)
- Literature: Fiction and Non-fiction (20 million tokens)
Download and installation:
Please note that commercial use of this language data packet is not allowed!
- Download the archive de-general-20080727.tbz2 and unpack it (the password is disco).
English
Packet name: enwiki-20130403-sim-lemma-mwl-lc
Packet size: 2.3 gigabytes
Corpus size: 1,914,025,954 tokens
Number of queryable words: 420,184 (including multi-word lexemes like take_off)
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of stop words, removal of all words with a frequency lower than 50, lemmatization, conversion of all words to lower case, and identification of multi-word lexemes (which contain an underscore instead of a space).
- List of all multi-word lexemes with their frequencies: enwiki-20130403-sim-lemma-mwl-lc_MWL.txt
- Stopword list: stopword-list_en_utf8.txt
Parameters used for word space computation: Context window of ±3 words, taking the exact position into account; the 30,000 most frequent lemmata as features; significance measure from Kolb 2009 with threshold 0.1; similarity measure from Lin 1998. The computation took 11 days on a Core i3 and used 368 gigabytes of disk space.
Corpus:
- the English Wikipedia (dump from 3rd April 2013)
Download and installation:
- Download the archive enwiki-20130403-sim-lemma-mwl-lc.tar.bz2 and unpack it (the password is disco).
- After you have unpacked the archive, there should be a directory named enwiki-20130403-sim-lemma-mwl-lc. Do not rename or edit any of the files in the directory! (You may change the name of the directory.)
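Because this packet is lowercased, lemmatized, and stores multi-word lexemes with an underscore, surface forms should be normalized before querying. A minimal sketch; the helper toQueryForm is hypothetical (not part of DISCO), and inflected forms would additionally need to be lemmatized:

    public class QueryForm {
        // Maps a surface form to the form stored in this packet: the packet
        // is lowercased, and multi-word lexemes such as "take off" are
        // stored with an underscore, i.e. "take_off".
        static String toQueryForm(String surface) {
            return surface.trim().toLowerCase().replace(' ', '_');
        }

        public static void main(String[] args) {
            System.out.println(toQueryForm("Take off")); // prints: take_off
        }
    }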
Packet name: en-BNC-20080721
Packet size: 1.7 gigabytes
Corpus size: 119 million tokens
Number of queryable words: 122,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words.
Corpus:
- the British National Corpus (BNC)
Download and installation:
- Download the archive en-BNC-20080721.tbz2 and unpack it (the password is disco).
Packet name: en-PubMedOA-20070903
Packet size: 864 megabytes
Corpus size: 181 million tokens
Number of queryable words: 60,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words.
Corpus:
- approx. 100,000 medical articles from the PubMed Open Access database (July 2007).
Download and installation:
- Download the archive en-PubMedOA-20070903.tar and unpack it (the password is disco).
Packet name: en-wikipedia-20080101
Packet size: 5.9 gigabytes
Corpus size: 267 million tokens
Number of queryable words: 220,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words.
Corpus:
- approx. 300,000 articles from the English Wikipedia as of January 2008.
Download and installation:
- Download the archive en-wikipedia-20080101.tbz2 and unpack it (the password is disco).
French
Packet name: fr-wikipedia-20110201-lemma
Packet size: 513 megabytes
Corpus size: 458 million tokens
Number of queryable words: 154,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, lemmatization (using TreeTagger), deletion of the most frequent function words, deletion of all words with a frequency lower than 50.
Parameters used for word space computation: Context window of ±3 words, taking the exact position into account; the 30,000 most frequent lemmata as features; significance measure from Kolb 2009 with threshold 0.1; similarity measure from Lin 1998.
Corpus:
- French Wikipedia (XML dump from 1st February 2011)
Download and installation:
- Download the archive fr-wikipedia-20110201-lemma.tar and unpack it (the password is disco).
Packet name: fr-wikipedia-20080713
Packet size: 2.4 gigabytes
Corpus size: 105 million tokens
Number of queryable words: 188,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words, deletion of all words with a frequency lower than 12.
Corpus:
- the French Wikipedia (dump from 13 July 2008)
Download and installation:
- Download the archive fr-wikipedia-20080713.tbz2 and unpack it (the password is disco).
Italian
Packet name: it-general-20080815
Packet size: 2.3 gigabytes
Corpus size: 104 million tokens
Number of queryable words: 164,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words, deletion of all words with a frequency lower than 12.
Corpus:
- Encyclopedia (65 million tokens)
- Parliamentary debates (39 million tokens)
Download and installation:
- Download the archive it-general-20080815.tbz2 and unpack it (the password is disco).
Dutch
Packet name: nl-general-20081004
Packet size: 4.0 gigabytes
Corpus size: 114 million tokens
Number of queryable words: 200,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words, deletion of all words with a frequency lower than 10.
Corpus:
- Encyclopedia (58.4 million tokens)
- Parliamentary debates (37 million tokens)
- Literature (13 million tokens)
- Newspaper, radio (5.7 million tokens)
Download and installation:
- Download the archive nl-general-20081004.tbz2 and unpack it (the password is disco).
Czech
Packet name: cz-general-20080115
Packet size: 5.6 gigabytes
Corpus size: 163 million tokens
Number of queryable words: 320,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words.
Corpus:
- Newspaper articles 1998-2008 (59.5 million tokens)
- EU documents (59.0 million tokens)
- Encyclopedia, January 2008 (34.9 million tokens)
- Literature (fiction) 1850-2000 (10.4 million tokens)
- Subtitles of movies and TV series (5.0 million tokens)
Download and installation:
- Download the archive cz-general-20080115.tbz2 and unpack it (the password is disco).
Spanish
Packet name: es-general-20080720
Packet size: 5.0 gigabytes
Corpus size: 232 million tokens
Number of queryable words: 260,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words.
Corpus:
- Encyclopedia, July 2008 (184.6 million tokens)
- Parliamentary debates (41.6 million tokens)
- Literature (fiction) 1830-1930 (5.8 million tokens)
Download and installation:
Please note that commercial use of this language data packet is not allowed!
- Download the archive es-general-20080720.tbz2 and unpack it (the password is disco).
Russian
Packet name: ru-wikipedia-20110804
Packet size: 544 megabytes
Corpus size: 230 million tokens
Number of queryable words: 112,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words, deletion of all words with a frequency lower than 100.
- Stopword list: stopword-list_ru_utf8.txt
Parameters used for word space computation: Context window of ±3 words, taking the exact position into account; the 15,000 most frequent words as features; significance measure from Kolb 2009 with threshold 0.1; similarity measure from Lin 1998.
Corpus:
- Russian Wikipedia (XML dump 2011-03-28)
Download and installation:
- Download the archive ru-wikipedia-20110804.tar and unpack it (the password is disco).
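A general note on querying: words that were deleted during preprocessing (stop words and words below a packet's frequency threshold, here 100) are not queryable and simply yield no result. A defensive lookup, again sketched against the assumed DISCO 1.x API; in particular, the frequency method and its behaviour for unknown words are assumptions:

    import de.linguatools.disco.DISCO;
    import de.linguatools.disco.ReturnDataBN;

    public class SafeQuery {
        public static void main(String[] args) throws Exception {
            DISCO disco = new DISCO("/path/to/ru-wikipedia-20110804", false);
            String word = "язык"; // example query, chosen for illustration

            // Stop words and words with corpus frequency below 100 were
            // removed from this word space, so check before querying.
            if (disco.frequency(word) > 0) {
                ReturnDataBN similar = disco.similarWords(word);
                for (int i = 0; i < similar.words.length; i++) {
                    System.out.println(similar.words[i] + "\t" + similar.values[i]);
                }
            } else {
                System.out.println(word + " is not queryable in this packet.");
            }
        }
    }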