DISCO - Description and download of the language data packets
Arabic
Packet name: ar-general-20120124
Packet size: 518 megabytes
Corpus size: 188 million tokens
Number of queryable words: 134,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words, deletion of all words with a frequency lower than 50.
- Stopword list: stopword-list_ar_utf8.txt
Parameters used for word space computation: Context window of ±3 words, taking the exact position into account; the 50,000 most frequent words as features; significance measure from Kolb 2009 with threshold 0.1; similarity measure from Lin 1998 (see the formula below).
Corpus:
- Arabic Wikipedia (XML dump 2012-01-14)
- Ajdir Corpora (online newspapers)
Download and installation:
- Download the archive ar-general-20120124.tar and unpack it (the password is disco).
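All packets on this page use the similarity measure from Lin 1998 to compare two words by their significant co-occurrence features. For reference, a sketch of the measure in the notation of Lin's paper, where T(w) is the set of co-occurrence features (r, w') with positive information value I(w, r, w'); in DISCO the relation r corresponds to a position in the ±3 word context window, and only features above the Kolb 2009 significance threshold are kept, so the implemented variant may differ in detail:

    sim(w_1, w_2) = \frac{\sum_{(r,w') \in T(w_1) \cap T(w_2)} \left( I(w_1,r,w') + I(w_2,r,w') \right)}
                         {\sum_{(r,w') \in T(w_1)} I(w_1,r,w') + \sum_{(r,w') \in T(w_2)} I(w_2,r,w')}

In words: two words are the more similar, the larger the share of their significant co-occurrence features that they have in common.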
German
Packet name: de-general-20131219-sim
Packet size: 2.2 gigabytes
Corpus size: 977,330,652 tokens
Number of queryable words: 246,119
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, deletion of stop words, deletion of all words with a frequency lower than 100.
- Stopword list: stopword-list_de_utf8.txt
Parameters used for word space computation: Context window of ±3 words, taking the exact position into account; the 30,000 most frequent words as features; significance measure from Kolb 2009 with threshold 0.1; similarity measure from Lin 1998.
Corpus:
- the German Wikipedia, dump from 12 December 2011 (569 million tokens)
- newspapers and magazines (297 million tokens)
- parliamentary debates (64 million tokens)
- fiction (31 million tokens)
- movie and TV subtitles (14 million tokens)
Download and installation:
- Download the archive de-general-20131219-sim.tar.bz2 and unpack it (the password is disco).
- After you have unpacked the archive, there should be a directory named de-general-20131219-sim. Do not rename or edit any of the files in the directory! (You may change the name of the directory.)
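Once unpacked, the directory can be queried with the DISCO Java API. The following is a minimal sketch assuming the DISCO 1.x API; the class de.linguatools.disco.DISCO, its constructor (word space directory plus a load-into-RAM flag), and the methods secondOrderSimilarity and similarWords are taken from that version's documentation and may differ in other releases, so check the API documentation shipped with your DISCO version:

    import de.linguatools.disco.DISCO;
    import de.linguatools.disco.ReturnDataBN;

    public class DiscoDemo {
        public static void main(String[] args) throws Exception {
            // Path to the unpacked word space directory
            // (remember: do not rename the files inside it).
            DISCO disco = new DISCO("/path/to/de-general-20131219-sim", false);

            // Distributional similarity between two words.
            float sim = disco.secondOrderSimilarity("Haus", "Gebäude");
            System.out.println("similarity(Haus, Gebäude) = " + sim);

            // The distributionally most similar words for a query word.
            ReturnDataBN result = disco.similarWords("Haus");
            for (int i = 0; i < result.words.length; i++) {
                System.out.println(result.words[i] + "\t" + result.values[i]);
            }
        }
    }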
Packet name: de-general-20080727
Packet size: 3.6 gigabytes
Corpus size: 400 million tokens
Number of queryable words: 200,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words.
Corpus:
- Encyclopedia (273 million tokens)
- Newspaper (48 million tokens)
- Periodicals (32 million tokens)
- Parliamentary debates (27 million tokens)
- Literature: Fiction and Non-fiction (20 million tokens)
Download and installation:
Please note that commercial use of this language data packet is not allowed!
- Download the archive de-general-20080727.tbz2 and unpack it (the password is disco).
English
Packet name: enwiki-20130403-sim-lemma-mwl-lc
Packet size: 2.3 gigabytes
Corpus size: 1,914,025,954 tokens
Number of queryable words: 420,184 (including multi-word lexemes like take_off)
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of stop words, removal of all words with a frequency lower than 50, lemmatization, conversion of all words to lower case, and identification of multi-word lexemes (which contain an underscore instead of a space).
- List of all multi-word lexemes with their frequencies: enwiki-20130403-sim-lemma-mwl-lc_MWL.txt
- Stopword list: stopword-list_en_utf8.txt
Parameters used for word space computation: Context window of ±3 words, taking the exact position into account; the 30,000 most frequent lemmata as features; significance measure from Kolb 2009 with threshold 0.1; similarity measure from Lin 1998. The computation took 11 days on a Core i3 and used 368 gigabytes of disk space.
Corpus:
- the English Wikipedia (dump from 3rd April 2013)
Download and installation:
- Download the archive enwiki-20130403-sim-lemma-mwl-lc.tar.bz2 and unpack it (the password is disco).
- After you have unpacked the archive, there should be a directory named enwiki-20130403-sim-lemma-mwl-lc. Do not rename or edit any of the files in the directory! (You may change the name of the directory.)
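Because this packet is lowercased, lemmatized, and stores multi-word lexemes with an underscore, surface forms should be normalized before querying. A minimal sketch; the helper toQueryForm is hypothetical (not part of DISCO), and inflected forms would additionally need to be lemmatized:

    public class QueryForm {
        // Maps a surface form to the form stored in this packet: the packet
        // is lowercased, and multi-word lexemes such as "take off" are
        // stored with an underscore, i.e. "take_off".
        static String toQueryForm(String surface) {
            return surface.trim().toLowerCase().replace(' ', '_');
        }

        public static void main(String[] args) {
            System.out.println(toQueryForm("Take off")); // prints: take_off
        }
    }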
Packet name: en-BNC-20080721
Packet size: 1.7 gigabytes
Corpus size: 119 million tokens
Number of queryable words: 122,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words.
Corpus:
- the British National Corpus (BNC)
Download and installation:
- Download the archive en-BNC-20080721.tbz2 and unpack it (the password is disco).
Packet name: en-PubMedOA-20070903
Packet size: 864 megabytes
Corpus size: 181 million tokens
Number of queryable words: 60,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words.
Corpus:
- approx. 100,000 medical articles from the PubMed Open Access database (July 2007).
Download and installation:
- Download the archive en-PubMedOA-20070903.tar and unpack it (the password is disco).
Packet name: en-wikipedia-20080101
Packet size: 5.9 gigabytes
Corpus size: 267 million tokens
Number of queryable words: 220,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words.
Corpus:
- approx. 300,000 articles from the English Wikipedia as of January 2008.
Download and installation:
- Download the archive en-wikipedia-20080101.tbz2 and unpack it (the password is disco).
French
Packet name: fr-wikipedia-20110201-lemma
Packet size: 513 megabytes
Corpus size: 458 million tokens
Number of queryable words: 154,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, lemmatization (using TreeTagger), deletion of the most frequent function words, deletion of all words with a frequency lower than 50.
Parameters used for word space computation: Context window of ±3 words, taking the exact position into account; the 30,000 most frequent lemmata as features; significance measure from Kolb 2009 with threshold 0.1; similarity measure from Lin 1998.
Corpus:
- French Wikipedia (XML dump from 1st February 2011)
Download and installation:
- Download the archive fr-wikipedia-20110201-lemma.tar and unpack it (the password is disco).
Packet name: fr-wikipedia-20080713
Packet size: 2.4 gigabytes
Corpus size: 105 million tokens
Number of queryable words: 188,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words, deletion of all words with a frequency lower than 12.
Corpus:
- the French Wikipedia (dump from 13 July 2008)
Download and installation:
- Download the archive fr-wikipedia-20080713.tbz2 and unpack it (the password is disco).
Italian
Packet name: it-general-20080815
Packet size: 2.3 gigabytes
Corpus size: 104 million tokens
Number of queryable words: 164,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words, deletion of all words with a frequency lower than 12.
Corpus:
- Encyclopedia (65 million tokens)
- Parliamentary debates (39 million tokens)
Download and installation:
- Download the archive it-general-20080815.tbz2 and unpack it (the password is disco).
Dutch
Packet name: nl-general-20081004
Packet size: 4.0 gigabytes
Corpus size: 114 million tokens
Number of queryable words: 200,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words, deletion of all words with a frequency lower than 10.
Corpus:
- Encyclopedia (58.4 million tokens)
- Parliamentary debates (37 million tokens)
- Literature (13 million tokens)
- Newspaper, radio (5.7 million tokens)
Download and installation:
- Download the archive nl-general-20081004.tbz2 and unpack it (the password is disco).
Czech
Packet name: cz-general-20080115
Packet size: 5.6 gigabytes
Corpus size: 163 million tokens
Number of queryable words: 320,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words.
Corpus:
- Newspaper articles 1998-2008 (59.5 million tokens)
- EU documents (59.0 million tokens)
- Encyclopedia, January 2008 (34.9 million tokens)
- Literature (fiction) 1850-2000 (10.4 million tokens)
- Subtitles of movies and TV series (5.0 million tokens)
Download and installation:
- Download the archive cz-general-20080115.tbz2 and unpack it (the password is disco).
Spanish
Packet name: es-general-20080720
Packet size: 5.0 gigabytes
Corpus size: 232 million tokens
Number of queryable words: 260,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words.
Corpus:
- Encyclopedia, July 2008 (184.6 million tokens)
- Parliamentary debates (41.6 million tokens)
- Literature (fiction) 1830-1930 (5.8 million tokens)
Download and installation:
Please note that commercial use of this language data packet is not allowed!
- Download the archive es-general-20080720.tbz2 and unpack it (the password is disco).
Russian
Packet name: ru-wikipedia-20110804
Packet size: 544 megabytes
Corpus size: 230 million tokens
Number of queryable words: 112,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words, deletion of all words with a frequency lower than 100.
- Stopword list: stopword-list_ru_utf8.txt
Parameters used for word space computation: Context window of ±3 words, taking the exact position into account; the 15,000 most frequent words as features; significance measure from Kolb 2009 with threshold 0.1; similarity measure from Lin 1998.
Corpus:
- Russian Wikipedia (XML dump 2011-03-28)
Download and installation:
- Download the archive ru-wikipedia-20110804.tar and unpack it (the password is disco).
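A general note on querying: words that were deleted during preprocessing (stop words and words below a packet's frequency threshold, here 100) are not queryable and simply yield no result. A defensive lookup, again sketched against the assumed DISCO 1.x API; in particular, the frequency method and its behaviour for unknown words are assumptions:

    import de.linguatools.disco.DISCO;
    import de.linguatools.disco.ReturnDataBN;

    public class SafeQuery {
        public static void main(String[] args) throws Exception {
            DISCO disco = new DISCO("/path/to/ru-wikipedia-20110804", false);
            String word = "язык"; // example query, chosen for illustration

            // Stop words and words with corpus frequency below 100 were
            // removed from this word space, so check before querying.
            if (disco.frequency(word) > 0) {
                ReturnDataBN similar = disco.similarWords(word);
                for (int i = 0; i < similar.words.length; i++) {
                    System.out.println(similar.words[i] + "\t" + similar.values[i]);
                }
            } else {
                System.out.println(word + " is not queryable in this packet.");
            }
        }
    }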