DISCO - Description of the language data packets and download

Arabic

Packet name: ar-general-20120124
Packet size: 518 megabytes
Corpus size: 188 million tokens
Number of queryable words: 134,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words, deletion of all words with a frequency lower than 50.

Parameters used for word space computation: context window of ±3 words with exact position taken into account, the 50,000 most frequent lemmata as features, significance measure from Kolb 2009 with a threshold of 0.1, similarity measure from Lin 1998; a sketch of the similarity computation follows this packet description.
Corpus:

Download and installation:
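
The similarity measure from Lin 1998 compares two words by the significance weights of the context features they share, relative to the total weight of all their features. Below is a minimal Java sketch of this computation; the feature-key format (window position plus co-occurring word) and all weights are invented for illustration and do not reflect how DISCO stores its vectors.

    import java.util.HashMap;
    import java.util.Map;

    public class LinSimilarity {

        // Lin 1998: sim(w1,w2) = sum of the weights of all shared
        // features, divided by the sum of the weights of all features
        // of both words.
        static double sim(Map<String, Double> v1, Map<String, Double> v2) {
            double shared = 0.0;
            for (Map.Entry<String, Double> e : v1.entrySet()) {
                Double w = v2.get(e.getKey());
                if (w != null) {
                    shared += e.getValue() + w; // feature occurs in both vectors
                }
            }
            double total = 0.0;
            for (double w : v1.values()) total += w;
            for (double w : v2.values()) total += w;
            return total == 0.0 ? 0.0 : shared / total;
        }

        public static void main(String[] args) {
            // hypothetical significance-weighted vectors; keys encode
            // window position and co-occurring word
            Map<String, Double> house = new HashMap<>();
            house.put("-1:red", 1.2);
            house.put("+1:door", 0.8);
            Map<String, Double> home = new HashMap<>();
            home.put("-1:red", 0.9);
            home.put("+2:garden", 0.5);
            System.out.println(sim(house, home)); // 2.1 / 3.4 ≈ 0.62
        }
    }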


German

Packet name: de-general-20131219-sim
Packet size: 2.2 gigabytes
Corpus size: 977,330,652 tokens
Number of queryable words: 246,119
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, deletion of stop words, deletion of all words with a frequency lower than 100.

Parameters used for word space computation: context window of ±3 words with exact position taken into account, the 30,000 most frequent words as features, significance measure from Kolb 2009 with a threshold of 0.1 (a sketch of this weighting step follows this packet description), similarity measure from Lin 1998.
Corpus:

Download and installation:
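
The significance measure scores how strongly a word and a context feature are associated; only features scoring above the 0.1 threshold enter a word's vector. The exact formula from Kolb 2009 is not reproduced here; the Java sketch below uses a plain pointwise-mutual-information (PMI) weighting as a stand-in to show where the threshold applies, and all counts are invented.

    public class SignificanceFilter {

        // PMI-style association between word w and context feature c:
        // log2( P(w,c) / (P(w) * P(c)) ), estimated from raw counts.
        static double significance(long cooc, long freqW, long freqC, long total) {
            double pwc = (double) cooc / total;
            double pw  = (double) freqW / total;
            double pc  = (double) freqC / total;
            return Math.log(pwc / (pw * pc)) / Math.log(2);
        }

        public static void main(String[] args) {
            // invented counts for a word/feature pair in a ~977M-token corpus
            double sig = significance(120L, 5000L, 8000L, 977_330_652L);
            if (sig > 0.1) { // the threshold for keeping a feature
                System.out.println("keep feature with weight " + sig);
            }
        }
    }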


Packet name: de-general-20080727
Packet size: 3.6 gigabytes
Corpus size: 400 million tokens
Number of queryable words: 200,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words.
Corpus:

Download and installation:
Please note that commercial usage of this language data packet is not permitted. (More information here.)


English

Packet name: enwiki-20130403-sim-lemma-mwl-lc
Packet size: 2.3 gigabytes
Corpus size: 1,914,025,954 tokens
Number of queryable words: 420,184 (including multi-word lexemes like take_off)
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, deletion of stop words, deletion of all words with a frequency lower than 50, lemmatization, conversion of all words in the corpus to lower case, identification of multi-word lexemes (written with an underscore instead of a space); a sketch of these steps follows this packet description.

Parameters used for word space computation: context window of ±3 words with exact position taken into account, the 30,000 most frequent lemmata as features, significance measure from Kolb 2009 with a threshold of 0.1, similarity measure from Lin 1998. The computation took 11 days on a Core i3 and used 368 gigabytes of disk space.
Corpus:

Download and installation:
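
A minimal Java sketch of the preprocessing steps listed above: lower-casing, stop-word removal, and corpus-wide frequency filtering. The stop-word list and the tokenization are simplified stand-ins; real lemmatization and multi-word-lexeme detection require external resources and are only marked in the comments.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class Preprocess {

        // illustrative stop-word list; the packets use a list of the
        // most frequent function words of the language
        static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "of", "to", "and");
        static final int MIN_FREQ = 50; // frequency threshold of this packet

        static List<String> run(List<String> corpusTokens) {
            // lemmatization and multi-word-lexeme joining (take_off)
            // would happen before this point
            Map<String, Integer> freq = new HashMap<>();
            for (String t : corpusTokens) {
                String w = t.toLowerCase();
                if (!STOP_WORDS.contains(w)) {
                    freq.merge(w, 1, Integer::sum);
                }
            }
            List<String> kept = new ArrayList<>();
            for (String t : corpusTokens) {
                String w = t.toLowerCase();
                // drop stop words and words below the corpus-wide threshold
                if (!STOP_WORDS.contains(w) && freq.getOrDefault(w, 0) >= MIN_FREQ) {
                    kept.add(w);
                }
            }
            return kept;
        }
    }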


Packet name: en-BNC-20080721
Packet size: 1.7 gigabytes
Corpus size: 119 million tokens
Number of queryable words: 122,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words.
Corpus:

Download and installation:


Packet name: en-PubMedOA-20070903
Packet size: 864 megabytes
Corpus size: 181 million tokens
Number of queryable words: 60,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words.
Corpus:

Download and installation:


Packet name: en-wikipedia-20080101
Packet size: 5.9 gigabytes
Corpus size: 267 million tokens
Number of queryable words: 220,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words.
Corpus:

Download and installation:


French

Packet name: fr-wikipedia-20110201-lemma
Packet size: 513 megabytes
Corpus size: 458 million tokens
Number of queryable words: 154,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, lemmatization (using the TreeTagger), deletion of the most frequent function words, deletion of all words with a frequency lower than 50.
Parameters used for word space computation: context window of ±3 words with exact position taken into account, the 30,000 most frequent lemmata as features, significance measure from Kolb 2009 with a threshold of 0.1, similarity measure from Lin 1998.
Corpus:

Download and installation:

Packet name: fr-wikipedia-20080713
Packet size: 2.4 gigabytes
Corpus size: 105 million tokens
Number of queryable words: 188,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words, deletion of all words with a frequency lower than 12.
Corpus:

Download and installation:


Italian

Packet name: it-general-20080815
Packet size: 2.3 gigabytes
Corpus size: 104 million tokens
Number of queryable words: 164,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words, deletion of all words with a frequency lower than 12.
Corpus:

Download and installation:


Dutch

Packet name: nl-general-20081004
Packet size: 4.0 gigabytes
Corpus size: 114 million tokens
Number of queryable words: 200,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words, deletion of all words with a frequency lower than 10.
Corpus:

Download and installation:


Czech

Packet name: cz-general-20080115
Packet size: 5.6 gigabytes
Corpus size: 163 million tokens
Number of queryable words: 320,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words.
Corpus:

Download and installation:


Spanish

Packet name: es-general-20080720
Packet size: 5.0 gigabytes
Corpus size: 232 million tokens
Number of queryable words: 260,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words.
Corpus:

Download and installation:
Please note that commercial usage of this language data packet is not permitted. (More information here.)


Russian

Packet name: ru-wikipedia-20110804
Packet size: 544 megabytes
Corpus size: 230 million tokens
Number of queryable words: 112,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words, deletion of all words with a frequency lower than 100.

Parameters used for word space computation: context window of ±3 words with exact position taken into account, the 15,000 most frequent lemmata as features, significance measure from Kolb 2009 with a threshold of 0.1, similarity measure from Lin 1998.
Corpus:

Download and installation:
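
Once a packet has been downloaded and unpacked, it can be queried with the DISCO Java API. The sketch below is a hedged example: the class, constructor and member names (DISCO, similarWords, ReturnDataBN.words / ReturnDataBN.values) follow the DISCO 1.x documentation as I recall it and should be verified against the API version matching the packet you download.

    import de.linguatools.disco.DISCO;
    import de.linguatools.disco.ReturnDataBN;

    public class QueryPacket {
        public static void main(String[] args) throws Exception {
            // directory of the unpacked language data packet;
            // false = look data up on disk instead of loading it into RAM
            DISCO disco = new DISCO("/path/to/ru-wikipedia-20110804", false);

            // the distributionally most similar words and their scores
            ReturnDataBN res = disco.similarWords("дом");
            if (res != null) {
                for (int i = 0; i < res.words.length; i++) {
                    System.out.println(res.words[i] + "\t" + res.values[i]);
                }
            }
        }
    }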