DISCO - Wordspaces for DISCO API 2.0 and above
Arabic
ar-cc-fasttext-col
This Arabic word space was imported from fastText and contains word forms.
Word space type: COL
Word space size: 2.4 gigabytes
Corpus size: unknown
Number of queryable words: 2,000,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization with ICU Tokenizer.
Parameters used for word space computation: The pre-trained word vectors cc.ar.300.vec from fasttext.cc were imported with DISCO Builder.
The vectors were trained using CBOW with position-weights, 300 dimensions, with character n-grams of length 5, a window of size 5 and 10 negative samples.
Corpus: Common Crawl
License: Creative Commons Attribution-Share-Alike License 3.0
Download and installation: Download the archive cc.ar.300-COL.denseMatrix.bz2 and unpack it.
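The sketch below shows how such a COL word space might be queried after unpacking. It is a minimal example, not the definitive API usage: DISCO.load() follows the published DISCO 2.x examples, while semanticSimilarity() and the SimilarityMeasure constant are assumptions to be checked against the javadoc of your API version; the path and the Arabic word forms are illustrative.

    import de.linguatools.disco.*;

    public class ColSpaceDemo {
        public static void main(String[] args) throws Exception {
            // Load the unpacked word space (path is an example).
            DISCO disco = DISCO.load("cc.ar.300-COL.denseMatrix");
            // A COL word space supports computing the similarity between
            // two words. semanticSimilarity(...) and SimilarityMeasure
            // are assumed from the DISCO 2.x documentation.
            float sim = disco.semanticSimilarity("يوم", "ليلة",
                    SimilarityMeasure.COSINE);
            System.out.println("similarity = " + sim);
        }
    }

Note that a COL word space cannot list the most similar words for a given word; that requires a word space of type SIM like the following one.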
ar-cc-fasttext-sim
This Arabic word space was imported from fastText and contains word forms.
Word space type: SIM
Word space size: 5.4 gigabytes
Corpus size: unknown
Number of queryable words: 2,000,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization with ICU Tokenizer.
Parameters used for word space computation: The pre-trained word vectors cc.ar.300.vec from fasttext.cc were imported with DISCO Builder. The word space stores the 200 most similar words for each word, computed with the vector similarity measure COSINE.
The vectors were trained with fastText using CBOW with position-weights, 300 dimensions, with character n-grams of length 5, a window of size 5 and 10 negative samples.
Corpus: Common Crawl
License: Creative Commons Attribution-Share-Alike License 3.0
Download and installation: Download the archive cc.ar.300-SIM.denseMatrix.bz2 and unpack it.
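A minimal sketch of querying this SIM word space for the stored most similar words, assuming the DISCO.load() factory and the similarWords() method with its ReturnDataBN result (parallel arrays words[] and values[]) as shown in the published DISCO examples; the path and query word are illustrative.

    import de.linguatools.disco.*;

    public class SimSpaceDemo {
        public static void main(String[] args) throws Exception {
            // Load the unpacked word space (path is an example).
            DISCO disco = DISCO.load("cc.ar.300-SIM.denseMatrix");
            // A SIM word space stores the 200 most similar words per word,
            // so similarWords(...) is a lookup, not a full matrix scan.
            ReturnDataBN result = disco.similarWords("كتاب");
            for (int i = 0; i < result.words.length; i++) {
                System.out.println(result.words[i] + "\t" + result.values[i]);
            }
        }
    }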
English
enwiki-20130403-sim-lemma-mwl-lc
This English word space contains lowercased lemmata.
Word space type: SIM
Word space size: 2.3 gigabytes
Corpus size: 1.9 billion tokens
Number of queryable words: 420,184
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of stop words, removal of all words with frequency < 50, lemmatization, lowercasing of all words, and identification of multi-word lexemes (written with an underscore instead of a space).
- List of all multi-word lexemes with their frequencies: enwiki-20130403-sim-lemma-mwl-lc_MWL.txt
- Stop word list: stopword-list_en_utf8.txt
Parameters used for word space computation: Context window of ±3 words, taking the exact position into account; the 30,000 most frequent lemmata as features; significance measure from Kolb 2009 with threshold 0.1; similarity measure from Lin 1998.
Corpus:
- the English Wikipedia (dump from 3rd April 2013)
License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive enwiki-20130403-sim-lemma-mwl-lc.tar.bz2 and unpack it.
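Because this word space stores lowercased lemmata and joins multi-word lexemes with underscores, query words must be normalized the same way. A minimal sketch, assuming the DISCO.load() and frequency() calls from the published DISCO examples (check the exact signatures in the javadoc); the example words are illustrative.

    import de.linguatools.disco.*;

    public class LemmaQueryDemo {
        public static void main(String[] args) throws Exception {
            DISCO disco = DISCO.load("enwiki-20130403-sim-lemma-mwl-lc");
            // Queries must be lowercased lemmata: an inflected or
            // capitalized form like "Went" is not stored; the lemma "go" is.
            System.out.println("f(go) = " + disco.frequency("go"));
            // Multi-word lexemes use an underscore instead of a space:
            System.out.println("f(new_york) = " + disco.frequency("new_york"));
        }
    }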
enwiki-20130403-word2vec-lm-mwl-lc-sim
This English word space contains lowercased lemmata. It was created using
word2vec and then converted into a DISCO word space with the import
functionality of DISCO Builder.
Word space type: SIM
Word space size: 1.4 gigabytes
Corpus size: 1.9 billion tokens
Number of queryable words: 420,184
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of all words with frequency < 50, lemmatization, lowercasing of all words, and identification of multi-word lexemes (written with an underscore instead of a space).
- List of all multi-word lexemes with their frequencies: enwiki-20130403-sim-lemma-mwl-lc_MWL.txt
Parameters used for word space computation: word2vec was run with the parameters -size 400 -negative 10 -min-count 50, producing a CBOW word space. The vector similarity measure used by DISCO Builder was COSINE.
Corpus:
- the English Wikipedia (dump from 3rd April 2013)
License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive enwiki-20130403-word2vec-lm-mwl-lc-sim.tar.bz2 and unpack it.
enwiki-20130403-word2vec-lm-mwl-lc-sim.denseMatrix
This English word space contains lowercased lemmata. It was created using
word2vec and then converted into a DenseMatrix DISCO word space with the import
functionality of DISCO Builder.
Word space type: SIM
Word space size: 1.6 gigabytes
Corpus size: 1.9 billion tokens
Number of queryable words: 420,184
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of all words with frequency < 50, lemmatization, lowercasing of all words, and identification of multi-word lexemes (written with an underscore instead of a space).
- List of all multi-word lexemes with their frequencies: enwiki-20130403-sim-lemma-mwl-lc_MWL.txt
Parameters used for word space computation: word2vec was run with the parameters -size 400 -negative 10 -min-count 50, producing a CBOW word space. The vector similarity measure used by DISCO Builder was COSINE.
Corpus:
- the English Wikipedia (dump from 3rd April 2013)
License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive enwiki-20130403-word2vec-lm-mwl-lc-sim.denseMatrix.bz2 and unpack it.
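The imported word2vec vectors above come in two storage formats: the Lucene-index archive (previous entry) is queried from disk, while this denseMatrix file is held completely in RAM, which is faster but requires a correspondingly large Java heap. A sketch of the difference, assuming that DISCO.load() recognizes the format of the unpacked word space as in the published examples:

    import de.linguatools.disco.*;

    public class FormatDemo {
        public static void main(String[] args) throws Exception {
            // Disk-based Lucene index: small memory footprint, slower queries.
            DISCO onDisk = DISCO.load("enwiki-20130403-word2vec-lm-mwl-lc-sim");
            // In-memory dense matrix: fast queries, but start the JVM with
            // enough heap for the 1.6 GB file, e.g. java -Xmx4g ...
            DISCO inRam = DISCO.load("enwiki-20130403-word2vec-lm-mwl-lc-sim.denseMatrix");
            // Both variants answer the same queries.
            System.out.println(inRam.similarWords("house").words[0]);
        }
    }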
French
fr-general-20151126-lm-sim
This French word space contains lemmata.
Word space type: SIM
Word space size: 2.1 gigabytes
Corpus size: 1.9 billion tokens
Number of queryable words: 276,967
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of stop words, removal of all words with frequency < 50, and lemmatization.
- Stop word list: stopword-list_fr_utf8.txt
Parameters used for word space computation: Context window of ±3 words, taking the exact position into account; the 50,000 most frequent lemmata as features; significance measure from Kolb 2009 with threshold 0.1; similarity measure from Lin 1998.
Corpus:
- 692,319,667 tokens: the French Wikipedia (dump from 4th August 2014)
- 598,392,935 tokens: News
- 520,189,432 tokens: debates from EU and UN
- 185,987,928 tokens: subtitles
- 2,093,280 tokens: books from Project Gutenberg
License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive fr-general-20151126-lm-sim.tar.bz2 and unpack it.
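Since stop words and all lemmata with corpus frequency < 50 were removed, a query can miss. In the published DISCO examples similarWords() returns null for a word that is not in the word space (treat this as an assumption and verify it for your API version), so a defensive lookup might look like this:

    import de.linguatools.disco.*;

    public class SafeLookupDemo {
        public static void main(String[] args) throws Exception {
            DISCO disco = DISCO.load("fr-general-20151126-lm-sim");
            String query = "maison"; // a lemma, as stored in this word space
            ReturnDataBN result = disco.similarWords(query);
            if (result == null) {
                // Stop words and rare lemmata were removed in preprocessing.
                System.out.println(query + " is not in the word space");
            } else {
                for (int i = 0; i < Math.min(10, result.words.length); i++) {
                    System.out.println(result.words[i] + "\t" + result.values[i]);
                }
            }
        }
    }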
fr-general-20151126-lm-word2vec-sim
This French word space contains lemmata. It was created using
word2vec and then converted into a DISCO word space with the import
functionality of DISCO Builder.
Word space type: SIM
Word space size: 1.7 gigabytes
Corpus size: 1.9 billion tokens
Number of queryable words: 281,484
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of all words with frequency < 50, and lemmatization.
Parameters used for word space computation: word2vec was run with the parameters -size 400 -negative 10 -min-count 50, producing a CBOW word space. The vector similarity measure used by DISCO Builder was COSINE.
Corpus:
- 692,319,667 tokens: the French Wikipedia (dump from 4th August 2014)
- 598,392,935 tokens: News
- 520,189,432 tokens: debates from EU and UN
- 185,987,928 tokens: subtitles
- 2,093,280 tokens: books from Project Gutenberg
License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive fr-general-20151126-lm-word2vec-sim.tar.bz2 and unpack it.
German
de-general-20150421-lm-sim
This German word space contains lemmata.
Word space type: SIM
Word space size: 3.5 gigabytes
Corpus size: 1.5 billion tokens
Number of queryable words: 470,788
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of stop words, removal of all words with frequency < 50, and lemmatization.
- Stop word list: stopword-list_de_utf8.txt
Parameters used for word space computation: Context window of ±3 words, taking the exact position into account; the 50,000 most frequent lemmata as features; significance measure from Kolb 2009 with threshold 0.1; similarity measure from Lin 1998.
Corpus:
- 747,547,646 tokens: the German Wikipedia (dump from 25th July 2014)
- 335,937,237 tokens: Webcrawl (2014)
- 317,677,649 tokens: Newspapers and magazines (1949-2007)
- 64,174,384 tokens: parliamentary debates
- 25,195,504 tokens: books (mainly fiction from the period 1850-1920) from Project Gutenberg
- 13,553,836 tokens: TV and movie subtitles
License: Creative Commons Attribution-NonCommercial 3.0 Unported
Download and installation: Download the archive de-general-20150421-lm-sim.tar.bz2 and unpack it.
de-general-20150421-lm-word2vec-sim
This German word space contains lemmata. It was created using
word2vec and then converted into a DISCOLuceneIndex word space with the import
functionality of DISCO Builder.
Word space type: SIM
Word space size: 3.0 gigabytes
Corpus size: 1.5 billion tokens
Number of queryable words: 470,788
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of all words with frequency < 50, and lemmatization.
Parameters used for word space computation: word2vec was run with the parameters -size 400 -negative 10 -min-count 50, producing a CBOW word space. The vector similarity measure used by DISCO Builder was COSINE.
Corpus:
- 747,547,646 tokens: the German Wikipedia (dump from 25th July 2014)
- 335,937,237 tokens: Webcrawl (2014)
- 317,677,649 tokens: Newspapers and magazines (1949-2007)
- 64,174,384 tokens: parliamentary debates
- 25,195,504 tokens: books (mainly fiction from the period 1850-1920) from Project Gutenberg
- 13,553,836 tokens: TV and movie subtitles
License: Creative Commons Attribution-NonCommercial 3.0 Unported
Download and installation: Download the archive de-general-20150421-lm-word2vec-sim.tar.bz2 and unpack it.
de-general-20150421-lm-word2vec-sim.denseMatrix
This German word space contains lemmata. It was created using
word2vec and then converted into a DenseMatrix DISCO word space with the import
functionality of DISCO Builder.
Word space type: SIM
Word space size: 1.8 gigabytes
Corpus size: 1.5 billion tokens
Number of queryable words: 470,788
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of all words with frequency < 50, and lemmatization.
Parameters used for word space computation: word2vec was run with the parameters -size 400 -negative 10 -min-count 50, producing a CBOW word space. The vector similarity measure used by DISCO Builder was COSINE.
Corpus:
- 747,547,646 tokens: the German Wikipedia (dump from 25th July 2014)
- 335,937,237 tokens: Webcrawl (2014)
- 317,677,649 tokens: Newspapers and magazines (1949-2007)
- 64,174,384 tokens: parliamentary debates
- 25,195,504 tokens: books (mainly fiction from the period 1850-1920) from Project Gutenberg
- 13,553,836 tokens: TV and movie subtitles
License: Creative Commons Attribution-NonCommercial 3.0 Unported
Download and installation: Download the archive de-general-20150421-lm-word2vec-sim.denseMatrix.bz2 and unpack it.
Russian
ru-ruwac-ruwiki-lm-sim
This Russian word space contains lemmata.
Word space type: SIM
Word space size: 2.8 gigabytes
Corpus size: 2.2 billion tokens
Number of queryable words: 226,108
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of stop words, removal of all words with frequency < 100, and lemmatization.
- Stop word list: stopword-list_ru_utf8.txt
Parameters used for word space computation: Context window of ±3 words, taking the exact position into account; the 50,000 most frequent lemmata as features; significance measure from Kolb 2009 with threshold 0.1; similarity measure from Lin 1998.
Corpus:
- the Russian web corpus ruWaC
- the Russian Wikipedia
License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive ru-ruwac-ruwiki-lm-sim.tar.bz2 and unpack it.
ru-ruwac-ruwiki-lm-word2vec-sim
This Russian word space contains lemmata. It was created using
word2vec and then converted into a DISCO word space with the import
functionality of DISCO Builder.
Word space type: SIM
Word space size: 2.6 gigabytes
Corpus size: 2.2 billion tokens
Number of queryable words: 226,108
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of all words with frequency < 100, and lemmatization.
Parameters used for word space computation: word2vec was run with the parameters -size 400 -negative 10 -min-count 100, producing a CBOW word space. The vector similarity measure used by DISCO Builder was COSINE.
Corpus:
- the Russian web corpus ruWaC
- the Russian Wikipedia
License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive ru-ruwac-ruwiki-lm-word2vec-sim.tar.bz2 and unpack it.
ru-ruwac-ruwiki-lem-col
This Russian word space contains lemmata.
Word space type: COL
Word space size: 2.3 gigabytes
Corpus size: 2.2 billion tokens
Number of queryable words: 226,108
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of stop words, removal of all words with frequency < 100, and lemmatization.
- Stop word list: stopword-list_ru_utf8.txt
Parameters used for word space computation: Context window of ±5 words; the 200,000 most frequent lemmata as features; significance measure LOGLIKELIHOOD with threshold 7.0.
Corpus:
- the Russian web corpus ruWaC
- the Russian Wikipedia
License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive ru-ruwac-ruwiki-lem-col.tar.bz2 and unpack it.
ru-ruwac-ruwiki-col
This Russian word space contains word forms.
Word space type: COL
Word space size: 4.9 gigabytes
Corpus size: 2.2 billion tokens
Number of queryable words: 508,350
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of stop words, removal of all words with frequency < 100.
- Stop word list: stopword-list_ru_utf8.txt
Parameters used for word space computation: Context window of ±3 words; the 50,000 most frequent word forms as features; significance measure from Kolb 2009 with threshold 0.5.
Corpus:
- the Russian web corpus ruWaC
- the Russian Wikipedia
License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive ru-ruwac-ruwiki-col.tar.bz2 and unpack it.
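In contrast to the SIM word spaces above, the COL word spaces store each word's significant context words, which can be retrieved as collocations. A minimal sketch: the collocations() method returning an array of ReturnDataCol objects follows the DISCO 1.x examples and should be verified against the API 2.0 javadoc; the query word is illustrative.

    import de.linguatools.disco.*;

    public class CollocationDemo {
        public static void main(String[] args) throws Exception {
            DISCO disco = DISCO.load("ru-ruwac-ruwiki-col");
            // Retrieve the stored context words (collocations) of a
            // Russian word form together with their significance values.
            ReturnDataCol[] cols = disco.collocations("дом");
            for (ReturnDataCol c : cols) {
                System.out.println(c.word + "\t" + c.value);
            }
        }
    }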