DISCO - Wordspaces for DISCO API 2.0 and above
Arabic
ar-cc-fasttext-col
This Arabic word space was imported from fastText and contains word forms.
Word space type: COL
Word space size: 2.4 gigabytes
Corpus size: unknown
Number of queryable words: 2,000,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization with ICU Tokenizer.
Parameters used for word space computation: The pre-trained word vectors cc.ar.300.vec from fasttext.cc were imported with DISCO Builder.
The vectors were trained using CBOW with position-weights, 300 dimensions, with character n-grams of length 5, a window of size 5 and 10 negative samples.
Corpus: Common Crawl
License: Creative Commons Attribution-Share-Alike License 3.0
Download and installation: Download the archive cc.ar.300-COL.denseMatrix.bz2 and unpack it.
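The sketch below shows how such a COL word space might be queried after unpacking. It is a minimal example, not the definitive API usage: DISCO.load() follows the published DISCO 2.x examples, while semanticSimilarity() and the SimilarityMeasure constant are assumptions to be checked against the javadoc of your API version; the path and the Arabic word forms are illustrative.

    import de.linguatools.disco.*;

    public class ColSpaceDemo {
        public static void main(String[] args) throws Exception {
            // Load the unpacked word space (path is an example).
            DISCO disco = DISCO.load("cc.ar.300-COL.denseMatrix");
            // A COL word space supports computing the similarity between
            // two words. semanticSimilarity(...) and SimilarityMeasure
            // are assumed from the DISCO 2.x documentation.
            float sim = disco.semanticSimilarity("يوم", "ليلة",
                    SimilarityMeasure.COSINE);
            System.out.println("similarity = " + sim);
        }
    }

Note that a COL word space cannot list the most similar words for a given word; that requires a word space of type SIM like the following one.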
ar-cc-fasttext-sim
This Arabic word space was imported from fastText and contains word forms.
Word space type: SIM
Word space size: 5.4 gigabytes
Corpus size: unknown
Number of queryable words: 2,000,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization with ICU Tokenizer.
Parameters used for word space computation: The pre-trained word vectors cc.ar.300.vec from fasttext.cc were imported with DISCO Builder. The word space stores the 200 most similar words for each word, computed with the vector similarity measure COSINE.
The vectors were trained with fastText using CBOW with position-weights, 300 dimensions, with character n-grams of length 5, a window of size 5 and 10 negative samples.
Corpus: Common Crawl
License: Creative Commons Attribution-Share-Alike License 3.0
Download and installation: Download the archive cc.ar.300-SIM.denseMatrix.bz2 and unpack it.
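A minimal sketch of querying this SIM word space for the stored most similar words, assuming the DISCO.load() factory and the similarWords() method with its ReturnDataBN result (parallel arrays words[] and values[]) as shown in the published DISCO examples; the path and query word are illustrative.

    import de.linguatools.disco.*;

    public class SimSpaceDemo {
        public static void main(String[] args) throws Exception {
            // Load the unpacked word space (path is an example).
            DISCO disco = DISCO.load("cc.ar.300-SIM.denseMatrix");
            // A SIM word space stores the 200 most similar words per word,
            // so similarWords(...) is a lookup, not a full matrix scan.
            ReturnDataBN result = disco.similarWords("كتاب");
            for (int i = 0; i < result.words.length; i++) {
                System.out.println(result.words[i] + "\t" + result.values[i]);
            }
        }
    }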
English
enwiki-20130403-sim-lemma-mwl-lc
This English word space contains lowercased lemmata.
Word space type: SIM
Word space size: 2.3 gigabytes
Corpus size: 1.9 billion tokens
Number of queryable words: 420,184
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of stop words, removal of all words with frequency < 50, lemmatization, lowercasing of all words, and identification of multi-word lexemes (written with an underscore instead of a space).
- List of all multi-word lexemes with their frequencies: enwiki-20130403-sim-lemma-mwl-lc_MWL.txt
- Stop word list: stopword-list_en_utf8.txt
Parameters used for word space computation: Context window of ±3 words, taking the exact position into account; the 30,000 most frequent lemmata as features; significance measure from Kolb 2009 with threshold 0.1; similarity measure from Lin 1998.
Corpus:
- the English Wikipedia (dump from 3rd April 2013)
License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive enwiki-20130403-sim-lemma-mwl-lc.tar.bz2 and unpack it.
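Because this word space stores lowercased lemmata and joins multi-word lexemes with underscores, query words must be normalized the same way. A minimal sketch, assuming the DISCO.load() and frequency() calls from the published DISCO examples (check the exact signatures in the javadoc); the example words are illustrative.

    import de.linguatools.disco.*;

    public class LemmaQueryDemo {
        public static void main(String[] args) throws Exception {
            DISCO disco = DISCO.load("enwiki-20130403-sim-lemma-mwl-lc");
            // Queries must be lowercased lemmata: an inflected or
            // capitalized form like "Went" is not stored; the lemma "go" is.
            System.out.println("f(go) = " + disco.frequency("go"));
            // Multi-word lexemes use an underscore instead of a space:
            System.out.println("f(new_york) = " + disco.frequency("new_york"));
        }
    }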
enwiki-20130403-word2vec-lm-mwl-lc-sim
This English word space contains lowercased lemmata. It was created using
word2vec and then converted into a DISCO word space with the import
functionality of DISCO Builder.
Word space type: SIM
Word space size: 1.4 gigabytes
Corpus size: 1.9 billion tokens
Number of queryable words: 420,184
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of all words with frequency < 50, lemmatization, lowercasing of all words, and identification of multi-word lexemes (written with an underscore instead of a space).
- List of all multi-word lexemes with their frequencies: enwiki-20130403-sim-lemma-mwl-lc_MWL.txt
Parameters used for word space computation: word2vec was run with the parameters -size 400 -negative 10 -min-count 50, producing a CBOW word space. The vector similarity measure used by DISCO Builder was COSINE.
Corpus:
- the English Wikipedia (dump from 3rd April 2013)
License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive enwiki-20130403-word2vec-lm-mwl-lc-sim.tar.bz2 and unpack it.
enwiki-20130403-word2vec-lm-mwl-lc-sim.denseMatrix
This English word space contains lowercased lemmata. It was created using
word2vec and then converted into a DenseMatrix DISCO word space with the import
functionality of DISCO Builder.
Word space type: SIM
Word space size: 1.6 gigabytes
Corpus size: 1.9 billion tokens
Number of queryable words: 420,184
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of all words with frequency < 50, lemmatization, lowercasing of all words, and identification of multi-word lexemes (written with an underscore instead of a space).
- List of all multi-word lexemes with their frequencies: enwiki-20130403-sim-lemma-mwl-lc_MWL.txt
Parameters used for word space computation: word2vec was run with the parameters -size 400 -negative 10 -min-count 50, producing a CBOW word space. The vector similarity measure used by DISCO Builder was COSINE.
Corpus:
- the English Wikipedia (dump from 3rd April 2013)
License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive enwiki-20130403-word2vec-lm-mwl-lc-sim.denseMatrix.bz2 and unpack it.
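The imported word2vec vectors above come in two storage formats: the Lucene-index archive (previous entry) is queried from disk, while this denseMatrix file is held completely in RAM, which is faster but requires a correspondingly large Java heap. A sketch of the difference, assuming that DISCO.load() recognizes the format of the unpacked word space as in the published examples:

    import de.linguatools.disco.*;

    public class FormatDemo {
        public static void main(String[] args) throws Exception {
            // Disk-based Lucene index: small memory footprint, slower queries.
            DISCO onDisk = DISCO.load("enwiki-20130403-word2vec-lm-mwl-lc-sim");
            // In-memory dense matrix: fast queries, but start the JVM with
            // enough heap for the 1.6 GB file, e.g. java -Xmx4g ...
            DISCO inRam = DISCO.load("enwiki-20130403-word2vec-lm-mwl-lc-sim.denseMatrix");
            // Both variants answer the same queries.
            System.out.println(inRam.similarWords("house").words[0]);
        }
    }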
French
fr-general-20151126-lm-sim
This French word space contains lemmata.
Word space type: SIM
Word space size: 2.1 gigabytes
Corpus size: 1.9 billion tokens
Number of queryable words: 276,967
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of stop words, removal of all words with frequency < 50, and lemmatization.
- Stop word list: stopword-list_fr_utf8.txt
Parameters used for word space computation: Context window of ±3 words, taking the exact position into account; the 50,000 most frequent lemmata as features; significance measure from Kolb 2009 with threshold 0.1; similarity measure from Lin 1998.
Corpus:
- 692,319,667 tokens: the French Wikipedia (dump from 4th August 2014)
- 598,392,935 tokens: News
- 520,189,432 tokens: debates from EU and UN
- 185,987,928 tokens: subtitles
- 2,093,280 tokens: books from Project Gutenberg
License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive fr-general-20151126-lm-sim.tar.bz2 and unpack it.
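Since stop words and all lemmata with corpus frequency < 50 were removed, a query can miss. In the published DISCO examples similarWords() returns null for a word that is not in the word space (treat this as an assumption and verify it for your API version), so a defensive lookup might look like this:

    import de.linguatools.disco.*;

    public class SafeLookupDemo {
        public static void main(String[] args) throws Exception {
            DISCO disco = DISCO.load("fr-general-20151126-lm-sim");
            String query = "maison"; // a lemma, as stored in this word space
            ReturnDataBN result = disco.similarWords(query);
            if (result == null) {
                // Stop words and rare lemmata were removed in preprocessing.
                System.out.println(query + " is not in the word space");
            } else {
                for (int i = 0; i < Math.min(10, result.words.length); i++) {
                    System.out.println(result.words[i] + "\t" + result.values[i]);
                }
            }
        }
    }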
fr-general-20151126-lm-word2vec-sim
This French word space contains lemmata. It was created using
word2vec and then converted into a DISCO word space with the import
functionality of DISCO Builder.
Word space type: SIM
Word space size: 1.7 gigabytes
Corpus size: 1.9 billion tokens
Number of queryable words: 281,484
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of all words with frequency < 50, and lemmatization.
Parameters used for word space computation: word2vec was run with the parameters -size 400 -negative 10 -min-count 50, producing a CBOW word space. The vector similarity measure used by DISCO Builder was COSINE.
Corpus:
- 692,319,667 tokens: the French Wikipedia (dump from 4th August 2014)
- 598,392,935 tokens: News
- 520,189,432 tokens: debates from EU and UN
- 185,987,928 tokens: subtitles
- 2,093,280 tokens: books from Project Gutenberg
License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive fr-general-20151126-lm-word2vec-sim.tar.bz2 and unpack it.
German
de-general-20150421-lm-sim
This German word space contains lemmata.
Word space type: SIM
Word space size: 3.5 gigabytes
Corpus size: 1.5 billion tokens
Number of queryable words: 470,788
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of stop words, removal of all words with frequency < 50, and lemmatization.
- Stop word list: stopword-list_de_utf8.txt
Parameters used for word space computation: Context window of ±3 words, taking the exact position into account; the 50,000 most frequent lemmata as features; significance measure from Kolb 2009 with threshold 0.1; similarity measure from Lin 1998.
Corpus:
- 747,547,646 tokens: the German Wikipedia (dump from 25th July 2014)
- 335,937,237 tokens: Webcrawl (2014)
- 317,677,649 tokens: Newspapers and magazines (1949-2007)
- 64,174,384 tokens: parliamentary debates
- 25,195,504 tokens: books (mainly fiction from the period 1850-1920) from Project Gutenberg
- 13,553,836 tokens: TV and movie subtitles
License: Creative Commons Attribution-NonCommercial 3.0 Unported
Download and installation: Download the archive de-general-20150421-lm-sim.tar.bz2 and unpack it.
de-general-20150421-lm-word2vec-sim
This German word space contains lemmata. It was created using
word2vec and then converted into a DISCOLuceneIndex word space with the import
functionality of DISCO Builder.
Word space type: SIM
Word space size: 3.0 gigabytes
Corpus size: 1.5 billion tokens
Number of queryable words: 470,788
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of all words with frequency < 50, and lemmatization.
Parameters used for word space computation: word2vec was run with the parameters -size 400 -negative 10 -min-count 50, producing a CBOW word space. The vector similarity measure used by DISCO Builder was COSINE.
Corpus:
- 747,547,646 tokens: the German Wikipedia (dump from 25th July 2014)
- 335,937,237 tokens: Webcrawl (2014)
- 317,677,649 tokens: Newspapers and magazines (1949-2007)
- 64,174,384 tokens: parliamentary debates
- 25,195,504 tokens: books (mainly fiction from the period 1850-1920) from Project Gutenberg
- 13,553,836 tokens: TV and movie subtitles
License: Creative Commons Attribution-NonCommercial 3.0 Unported
Download and installation: Download the archive de-general-20150421-lm-word2vec-sim.tar.bz2 and unpack it.
de-general-20150421-lm-word2vec-sim.denseMatrix
This German word space contains lemmata. It was created using
word2vec and then converted into a DenseMatrix DISCO word space with the import
functionality of DISCO Builder.
Word space type: SIM
Word space size: 1.8 gigabytes
Corpus size: 1.5 billion tokens
Number of queryable words: 470,788
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of all words with frequency < 50, and lemmatization.
Parameters used for word space computation: word2vec was run with the parameters -size 400 -negative 10 -min-count 50, producing a CBOW word space. The vector similarity measure used by DISCO Builder was COSINE.
Corpus:
- 747,547,646 tokens: the German Wikipedia (dump from 25th July 2014)
- 335,937,237 tokens: Webcrawl (2014)
- 317,677,649 tokens: Newspapers and magazines (1949-2007)
- 64,174,384 tokens: parliamentary debates
- 25,195,504 tokens: books (mainly fiction from the period 1850-1920) from Project Gutenberg
- 13,553,836 tokens: TV and movie subtitles
License: Creative Commons Attribution-NonCommercial 3.0 Unported
Download and installation: Download the archive de-general-20150421-lm-word2vec-sim.denseMatrix.bz2 and unpack it.
Russian
ru-ruwac-ruwiki-lm-sim
This Russian word space contains lemmata.
Word space type: SIM
Word space size: 2.8 gigabytes
Corpus size: 2.2 billion tokens
Number of queryable words: 226,108
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of stop words, removal of all words with frequency < 100, and lemmatization.
- Stop word list: stopword-list_ru_utf8.txt
Parameters used for word space computation: Context window of ±3 words, taking the exact position into account; the 50,000 most frequent lemmata as features; significance measure from Kolb 2009 with threshold 0.1; similarity measure from Lin 1998.
Corpus:
- the Russian web corpus ruWaC
- the Russian Wikipedia
License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive ru-ruwac-ruwiki-lm-sim.tar.bz2 and unpack it.
ru-ruwac-ruwiki-lm-word2vec-sim
This Russian word space contains lemmata. It was created using
word2vec and then converted into a DISCO word space with the import
functionality of DISCO Builder.
Word space type: SIM
Word space size: 2.6 gigabytes
Corpus size: 2.2 billion tokens
Number of queryable words: 226,108
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of all words with frequency < 100, and lemmatization.
Parameters used for word space computation: word2vec was run with the parameters -size 400 -negative 10 -min-count 100, producing a CBOW word space. The vector similarity measure used by DISCO Builder was COSINE.
Corpus:
- the Russian web corpus ruWaC
- the Russian Wikipedia
License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive ru-ruwac-ruwiki-lm-word2vec-sim.tar.bz2 and unpack it.
ru-ruwac-ruwiki-lem-col
This Russian word space contains lemmata.
Word space type: COL
Word space size: 2.3 gigabytes
Corpus size: 2.2 billion tokens
Number of queryable words: 226,108
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of stop words, removal of all words with frequency < 100, and lemmatization.
- Stop word list: stopword-list_ru_utf8.txt
Parameters used for word space computation: Context window of ±5 words; the 200,000 most frequent lemmata as features; significance measure LOGLIKELIHOOD with threshold 7.0.
Corpus:
- the Russian web corpus ruWaC
- the Russian Wikipedia
License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive ru-ruwac-ruwiki-lem-col.tar.bz2 and unpack it.
ru-ruwac-ruwiki-col
This Russian word space contains word forms.
Word space type: COL
Word space size: 4.9 gigabytes
Corpus size: 2.2 billion tokens
Number of queryable words: 508,350
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of stop words, removal of all words with frequency < 100.
- Stop word list: stopword-list_ru_utf8.txt
Parameters used for word space computation: Context window of ±3 words; the 50,000 most frequent word forms as features; significance measure from Kolb 2009 with threshold 0.5.
Corpus:
- the Russian web corpus ruWaC
- the Russian Wikipedia
License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive ru-ruwac-ruwiki-col.tar.bz2 and unpack it.
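In contrast to the SIM word spaces above, the COL word spaces store each word's significant context words, which can be retrieved as collocations. A minimal sketch: the collocations() method returning an array of ReturnDataCol objects follows the DISCO 1.x examples and should be verified against the API 2.0 javadoc; the query word is illustrative.

    import de.linguatools.disco.*;

    public class CollocationDemo {
        public static void main(String[] args) throws Exception {
            DISCO disco = DISCO.load("ru-ruwac-ruwiki-col");
            // Retrieve the stored context words (collocations) of a
            // Russian word form together with their significance values.
            ReturnDataCol[] cols = disco.collocations("дом");
            for (ReturnDataCol c : cols) {
                System.out.println(c.word + "\t" + c.value);
            }
        }
    }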