DISCO - Download
The DISCO API is open source and licensed under the Apache License, version 2.0.
You need the Java archive disco-3.0.0-all.jar (the DISCO API) and a word space from the table below. Click on a link in the column Word Space Name for a word space description and the download link. Note that the Protege plugin only works with word spaces for DISCO API version 1!
You can construct word spaces from your own corpora with DISCO Builder. It can also import vector files produced with other tools such as fastText, word2vec, or GloVe.
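As a rough illustration of what such an import involves, the sketch below parses the plain-text vector format that word2vec and fastText can write: an optional header line with word count and dimensionality, then one word per line followed by its vector components. GloVe text files use the same layout without the header. The class VectorFileSketch and its method parseVectors are hypothetical names chosen for this example; this is not DISCO Builder's import code, only a minimal sketch of the file layout.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of DISCO or DISCO Builder.
final class VectorFileSketch {

    // Parses lines of the form "word v1 v2 ... vn" into a word -> vector map.
    // A leading "<count> <dimensions>" header line, if present, is skipped.
    static Map<String, float[]> parseVectors(List<String> lines) {
        Map<String, float[]> vectors = new LinkedHashMap<>();
        boolean firstLine = true;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] parts = line.split("\\s+");
            if (firstLine) {
                firstLine = false;
                // Assumption: a first line made of exactly two integers is a header.
                if (parts.length == 2
                        && parts[0].matches("\\d+")
                        && parts[1].matches("\\d+")) {
                    continue;
                }
            }
            float[] vector = new float[parts.length - 1];
            for (int i = 1; i < parts.length; i++) {
                vector[i - 1] = Float.parseFloat(parts[i]);
            }
            vectors.put(parts[0], vector);
        }
        return vectors;
    }

    public static void main(String[] args) {
        // Made-up sample data in word2vec text format.
        List<String> sample = Arrays.asList(
                "2 3",
                "king 0.1 0.2 0.3",
                "queen 0.2 0.1 0.4");
        Map<String, float[]> vectors = parseVectors(sample);
        for (Map.Entry<String, float[]> e : vectors.entrySet()) {
            System.out.println(e.getKey() + " -> " + Arrays.toString(e.getValue()));
        }
    }
}
```

For real vector files with millions of words, reading line by line from a stream instead of a List would keep memory use down.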
Other Downloads:
- The source code of the DISCO API is on GitHub: https://github.com/linguatools/disco
- See the DISCO page on JitPack for how to include the DISCO API in your Maven or Gradle project
- API documentation: javadoc
Language | Word Space Name | Corpus Size | Number of Words | Storage Type | Word Space Type | API Version | License |
Arabic | ar-cc-fasttext-col | unknown | 2,000,000 | DenseMatrix | COL | 3.x | |
Arabic | ar-cc-fasttext-sim | unknown | 2,000,000 | DenseMatrix | SIM | 3.x | |
English | enwiki-20130403-sim-lemma-mwl-lc | 1.9 billion tokens | 420,184 | DISCOLuceneIndex | SIM | 2.x, 3.x | |
English | enwiki-20130403-word2vec-lm-mwl-lc-sim | 1.9 billion tokens | 420,184 | DenseMatrix | SIM | 3.x | |
English | enwiki-20130403-word2vec-lm-mwl-lc-sim | 1.9 billion tokens | 420,184 | DISCOLuceneIndex | SIM | 2.x, 3.x | |
French | fr-general-20151126-lm-sim | 1.9 billion tokens | 276,967 | DISCOLuceneIndex | SIM | 2.x, 3.x | |
French | fr-general-20151126-lm-word2vec-sim | 1.9 billion tokens | 281,484 | DISCOLuceneIndex | SIM | 2.x, 3.x | |
German | de-general-20150421-lm-word2vec-sim | 1.5 billion tokens | 470,788 | DenseMatrix | SIM | 3.x | |
German | de-general-20150421-lm-sim | 1.5 billion tokens | 470,788 | DISCOLuceneIndex | SIM | 2.x, 3.x | |
German | de-general-20150503-lm-word2vec-sim | 1.5 billion tokens | 470,788 | DISCOLuceneIndex | SIM | 2.x, 3.x | |
Russian | ru-ruwac-ruwiki-lm-sim | 2.2 billion tokens | 226,108 | DISCOLuceneIndex | SIM | 2.x, 3.x | |
Russian | ru-ruwac-ruwiki-lm-word2vec-sim | 2.2 billion tokens | 226,108 | DISCOLuceneIndex | SIM | 2.x, 3.x | |
Russian | ru-ruwac-ruwiki-lem-col | 2.2 billion tokens | 226,108 | DISCOLuceneIndex | COL | 2.x, 3.x | |
Russian | ru-ruwac-ruwiki-col | 2.2 billion tokens | 508,350 | DISCOLuceneIndex | COL | 2.x, 3.x | |
You can find word spaces for the old DISCO API version 1.x at the bottom of this page!
If you want to use any of the above word spaces that carry a CC-BY-NC license commercially, please contact peter.kolb@linguatools.org.
Version history
28. June 2018 DISCO API version 3.0
Version 3 introduces a new storage class, DenseMatrix, that is suited to low-dimensional word embeddings. Additionally, there are several new methods in class Compositionality, such as computing the shortest path between two words in a word space of type SIM, and an approximate nearest neighbor search that finds the most similar word for a given word or word embedding.
DISCO API now has Sux4J as an additional dependency.
Finally, DISCO API is now a Gradle project; see the GitHub repository.
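The idea of a shortest path in a SIM word space can be pictured as a breadth-first search over a graph in which each word points to its most similar words. This page does not say how Compositionality computes the path, so the following is only a sketch under assumptions: the class ShortestPathSketch, the method shortestPath, and the toy neighbour lists are all hypothetical, and edges are treated as unweighted.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not the DISCO implementation.
final class ShortestPathSketch {

    // Breadth-first search from start to goal over "most similar word" links.
    // Returns the words on the path, or an empty list if goal is unreachable.
    static List<String> shortestPath(Map<String, List<String>> neighbours,
                                     String start, String goal) {
        if (start.equals(goal)) {
            return Collections.singletonList(start);
        }
        Map<String, String> parent = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        parent.put(start, null); // marks start as visited
        queue.add(start);
        while (!queue.isEmpty()) {
            String word = queue.remove();
            List<String> similar = neighbours.getOrDefault(word, Collections.<String>emptyList());
            for (String next : similar) {
                if (parent.containsKey(next)) {
                    continue; // already visited
                }
                parent.put(next, word);
                if (next.equals(goal)) {
                    // Walk the parent links back to the start word.
                    LinkedList<String> path = new LinkedList<>();
                    for (String w = goal; w != null; w = parent.get(w)) {
                        path.addFirst(w);
                    }
                    return path;
                }
                queue.add(next);
            }
        }
        return Collections.emptyList();
    }

    // Made-up neighbour lists standing in for a SIM word space.
    static Map<String, List<String>> toyGraph() {
        Map<String, List<String>> g = new HashMap<>();
        g.put("car", Arrays.asList("automobile", "truck"));
        g.put("automobile", Arrays.asList("vehicle"));
        g.put("truck", Arrays.asList("lorry"));
        g.put("vehicle", Arrays.asList("bicycle"));
        return g;
    }

    public static void main(String[] args) {
        System.out.println(shortestPath(toyGraph(), "car", "bicycle"));
    }
}
```

In a real word space the neighbour lists would come from the stored most similar words of each entry, and a weighted search could take the similarity values into account.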
11. August 2015 DISCO API version 2.1
Minor update to the methods DISCO.semanticSimilarity, TextSimilarity.textSimilarity, and TextSimilarity.directedTextSimilarity. In API version 2.1, the desired similarity measure must be passed to these methods. This is necessary because the similarity measure SimilarityMeasure.KOLB, which was previously used by default, does not produce sensible results with word spaces imported from word2vec.
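For context, cosine similarity is a measure commonly used with word2vec vectors. The sketch below shows how it is computed for two dense vectors. It is a standalone illustration, not the DISCO implementation, and the class name CosineSketch and the sample vectors are made up for this example.

```java
// Hypothetical illustration; not part of the DISCO API.
final class CosineSketch {

    // Cosine of the angle between two vectors of equal length.
    // Returns 0.0 if either vector has zero length (a convention chosen here).
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        float[] x = {1f, 2f, 3f};
        float[] y = {2f, 4f, 6f};
        // Parallel vectors have a cosine similarity of (approximately) 1.
        System.out.println(cosine(x, y));
    }
}
```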
- DISCO API version 2.1: disco-2.1.jar
- Source code of DISCO API v2.1: disco-2.1-src.tar.gz
- Example Java class: UseDISCO.java
- API documentation: javadoc
20. May 2015 DISCO API version 2.0
With the release of DISCO Builder, the structure of the word spaces has changed. As a result, API version 2.0 is not compatible with older word spaces. There are now two types of word spaces: SIM and COL.
Many methods have been added to the DISCO API, including methods for computing text similarity, textual entailment, and clustering of similar words. See the API documentation for more information.
The API now uses version 5.1.0 of Lucene.
- DISCO API version 2.0: disco-2.0.jar
- Source code of DISCO API: disco-2.0-src.tar.gz
- Example Java class: UseDISCO.java
- API documentation: javadoc
28. February 2013 DISCO API version 1.4
The API contains a new class Compositionality with methods for the computation of the compositional similarity between multi-token words or phrases. Also, the method DISCO.commonContext was added to the API.
- DISCO API version 1.4: disco-1.4.jar
- Source code of DISCO API 1.4: disco-1.4-src.zip
- Example Java class: UseDISCO.java
- API documentation for version 1.4: javadoc
16. March 2012 DISCO API version 1.3
The API now uses Lucene 3.5, the latest version at the time of this release.
The command-line option -wl prints the complete word frequency list of the language data packet to a text file. This option can also be used to check a downloaded language data packet for errors.
- DISCO API version 1.3: disco-1.3.jar
- Source code of DISCO API: disco-1.3-src.zip
- API documentation for version 1.3: javadoc
24. March 2011 DISCO API version 1.2
Version 1.2 of the DISCO API can load language data packets (word spaces) into main memory, provided that you have enough RAM. This greatly reduces computation time.
- DISCO API version 1.2: disco-1.2.jar
- Source code of DISCO API: disco-1.2-src.zip
- Example Java class: UseDISCO.java
- API documentation for version 1.2: javadoc
18. September 2008 DISCO API version 1.1
First version of the API to be available online.
- DISCO API version 1.1: disco-1.1.jar
- Source code of DISCO API: disco-1.1-src.zip
- Example Java class: UseDISCOv1_1.java
- API documentation for version 1.1: javadoc
Word spaces for old DISCO API version 1.x
These word spaces work with the Protege plugin.
Language | Word Space Name | Corpus Size | Number of Words | Packet Size | API Version | License |
Arabic | ar-general-20120124 | 188 million tokens | 134,000 | 518 MB | 1.x | |
Czech | cz-general-20080115 | 163 million tokens | 300,000 | 5.6 GB | 1.x | Apache License 2.0 |
Dutch | nl-general-20081004 | 114 million tokens | 200,000 | 4.0 GB | 1.x | Apache License 2.0 |
English | enwiki-20130403-sim-lemma-mwl-lc | 1.9 billion tokens | 420,184 | 2.3 GB | 1.x | Apache License 2.0 |
English | en-BNC-20080721 | 119 million tokens | 122,000 | 1.7 GB | 1.x | Apache License 2.0 |
English | en-PubMedOA-20070501 | 181 million tokens | 60,000 | 864 MB | 1.x | Apache License 2.0 |
English | en-wikipedia-20080101 | 267 million tokens | 220,000 | 5.9 GB | 1.x | Apache License 2.0 |
French | fr-wikipedia-20110201-lemma | 458 million tokens | 154,000 | 513 MB | 1.x | Apache License 2.0 |
French | fr-wikipedia-20080713 | 105 million tokens | 188,000 | 2.4 GB | 1.x | Apache License 2.0 |
German | de-general-20131219-sim | 977 million tokens | 246,119 | 2.2 GB | 1.x | Apache License 2.0 |
German | de-general-20080727 | 400 million tokens | 200,000 | 3.6 GB | 1.x | |
Italian | it-general-20080815 | 104 million tokens | 164,000 | 2.3 GB | 1.x | Apache License 2.0 |
Russian | ru-wikipedia-20110804 | 230 million tokens | 112,000 | 544 MB | 1.x | Apache License 2.0 |
Spanish | es-general-20080720 | 232 million tokens | 260,000 | 5.0 GB | 1.x | |