DISCO - Download
The DISCO API is open source and licensed under the Apache License, version 2.0.
You need the Java archive disco-2.1.jar (the DISCO API) and a word space from the table below. Click a link in the Word Space Name column for a description of the word space and its download link. Note that older word spaces are not compatible with DISCO API version 2.x, and that the Protege plugin only works with word spaces for DISCO API version 1.
You can construct word spaces from your own corpora with DISCO Builder. It can also import vector files that were produced with other tools such as word2vec or GloVe.
Other Downloads:
- Source code of DISCO API is now on Github: https://github.com/linguatools/disco
- API documentation: javadoc
Language | Word Space Name | Corpus Size | Number of Words | Packet Size | Word Space Type | API Version | License |
English | enwiki-20130403-sim-lemma-mwl-lc | 1.9 billion tokens | 420,184 | 2.3 GB | SIM | 2.0 | |
English | enwiki-20130403-word2vec-lm-mwl-lc-sim | 1.9 billion tokens | 420,184 | 1.4 GB | SIM | 2.0 | |
French | fr-general-20151126-lm-sim | 1.9 billion tokens | 276,967 | 2.1 GB | SIM | 2.0 | |
French | fr-general-20151126-lm-word2vec-sim | 1.9 billion tokens | 281,484 | 1.7 GB | SIM | 2.0 | |
German | de-general-20150421-lm-sim | 1.5 billion tokens | 470,788 | 3.5 GB | SIM | 2.0 | |
German | de-general-20150503-lm-word2vec-sim | 1.5 billion tokens | 470,788 | 3.0 GB | SIM | 2.0 | |
Russian | ru-ruwac-ruwiki-lm-sim | 2.2 billion tokens | 226,108 | 2.8 GB | SIM | 2.0 | |
Russian | ru-ruwac-ruwiki-lm-word2vec-sim | 2.2 billion tokens | 226,108 | 2.6 GB | SIM | 2.0 | |
Russian | ru-ruwac-ruwiki-lem-col | 2.2 billion tokens | 226,108 | 2.3 GB | COL | 2.0 | |
Russian | ru-ruwac-ruwiki-col | 2.2 billion tokens | 508,350 | 4.9 GB | COL | 2.0 | |
You can find word spaces for the old DISCO API version 1.x at the bottom of this page!
If you want to use the above word spaces with a CC-BY-NC license commercially, please contact peter.kolb@linguatools.org.
Version history
11. August 2015 DISCO API version 2.1
Minor update concerning the methods DISCO.semanticSimilarity, TextSimilarity.textSimilarity, and TextSimilarity.directedTextSimilarity. In API version 2.1, the desired similarity measure must be passed to these methods. This is necessary because the similarity measure SimilarityMeasure.KOLB, which was previously used by default, does not produce sensible results with word spaces imported from word2vec.
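The idea of passing the similarity measure explicitly, rather than relying on a default, can be illustrated with a small standalone sketch. It does not use the DISCO API: the class SimilaritySketch, the enum Measure, and the method similarity are hypothetical names made up for this example. The sketch assumes that cosine similarity is the measure one would typically choose for dense vectors such as those produced by word2vec.

```java
public class SimilaritySketch {

    // Hypothetical enum for this sketch only; it is NOT the DISCO SimilarityMeasure type.
    enum Measure { COSINE, DOT_PRODUCT }

    // Plain dot product of two vectors of equal length.
    static double dot(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // The caller has to name the measure; there is no silent default.
    static double similarity(double[] a, double[] b, Measure measure) {
        double d = dot(a, b);
        if (measure == Measure.DOT_PRODUCT) {
            return d;
        }
        // COSINE: dot product divided by the product of the vector lengths.
        double norm = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b));
        return norm == 0.0 ? 0.0 : d / norm;
    }

    public static void main(String[] args) {
        // Toy vectors standing in for word vectors; the values are invented.
        double[] v1 = {1.0, 2.0, 3.0};
        double[] v2 = {2.0, 4.0, 6.0};
        System.out.println("cosine:      " + similarity(v1, v2, Measure.COSINE));
        System.out.println("dot product: " + similarity(v1, v2, Measure.DOT_PRODUCT));
    }
}
```

With an explicit parameter, the same pair of vectors can be scored differently depending on the measure, so the choice is visible at every call site instead of being hidden in a default.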
20. May 2015 DISCO API version 2.0
With the release of DISCO Builder, the structure of the word spaces has changed. Therefore, the new API version 2.0 is not compatible with older word spaces. There are now two types of word spaces: SIM and COL.
Several methods have been added to the DISCO API, including methods for computing text similarity, textual entailment, and clustering of similar words. See the API documentation for more information.
The API now uses version 5.1.0 of Lucene.
- DISCO API version 2.0: disco-2.0.jar
- Source code of DISCO API: disco-2.0-src.tar.gz
- Example Java class: UseDISCO.java
- API documentation: javadoc
28. February 2013 DISCO API version 1.4
The API contains a new class Compositionality with methods for the computation of the compositional similarity between multi-token words or phrases. Also, the method DISCO.commonContext was added to the API.
- DISCO API version 1.4: disco-1.4.jar
- Source code of DISCO API 1.4: disco-1.4-src.zip
- Example Java class: UseDISCO.java
- API documentation for version 1.4: javadoc
16. March 2012 DISCO API version 1.3
The API now uses the latest version 3.5 of Lucene.
The command line option -wl prints the complete word frequency list of the language data packet to a text file. This option can also be used to check a downloaded language data packet for errors.
- DISCO API version 1.3: disco-1.3.jar
- Source code of DISCO API: disco-1.3-src.zip
- API documentation for version 1.3: javadoc
24. March 2011 DISCO API version 1.2
Version 1.2 of the DISCO API can load language data packets (word spaces) into main memory, provided that you have enough RAM. This greatly reduces computation time.
- DISCO API version 1.2: disco-1.2.jar
- Source code of DISCO API: disco-1.2-src.zip
- Example Java class: UseDISCO.java
- API documentation for version 1.2: javadoc
18. September 2008 DISCO API version 1.1
First version of the API available online.
- DISCO API version 1.1: disco-1.1.jar
- Source code of DISCO API: disco-1.1-src.zip
- Example Java class: UseDISCOv1_1.java
- API documentation for version 1.1: javadoc
Word spaces for old DISCO API version 1.x
These word spaces work with the Protege plugin.
Language | Word Space Name | Corpus Size | Number of Words | Packet Size | API Version | License |
Arabic | ar-general-20120124 | 188 million tokens | 134,000 | 518 MB | 1.x | |
Czech | cz-general-20080115 | 163 million tokens | 300,000 | 5.6 GB | 1.x | Apache License 2.0 |
Dutch | nl-general-20081004 | 114 million tokens | 200,000 | 4.0 GB | 1.x | Apache License 2.0 |
English | enwiki-20130403-sim-lemma-mwl-lc | 1.9 billion tokens | 420,184 | 2.3 GB | 1.x | Apache License 2.0 |
English | en-BNC-20080721 | 119 million tokens | 122,000 | 1.7 GB | 1.x | Apache License 2.0 |
English | en-PubMedOA-20070501 | 181 million tokens | 60,000 | 864 MB | 1.x | Apache License 2.0 |
English | en-wikipedia-20080101 | 267 million tokens | 220,000 | 5.9 GB | 1.x | Apache License 2.0 |
French | fr-wikipedia-20110201-lemma | 458 million tokens | 154,000 | 513 MB | 1.x | Apache License 2.0 |
French | fr-wikipedia-20080713 | 105 million tokens | 188,000 | 2.4 GB | 1.x | Apache License 2.0 |
German | de-general-20131219-sim | 977 million tokens | 246,119 | 2.2 GB | 1.x | Apache License 2.0 |
German | de-general-20080727 | 400 million tokens | 200,000 | 3.6 GB | 1.x | |
Italian | it-general-20080815 | 104 million tokens | 164,000 | 2.3 GB | 1.x | Apache License 2.0 |
Russian | ru-wikipedia-20110804 | 230 million tokens | 112,000 | 544 MB | 1.x | Apache License 2.0 |
Spanish | es-general-20080720 | 232 million tokens | 260,000 | 5.0 GB | 1.x | |