DISCO - Download

The DISCO API is open source and licensed under the Apache License, version 2.0.

You need the Java archive disco-3.0.0-all.jar (the DISCO API) and a word space from the table below. Click on a link in the column Word Space Name for a word space description and the download link. Note that the Protege plugin only works with word spaces for DISCO API version 1!

You can construct word spaces from your own corpora with DISCO Builder. It can also import vector files produced with other tools such as fastText, word2vec or GloVe.
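
As a rough illustration of what importing such a vector file involves, the following Java sketch reads the plain-text format that word2vec and fastText write: an optional header line giving word count and dimensionality, then one word followed by its vector components per line (GloVe text files have no header). The class name VectorFileSketch and its parse method are hypothetical and are not part of DISCO Builder or the DISCO API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of DISCO: parses text-format word vectors.
class VectorFileSketch {

    // Each data line is "word v1 v2 ... vn"; a first line of the form
    // "<count> <dim>" (two integers) is treated as a header and skipped.
    static Map<String, float[]> parse(List<String> lines) {
        Map<String, float[]> vectors = new LinkedHashMap<>();
        for (int n = 0; n < lines.size(); n++) {
            String[] parts = lines.get(n).trim().split("\\s+");
            if (parts.length < 2) {
                continue; // blank or malformed line
            }
            boolean header = n == 0 && parts.length == 2
                    && parts[0].matches("\\d+") && parts[1].matches("\\d+");
            if (header) {
                continue;
            }
            float[] vector = new float[parts.length - 1];
            for (int i = 1; i < parts.length; i++) {
                vector[i - 1] = Float.parseFloat(parts[i]);
            }
            vectors.put(parts[0], vector);
        }
        return vectors;
    }

    public static void main(String[] args) {
        // Tiny in-memory sample standing in for a real vector file.
        List<String> sample = Arrays.asList(
                "2 3",
                "cat 0.1 0.2 0.3",
                "dog 0.4 0.5 0.6");
        Map<String, float[]> vectors = parse(sample);
        for (Map.Entry<String, float[]> e : vectors.entrySet()) {
            System.out.println(e.getKey() + " " + Arrays.toString(e.getValue()));
        }
    }
}
```

For a real file, the lines could be read with java.nio.file.Files.readAllLines, although files with millions of words are better streamed line by line.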

Other Downloads:

The source code of the DISCO API is on GitHub: https://github.com/linguatools/disco
See the DISCO page on JitPack for how to include the DISCO API in your Maven or Gradle project
API documentation: javadoc

Language	Word Space Name	Corpus Size	Number of Words	Storage type	Word Space Type	API Version	License
Arabic	ar-cc-fasttext-col	unknown	2,000,000	DenseMatrix	COL	3.x
	ar-cc-fasttext-sim	unknown	2,000,000	DenseMatrix	SIM	3.x
English	enwiki-20130403-sim-lemma-mwl-lc	1.9 billion tokens	420,184	DISCOLuceneIndex	SIM	2.x, 3.x
	enwiki-20130403-word2vec-lm-mwl-lc-sim	1.9 billion tokens	420,184	DenseMatrix	SIM	3.x
	enwiki-20130403-word2vec-lm-mwl-lc-sim	1.9 billion tokens	420,184	DISCOLuceneIndex	SIM	2.x, 3.x
French	fr-general-20151126-lm-sim	1.9 billion tokens	276,967	DISCOLuceneIndex	SIM	2.x, 3.x
	fr-general-20151126-lm-word2vec-sim	1.9 billion tokens	281,484	DISCOLuceneIndex	SIM	2.x, 3.x
German	de-general-20150421-lm-word2vec-sim	1.5 billion tokens	470,788	DenseMatrix	SIM	3.x
	de-general-20150421-lm-sim	1.5 billion tokens	470,788	DISCOLuceneIndex	SIM	2.x, 3.x
	de-general-20150503-lm-word2vec-sim	1.5 billion tokens	470,788	DISCOLuceneIndex	SIM	2.x, 3.x
Russian	ru-ruwac-ruwiki-lm-sim	2.2 billion tokens	226,108	DISCOLuceneIndex	SIM	2.x, 3.x
	ru-ruwac-ruwiki-lm-word2vec-sim	2.2 billion tokens	226,108	DISCOLuceneIndex	SIM	2.x, 3.x
	ru-ruwac-ruwiki-lem-col	2.2 billion tokens	226,108	DISCOLuceneIndex	COL	2.x, 3.x
	ru-ruwac-ruwiki-col	2.2 billion tokens	508,350	DISCOLuceneIndex	COL	2.x, 3.x

You can find word spaces for the old DISCO API version 1.x at the bottom of this page!
If you want to make commercial use of any of the above word spaces that carry a CC-BY-NC license, please contact peter.kolb@linguatools.org.

Version history

28. June 2018 DISCO API version 3.0
Version 3 introduces a new storage class DenseMatrix that is suited for low-dimensional word embeddings. Additionally, there are several new methods in class Compositionality, like computing the shortest path between two words in a word space of type SIM, or an approximate nearest neighbor search to find the most similar word for a given word or word embedding.
DISCO API now has Sux4J as an additional dependency.
Finally, DISCO API is now a Gradle project - see the GitHub repository.
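
To make the shortest-path idea concrete, here is a minimal sketch that treats a SIM word space as a graph in which each word points to its stored most similar words, and finds a path with the fewest hops by breadth-first search. This is an assumption about how such a search can be done, not a description of the method in class Compositionality; the class ShortestPathSketch, its method, and the toy neighbour lists are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of DISCO: breadth-first search over
// "most similar word" lists, as a SIM word space might provide them.
class ShortestPathSketch {

    // Returns the words on a shortest path from start to goal (inclusive),
    // or an empty list if goal cannot be reached.
    static List<String> shortestPath(Map<String, List<String>> neighbours,
                                     String start, String goal) {
        Map<String, String> parent = new HashMap<>(); // word -> predecessor
        Deque<String> queue = new ArrayDeque<>();
        parent.put(start, null);
        queue.add(start);
        while (!queue.isEmpty()) {
            String word = queue.poll();
            if (word.equals(goal)) {
                // Walk the predecessor chain back to the start.
                LinkedList<String> path = new LinkedList<>();
                for (String w = word; w != null; w = parent.get(w)) {
                    path.addFirst(w);
                }
                return path;
            }
            List<String> similar =
                    neighbours.getOrDefault(word, Collections.<String>emptyList());
            for (String next : similar) {
                if (!parent.containsKey(next)) { // not visited yet
                    parent.put(next, word);
                    queue.add(next);
                }
            }
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        // Toy neighbour lists; a real word space would supply these.
        Map<String, List<String>> neighbours = new HashMap<>();
        neighbours.put("car", Arrays.asList("automobile", "truck"));
        neighbours.put("automobile", Arrays.asList("vehicle"));
        neighbours.put("truck", Arrays.asList("lorry", "van"));
        neighbours.put("vehicle", Arrays.asList("bicycle"));
        System.out.println(shortestPath(neighbours, "car", "bicycle"));
    }
}
```

Breadth-first search ignores the similarity values themselves; if edge weights derived from those values should count, Dijkstra's algorithm would be the natural replacement.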

11. August 2015 DISCO API version 2.1
Minor update concerning the methods DISCO.semanticSimilarity, TextSimilarity.textSimilarity, and TextSimilarity.directedTextSimilarity. In API version 2.1, the desired similarity measure must be passed to these methods. This is necessary because the similarity measure SimilarityMeasure.KOLB, which was previously used by default, does not produce sensible results with word spaces imported from word2vec.
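
For word vectors produced by word2vec, the measure commonly used is the cosine of the angle between two vectors. The sketch below shows that computation on plain arrays. It is a standalone illustration with a hypothetical class name (CosineSketch) and does not call the DISCO API; which constants the API's SimilarityMeasure offers should be checked in the javadoc.

```java
// Hypothetical helper, not part of DISCO: cosine similarity of two
// dense vectors, the usual measure for word2vec-style embeddings.
class CosineSketch {

    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors differ in dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // convention chosen here for zero vectors
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0};
        double[] y = {2.0, 4.0, 6.0};   // same direction as x
        double[] z = {-3.0, 0.0, 1.0};  // orthogonal to x
        System.out.println(cosine(x, y));
        System.out.println(cosine(x, z));
    }
}
```

The result lies between -1 and 1 and depends only on the direction of the vectors, not on their length.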

DISCO API version 2.1: disco-2.1.jar
Source code of DISCO API v2.1: disco-2.1-src.tar.gz
Example Java class: UseDISCO.java
API documentation: javadoc

20. May 2015 DISCO API version 2.0
With the release of DISCO Builder the structure of the word spaces has been changed. Therefore, the new API version 2.0 is not compatible with older word spaces. There are now two types of word spaces: SIM and COL.
Quite a number of methods have been added to the DISCO API, including methods for computing text similarity, textual entailment, and clustering of similar words. See the API documentation for more information.
The API now uses version 5.1.0 of Lucene.

DISCO API version 2.0: disco-2.0.jar
Source code of DISCO API: disco-2.0-src.tar.gz
Example Java class: UseDISCO.java
API documentation: javadoc

28. February 2013 DISCO API version 1.4
The API contains a new class Compositionality with methods for the computation of the compositional similarity between multi-token words or phrases. Also, the method DISCO.commonContext was added to the API.

DISCO API version 1.4: disco-1.4.jar
Source code of DISCO API 1.4: disco-1.4-src.zip
Example Java class: UseDISCO.java
API documentation for version 1.4: javadoc

16. March 2012 DISCO API version 1.3
The API now uses the latest version 3.5 of Lucene.
The command line option -wl prints the complete word frequency list of the language data packet to a text file. This option can also be used to check a downloaded language data packet for errors.

DISCO API version 1.3: disco-1.3.jar
Source code of DISCO API: disco-1.3-src.zip
API documentation for version 1.3: javadoc

24. March 2011 DISCO API version 1.2
Version 1.2 of the DISCO API can load language data packets (word spaces) into main memory, provided that you have enough RAM. This greatly reduces computation time.

DISCO API version 1.2: disco-1.2.jar
Source code of DISCO API: disco-1.2-src.zip
Example Java class: UseDISCO.java
API documentation for version 1.2: javadoc

18. September 2008 DISCO API version 1.1
First version of the API made available online.

DISCO API version 1.1: disco-1.1.jar
Source code of DISCO API: disco-1.1-src.zip
Example Java class: UseDISCOv1_1.java
API documentation for version 1.1: javadoc

Word spaces for old DISCO API version 1.x

These word spaces work with the Protege plugin.

Language	Word Space Name	Corpus Size	Number of Words	Packet Size	API Version	License
Arabic	ar-general-20120124	188 million tokens	134,000	518 MB	1.x
Czech	cz-general-20080115	163 million tokens	300,000	5.6 GB	1.x	Apache License 2.0
Dutch	nl-general-20081004	114 million tokens	200,000	4.0 GB	1.x	Apache License 2.0
English	enwiki-20130403-sim-lemma-mwl-lc	1.9 billion tokens	420,184	2.3 GB	1.x	Apache License 2.0
	en-BNC-20080721	119 million tokens	122,000	1.7 GB	1.x	Apache License 2.0
	en-PubMedOA-20070501	181 million tokens	60,000	864 MB	1.x	Apache License 2.0
	en-wikipedia-20080101	267 million tokens	220,000	5.9 GB	1.x	Apache License 2.0
French	fr-wikipedia-20110201-lemma	458 million tokens	154,000	513 MB	1.x	Apache License 2.0
	fr-wikipedia-20080713	105 million tokens	188,000	2.4 GB	1.x	Apache License 2.0
German	de-general-20131219-sim	977 million tokens	246,119	2.2 GB	1.x	Apache License 2.0
	de-general-20080727	400 million tokens	200,000	3.6 GB	1.x
Italian	it-general-20080815	104 million tokens	164,000	2.3 GB	1.x	Apache License 2.0
Russian	ru-wikipedia-20110804	230 million tokens	112,000	544 MB	1.x	Apache License 2.0
Spanish	es-general-20080720	232 million tokens	260,000	5.0 GB	1.x