DISCO Builder
Create your own DISCO word spaces with DISCO Builder
Download and installation
Download DISCOBuilder-1.1.1.tar.bz2 and unpack it. You need a Java 8 Runtime Environment.
DISCO Builder is licensed under the
Creative Commons Attribution-NonCommercial license. For commercial use, please contact peter.kolb@linguatools.org.
Getting started
Follow the three steps below to create a DISCO word space from the small test corpus supplied with the DISCO Builder distribution. If this works go on and read the basics section. After that, you are prepared to create word spaces from your corpus as described in one of the sections Create a standard DISCO word space with words as features, Create a word space with documents as features, Create a word space from parsed text. All options are explained thoroughly in the options section. If you want to import a word space from word2vec or GloVe go to the import section.
1. Create an output directory.
2. Edit the file disco.config in the DISCO Builder directory. DISCO Builder is controlled via this configuration file. You only have to set the parameter outputDir to point to the directory that you created in step 1, and the parameter inputDir to point to the directory test-corpus-lemma in the DISCO Builder directory (see the sketch after step 3). Leave all other parameters unchanged.
3. Start DISCO Builder:
java -jar DISCOBuilder-1.1.0-all.jar disco.config
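For orientation, the two lines you edit in step 2 might look like this (the paths are only placeholders for the directories on your system):
inputDir=/home/me/DISCOBuilder/test-corpus-lemma
outputDir=/home/me/disco-output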
If the word space was created without error, you can examine the contents of the output directory. There will be a lot of files and one directory called DISCO-idx. This directory contains your new word space. Never change the names of any of the files in this directory! Otherwise, the DISCO API will not work with this word space any more. Also, do not edit the configuration file disco.config in the DISCO-idx directory. Some API methods read data from this file, and if they can't find it, you will get a CorruptConfigFileException.
You can, however, change the name of the word space directory itself. Give it a more informative name than DISCO-idx. The word space directory is self-contained, that means you can copy the word space directory to any location – the other files in the output directory are not needed for querying the word space with the DISCO API.
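Since the word space directory is self-contained, renaming or copying it is all that is needed; for example (all names are placeholders):
mv /home/me/disco-output/DISCO-idx /home/me/disco-output/test-corpus-space
cp -r /home/me/disco-output/test-corpus-space /data/wordspaces/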
Basics: Corpus preprocessing and feature types
In distributional semantics the meaning of a word is learned from the contexts where the word occurs. Basically, context can be defined in two ways:
- the other words around the target word. The plain words can be augmented with their relation to the target word (like window position or syntactic dependency) to build the features. Word spaces of this type are called word-by-word.
- the document in which the target word occurs. A document can also be a sentence or a paragraph. Here, the document IDs act as features. Word spaces of this type are called word-by-document.
In order to use features like paragraphs or syntactic relations, they have to be annotated in the input corpus. DISCO can process the following input formats:
- Tokenized text: In this case, the tokens (exactly as they appear in the text) act as words and as features at the same time. The tokens won't be changed by DISCO Builder in any way, i.e. there is no lowercasing or other normalization. The tokenized text may contain boundary tags like <p>, but these tags have to stand on a line of their own. For an example of what a tokenized text file looks like, see the files in the test-corpus-tokenized directory of the DISCO Builder distribution.
- Lemmatized text: This input format has three tab-separated columns per line: the token, a part-of-speech tag, and the base form (lemma) of the token. In other words, the raw text is augmented with two annotation layers, part-of-speech and lemma. Which layer will act as words or as features can be set in the configuration file with the options lemma and lemmaFeatures. However, the third column doesn't have to be the base form of the token; it may as well be the stem, the lowercased variant, or whatever else you may want to try. In conclusion, the "lemmatized text" input file format makes it possible to use features that are different from the tokens themselves (and still create word vectors for the tokens). The lemmatized text may contain boundary tags like <p> on a line of their own. For an example of this file format see the files in the test-corpus-lemmatized directory of the DISCO Builder distribution (a small sample is also sketched after this list).
- Parsed text: This has to be the output of a dependency parser. DISCO Builder can read dependency relations in the CoNLL-U format. The features will be words with their respective dependency relation (see below).
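To illustrate the lemmatized format, a few lines might look like the following (the sentence and the part-of-speech tags are invented for this illustration, not taken from the test corpus); the three columns are separated by tabs, and boundary tags stand on their own lines:
<s>
The	DT	the
dogs	NNS	dog
were	VBD	be
barking	VBG	bark
.	.	.
</s>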
Representation of the feature types in the final word space index
Depending on the chosen definition of context, the features will have one of the following three forms:
- word: the features are words. If the input type is lemmatized text, the words can either be the plain tokens from the corpus, or the lemmata (or whatever your input file has in the third column).
- word<SEP>relation: the features are words plus their specific relation to the target word. For tokenized text and lemmatized text, relation can be a window position. For parsed text, it is the dependency relation.
- ID: a number identifying a latent dimension or a document context. This is the case for all word spaces that were imported from other tools (see below) or that were created with documents as features (word-by-document word spaces).
If you build a word space index with the idea of retrieving collocations (like hair → tousled combed permed cropped dyed frizzed greying) make sure to choose settings where the features are words (possibly augmented with a relation), because the collocations will be the features of the word space. If you want to retrieve collocations from a word space where the features are IDs, you will only get a list of numbers.
The two DISCO word space types
The option dontCompute2ndOrder in the disco.config
configuration file determines which type of word space
will be generated.
The relation between a word and its features is called a first-order relation; word and feature co-occur in contexts. For instance, hair and comb co-occur significantly often, therefore comb makes a good feature to describe hair. Other good features for describing hair are grey, black, blonde, curly, dye and so on. All the significant features of a word combined constitute the word vector of the word. If we compare words based on the sets of their first-order relations (in other words, based on their word vectors), we get words that are used in similar contexts (the distributionally similar words). These are called the second-order relations, like hair - fur, beard. Words that are related via a second-order relation may never have occurred together in a context.
If you set dontCompute2ndOrder=true then these second-order relations will not be computed by DISCO Builder, creating a word space of type COL where only first-order relations are stored (i.e. the word vectors). If you set dontCompute2ndOrder=false then the second order relations will be computed and stored in the word space for fast retrieval, generating a word space of type SIM. A word space of type SIM stores word vectors (first-order relations) and additionally the most similar words for each word (second-order relations).
The advantage of COL word spaces is that they are smaller and faster to build. However, there are some methods of the DISCO API that only work with word spaces of type SIM. These are the following:
- DISCO.similarWords
- DISCO.secondOrderSimilarity
- Cluster.filterOutliers
- Cluster.growSet
- Cluster.clutoClusterSimilarityGraph
- Rank.rankSim
- Compositionality.similarWordsGraphSearch
- Compositionality.findShortestPath
Create a standard DISCO word space with words as features
You need a tokenized or lemmatized corpus (if you have a parsed corpus, see section Create a word space from parsed text). Put all your corpus files into one directory and set the option inputDir to point to this directory. Depending on the format of your corpus, set the option inputFileFormat to TOKENIZED (for tokenized text) or LEMMATIZED (for lemmatized text in three tab-separated columns per line). Create an output directory and set the option outputDir to point to this directory.
You should always use a stopword list. There are stopword lists for 15 languages in the directory stopword-lists of the DISCO Builder distribution. Set the option stopwordFile to point to your stopword list file.
Define a context window. There are two ways to define a window context:
- rightContext, leftContext (optionally position). The DISCO standard context window is rightContext=3, leftContext=3, position=true.
- openingTag, closingTag. This is experimental.
If you have lemmatized text (inputFileFormat=LEMMATIZED), then set lemmaFeatures=true to improve results.
Note that this will still produce word vectors for all wordforms, unless you also set lemma=true (see next
paragraph).
If you have tokenized text (inputFileFormat=TOKENIZED), you have to set lemmaFeatures=false.
If you comment the option out or leave it blank, the default value false will be used.
By setting the option lemma you can build word vectors for wordforms or for lemmata. If you set
lemma=true, vectors for lemmata will be created. The results are generally improved if you do this, but
remember that the resulting word space will only contain lemmata (you will get no result when querying
inflected forms like houses).
The default is false.
Set the minimum word frequency minFreq. The optimal value depends on the size of your corpus (the larger the corpus, the larger
the value for minFreq). For corpus sizes between 100 and 1000 million tokens a value between 20 and 200 is fine.
The default value is 100.
Set numberOfFeatureWords to some value between 10000 and 50000. The default value is 30000.
Option dontCompute2ndOrder determines the word space type. Default is false, i.e. build a SIM space.
After you have edited and saved the configuration file disco.config
start DISCO Builder by typing
java -Xmx<N> -jar DISCOBuilder-1.1.0-all.jar -threads <T> disco.config
with <N> being enough memory to hold the word space and <T> the number of threads to start when computing the most similar words for each word (this is only relevant if you are building a word space of type SIM).
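For example, a build with 8 GB of heap memory and 4 threads could be started like this (both values are only illustrative and depend on your corpus and machine):
java -Xmx8g -jar DISCOBuilder-1.1.0-all.jar -threads 4 disco.config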
In summary, you have to set the following options in disco.config
to create a standard DISCO word space:
inputDir=path/to/your/corpus/directory
outputDir=path/to/your/output/directory
# if you have a lemmatized corpus:
inputFileFormat=LEMMATIZED
stopwordFile=/home/peter/DISCO/DISCOBuilder-1.1.0/stopword-lists/stopword-list_de_utf8.txt
# depends on your corpus' annotations:
boundaryMarks=<doc>,<p>,</p>,</article>,<s>,</s>
rightContext=3
leftContext=3
position=true
# if inputFileFormat=LEMMATIZED:
lemmaFeatures=true
# this creates word vectors for lemmata but not for inflected forms:
lemma=true
# if inputFileFormat=TOKENIZED:
#lemmaFeatures=false
#lemma=false
minFreq=100
numberOfFeatureWords=30000
weightingMethod=lin
minWeight=0.1
similarityMeasure=KOLB
# word space type SIM:
dontCompute2ndOrder=false
findMultiTokenWords=false
wordByDocument=false
minimumWordLength=2
maximumWordLength=31
# all Unicode letters (\p{L}) plus the characters listed here:
allowedCharactersWord=.-\'_
minimumFeatureLength=2
maximumFeatureLength=31
# all Unicode letters (\p{L}) plus the characters listed here:
allowedCharactersFeature=.-\'_
# leave these blank:
openingTag=
closingTag=
existingCoocFile=
existingWeightFile=
addInverseRelations=
stopwords=
maxFreq=
tokencount=
vocabularySize=
discoVersion=
Create a word space with documents as features
Set parameter wordByDocument=true. Define what text segment should be your "document" by setting the parameters openingTag and closingTag. E.g. if you have a corpus where paragraphs are annotated with <p> and </p>, then set openingTag=<p> and closingTag=</p> to define paragraphs as document features.
Parameter numberOfFeatureWords will be ignored; the number of features will be equal to the number of documents (as defined by openingTag and closingTag) in the corpus.
Of course you also have to set the parameters inputDir, outputDir, stopwordFile, and inputFileFormat. For other parameters like dontCompute2ndOrder, lemmaFeatures, lemma, and minFreq the same applies as stated in the previous section.
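Putting this together, the relevant lines of disco.config for a word space with paragraphs as document features could look like this (paths are placeholders, all other options as in the previous section):
inputDir=path/to/your/corpus/directory
outputDir=path/to/your/output/directory
inputFileFormat=TOKENIZED
stopwordFile=path/to/your/stopword-list.txt
wordByDocument=true
openingTag=<p>
closingTag=</p>
# ignored for word-by-document spaces:
#numberOfFeatureWords=30000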
Create a word space from parsed text
If you have a parsed corpus in CoNLL-U format, set inputFileFormat=CONNL. You should also set addInverseRelations=true. You do not have to define a co-occurrence context because the context is given by the syntactic dependency relations that hold between the words in a sentence. Therefore, leave leftContext, position, openingTag etc. blank.
All other parameters are the same as in section Create a standard DISCO word space with words as features. However, you cannot use findMultiTokenWords.
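A minimal sketch of the corresponding settings in disco.config (paths are placeholders):
inputDir=path/to/your/parsed/corpus
outputDir=path/to/your/output/directory
inputFileFormat=CONNL
addInverseRelations=true
stopwordFile=path/to/your/stopword-list.txt
# no co-occurrence window is needed for parsed input:
rightContext=
leftContext=
position=
openingTag=
closingTag=
findMultiTokenWords=false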
Options in disco.config explained
Options you need to set:
Option | possible values | description |
inputDir | file path | The input directory must contain all your input files (corpus) that are to be processed by DISCO Builder. DISCO Builder will try to process all files in the input directory, regardless of their file name extension. DISCO Builder will not descend into subdirectories! Note that all files must be of the same input file format. |
inputFileFormat | TOKENIZED, LEMMATIZED, CONNL | Format of the corpus files in the input directory: plain tokenized text, lemmatized text with three tab-separated columns per line, or dependency-parsed text in CoNLL-U format (see the Basics section above). |
boundaryMarks | comma-separated list of tags | A window context always stops at a boundary mark. E.g., if you have a window context of +-5 words, and an end-of-sentence marker </s> is set as boundary mark, then the last word of the sentence will have no right context. Boundary marks are ignored if the parameters openingTag and closingTag are set. Tags in your corpus files will only be recognized as boundary marks if nothing else is on the same line. This option is not to be confused with the options openingTag and closingTag, which define a context for word-by-document spaces. |
findMultiTokenWords | true, false | Automatically identify multi-token words in corpus and merge them into single tokens by connecting the tokens with an underscore (e.g.: New_York_City). The algorithm described in Mikolov et al. 2013 is applied, n-grams of size 2 and 3 are considered. The multi-token words found in the corpus are written to the file disco.phraseFreqs together with their corpus frequencies. This is not applicable with inputFileFormat=CONNL! This step is quite time and memory consuming! |
multiTokenWordDictionary | file path | Specify a dictionary with multi-token words to be used with findMultiTokenWords additionally to the phrases found automatically. The format of the dictionary is one multi-token word per line. The tokens can be separated by space or underscore. |
outputDir | path | directory where the word space directory DISCO-idx will be created and where other, temporary files are written to. |
stopwordFile | path | path to a stopword list that contains one stopword per line. Stopwords will be ignored. Note that the stopwords have to match the words in the input corpus exactly, there is no lowercasing. If you only supply the stopword and, but not And, then the occurrence in ... road. And there was no... will not be ignored. Moreover, if you use lemmata (lemmaFeatures=true or lemma=true), then your stopwords also have to include lemmata. |
lemma | true, false | This parameter decides what you will be able to look up in the final word space: inflected word forms or lemmata (base forms). If you set this parameter to true, only lemmata will be stored, but no inflected word forms. That means you will be able to look up speak but not speaks. Word space quality will be higher for lemma=true, because the data for all the word forms of each lemma will be combined, which leads to more reliable statistics. lemma=true is not allowed if inputFileFormat is TOKENIZED. |
rightContext | 0..n | Size of context word window to the right of the target word (in token). Default value is 3. |
leftContext | 0..n | Size of context word window to the left of the target word (in token). Default value is 3. |
position | true, false | If true, the position of the feature word in the context word window is added to the feature word to create the feature. For instance, if the feature word barks occurs directly after the target word dog, the feature will be barks_1. The effect of this parameter is a stricter context that leads to tighter similarities (comparable to syntactic dependency relations). This parameter is only relevant in conjunction with rightContext and leftContext. The default value is true. |
minFreq | 1..n | Here you can select the minimum number of times a token has to occur in the corpus in order to be indexed. Words with a smaller frequency will be completely ignored by DISCO Builder, i.e. they won't be present in the resulting word space and they will also not be used as feature words. Normally, the minimum frequency should be some number between 20 and 100. In order to minimize the size of the resulting word space and the computation time larger values can be selected, for example 200 or even 500. The absolute minimum number should be 2 to at least filter out hapax legomena (words occurring only once), which effectively halves the number of word types that will have to be indexed. |
lemmaFeatures | true, false | If your inputFileFormat is LEMMATIZED or CONNL, you should always set this parameter to true, in order to use the word's base forms (lemmata) as features. This increases word space quality. If you have TOKENIZED input text, you have to set this parameter to false. |
numberFeatureWords | 1..V | The maximum number of words to be used as features. If you set this to n then the n most frequent words will be used as feature words. Note that this is independent of additional relations you possibly have activated (like window position using position=true). For instance, if you have set numberFeatureWords=10000, position=true with a +-3 words context, then your word space will have (at most) 10,000 x 6 = 60,000 features. This is because a word like eat can occur in any of the 6 window positions relative to the target word, giving rise to 6 different features: eat_1, eat_2, eat_3, ... The same holds for syntactic dependency relations (eat_N:subjOf:V, eat_A:mod:V, ...). In general, the number of features is numberFeatureWords x numberOfRelations. However, this is the upper bound since not all feature words will occur with all relations. Normally, the value of numberFeatureWords should be in the range 10,000 - 100,000. If you create a word space that is intended for the look-up of collocations only (with option dontCompute2ndOrder=true) then you should use a higher value to include all words from the vocabulary in the set of possible collocates. |
weightingMethod | lin, loglikelihood, poisson, relative frequency | This determines which measure to use to compute the significance of a word's features. The standard measure for semantic similarity in DISCO is lin. If you want to retrieve collocations loglikelihood is a good choice. |
minWeight | float | The minimum significance value of a feature to be included in a word's word vector. A good choice for the lin measure is 0.1. For loglikelihood use a larger threshold, like 10.0. |
similarityMeasure | cosine, kolb | The method to compute the similarity between two word vectors. This option only applies if you build a word space of type SIM (dontCompute2ndOrder=false). |
dontCompute2ndOrder | true, false | Determines which word space type to build. See section on word space types. |
addInverseRelations | true, false | Only relevant if inputFileFormat=CONNL. If true, the inverse of a dependency relation is added as a feature. For instance, if the input file has the dependency relation horse SUBJ_OF gallop then gallop<SEP>SUBJ_OF is used as a feature to describe horse. If addInverseRelations=true then horse<SEP>SUBJ_OF_INV is added as a feature to describe gallop. This option should be set to true, unless the input files already contain the inverse relations. |
wordByDocument | true, false | If this parameter is true, then a word-by-document word space will be created. In this case, both parameters openingTag and closingTag have to be set to define the document context. |
openingTag, closingTag | String | These two parameters have to be set when wordByDocument=true to define the document context. When wordByDocument=false these two parameters override rightContext and leftContext. Parameter position is always regarded as false when openingTag and closingTag are set. |
existingCoocFile | file path | tbd... |
existingWeightFile | file path | tbd... |
allowedCharactersFeature | String | The characters that are allowed to occur in a word used as feature. The set of Unicode letters (\p{L}) is added to this automatically. If a word contains other characters it is ignored. |
maximumFeatureLength | 1..n | The maximum length of a word (in characters) to be used as feature. All words longer than this are ignored and not used as features. |
minimumFeatureLength | 1..n | The minimum length of a word (in characters) to be used as feature. |
maximumWordLength | 1..n | The maximum length of a word (in characters) to be indexed. All words longer than this are ignored, no word vectors will be built for them and they cannot be looked up in the final word space. |
allowedCharactersWord | String | Set of allowed characters in a word. The set of all Unicode letters (\p{L}) is added to this by DISCO Builder automatically. |
minimumWordLength | 1..n | Words shorter than this are ignored. |
Don't touch these: the following options have to be left blank because they will be filled by DISCO Builder:
tokencount
vocabularySize
maxFreq
stopwordList
discoVersion
Import word spaces from other tools
Import fastText, word2vec or GloVe vectors
DISCO Builder can convert vector files produced with fastText, word2vec or GloVe into a DISCO word space index that can be queried with the DISCO API.
java -Xmx<N> -cp DISCOBuilder-1.1.0-all.jar de.linguatools.disco.builder.Import
-in <vectorFile>
-out <outputDir>
-wsType COL|SIM
[-storageType DISCOLuceneIndex|DenseMatrix]
[-threads <N>]
[-wlfreq <wordFrequencyList> | -corpus <corpusFile>]
[-nBest <N>]
where <vectorFile> is the vector file (in text format) created by fastText, word2vec or GloVe and <outputDir> is the directory where the DISCO word space will be written.
Option -wsType specifies the DISCO word space type that will be created. The type COL only stores the word vectors, whereas the type SIM also computes the most similar words for each word and stores these lists of similar words, too.
With -wsType SIM you should specify the number of threads to run using the -threads option (-threads has no effect with -wsType COL).
With -wsType SIM you can also specify how many similar words to store for each word using the option -nBest. The default value is 300. This option is ignored with -wsType COL.
Option -storageType determines the way the word space will be stored. DISCOLuceneIndex uses a Lucene index and is only suited for high-dimensional sparse matrices that don't fit into a dense matrix in memory. For word embeddings you should use DenseMatrix, which is also the default.
Some methods of the DISCO API need corpus information like the frequency of the words. Since this information is not contained in the vector files, you can supply a <wordFrequencyList> with the option -wlfreq (format: one word and its frequency per line, separated by white space). Alternatively, you can supply the corpus file itself using the option -corpus.
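For orientation, an import of a fastText vector file into a SIM word space with storage type DenseMatrix might look like this (file names, heap size and thread count are only placeholders):
java -Xmx16g -cp DISCOBuilder-1.1.0-all.jar de.linguatools.disco.builder.Import -in cc.en.300.vec -out /data/wordspaces/en-fasttext-sim -wsType SIM -storageType DenseMatrix -threads 4 -nBest 300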
See the README page of the DISCO API GitHub repository for an example of how to import a FastText vector file.
Convert storage types and word space types
Convert DISCOLuceneIndex to DenseMatrix
You can convert a word space of storage type DISCOLuceneIndex into an equivalent word space with storage type DenseMatrix using the following command:
java -Xmx<N> -cp DISCOBuilder-1.1.0-all.jar de.linguatools.disco.builder.DenseMatrixFactory
<DISCOLuceneIndex> <OutputDenseMatrix> <NumberOfSimilarWords>
where <DISCOLuceneIndex> is the input word space. It must be of storage type DISCOLuceneIndex and compatible with DISCO API version 2.x or 3.x. <OutputDenseMatrix> is the resulting DenseMatrix file (compatible with DISCO API version 3.x). <NumberOfSimilarWords> is the number of similar words to store in the output DenseMatrix word space. Allowed values are 0 .. numberOfSimilarWords in the disco.config file of the input DISCOLuceneIndex; 0 will produce a word space of type COL. Note that you cannot create a SIM word space from a COL word space. In case the property numberOfSimilarWords is missing from the disco.config file, run the above command with some high value for numberOfSimilarWords (e.g. 1000). DISCO Builder will then tell you the numberOfSimilarWords you should use. (If you get an out-of-memory exception, use a lower value or increase heap space <N>.)
Important: unless you have a terabyte of memory this only works with low-dimensional word embeddings that have been imported from word2vec or fastText! High-dimensional distributional count vectors (those that are created by DISCOBuilder from a text corpus) are too large to be stored in a dense matrix.
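A possible invocation (directory and file names are placeholders):
java -Xmx12g -cp DISCOBuilder-1.1.0-all.jar de.linguatools.disco.builder.DenseMatrixFactory /data/wordspaces/en-fasttext-lucene /data/wordspaces/en-fasttext.denseMatrix 300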
Convert DenseMatrix of type COL into SIM
Compute nBest most similar words for each word and store them in DenseMatrix. This turns a DenseMatrix of word space type COL into one of type SIM.
java -Xmx<N> -cp DISCOBuilder-1.1.0-all.jar de.linguatools.disco.builder.DenseMatrixBuilder
-in <denseMatrixFile>
-out <denseMatrixOutputFile>
[-threads <N>] (default=1)
[-simMeasure <DISCO.SimilarityMeasure>] (default=COSINE)
[-nBest <N>] (default=300)
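For example, to turn a COL DenseMatrix into a SIM DenseMatrix with the 300 most similar words per word, computed with 4 threads and the COSINE measure (file names and values are placeholders):
java -Xmx12g -cp DISCOBuilder-1.1.0-all.jar de.linguatools.disco.builder.DenseMatrixBuilder -in /data/wordspaces/en-fasttext-col.denseMatrix -out /data/wordspaces/en-fasttext-sim.denseMatrix -threads 4 -simMeasure COSINE -nBest 300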
Evaluation of your word spaces
DISCO Builder contains a method for evaluating a word space against word pairs with gold standard similarity values in a CSV file. The CSV file should have the format
word1,word2,similarity
The first line of the CSV file is regarded as header and is ignored. To start the evaluation type:
java -Xmx<N> -cp DISCOBuilder-1.1.0-all.jar de.linguatools.disco.builder.Evaluate <csvFile> <wordSpaceDir> <DISCO.SimilarityMeasure> <separator>
with <N> being enough memory to hold the word space, and <separator> the character used as separator in the csvFile.
The method will compute the Spearman rank correlation coefficient between the gold standard similarities and
the similarities computed by DISCO.
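As an illustration, a gold standard file wordsim.csv with comma as separator could look like this (the word pairs and similarity values are only examples):
word1,word2,similarity
tiger,cat,7.35
book,paper,7.46
and the corresponding evaluation call would be (file names and heap size are placeholders):
java -Xmx8g -cp DISCOBuilder-1.1.0-all.jar de.linguatools.disco.builder.Evaluate wordsim.csv /data/wordspaces/en-wordspace COSINE ,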
You can find evaluation data for several languages here:
- Manaal Faruqui's link list
- Multilingual WS353
- Monolingual and Cross-lingual Word Similarity Datasets
Share your word spaces
If you have built a word space that you would like to share with others, drop us a note. We will be happy to link your word space on the DISCO download page.