A common evaluation practice in the vector space models (VSMs) literature is to measure the models' ability to predict human judgments about lexical semantic relations between word pairs. Most existing evaluation sets, however, consist of scores collected for English word pairs only, ignoring the potential impact of the judgment language in which word pairs are presented on the human scores.
In this paper we translate two prominent evaluation sets, WordSim353 (association) and SimLex999 (similarity), from English to Italian, German and Russian and collect scores for each dataset from crowdworkers fluent in its language. Our analysis reveals that human judgments are strongly impacted by the judgment language. Moreover, we show that the predictions of monolingual VSMs do not necessarily best correlate with human judgments made in the language used for model training, suggesting that models and humans are affected differently by the language they use when making semantic judgments. Finally, we show that in a large number of setups, multilingual VSM combination results in improved correlations with human judgments, suggesting that multilingualism may partially compensate for the judgment language effect on human judgments.
The Multilingual WS353 and Multilingual SimLex999 resources consist of translations of the WS353 (word association) and SimLex999 (word similarity) datasets, respectively, into three languages: German, Italian and Russian. Each of the translated datasets is scored by 13 human judges (crowdworkers), all fluent speakers of its language. For consistency, we also collected human judgments for the original English corpus according to the same protocol applied to the other languages.
These datasets make it possible to explore the impact of the "judgment language" (the language in which word pairs are presented to the human judges) on the resulting similarity scores, and to evaluate vector space models in a truly multilingual setup (i.e. when both the training and the test data are multilingual).
These datasets can be downloaded in two formats: TXT and XLSX. The TXT format is a CSV-based format; if you prefer working with CSV files, change the suffix of the files from ".txt" to ".csv".
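Because the TXT files are CSV-formatted, they can be read with a standard CSV parser without renaming them. The following is a minimal sketch only: the column layout (two words followed by a numeric score), the header row, and the placeholder rows are assumptions made for illustration, not the documented layout of the released files, so check an actual file and adjust the column indices.

```python
import csv
import io


def load_pairs(handle):
    """Read (word1, word2, score) rows from a CSV-formatted file handle.

    Assumed layout (hypothetical): first two columns are the words,
    third column is a numeric score. Rows whose third column is not a
    number (for example a header row) are skipped.
    """
    pairs = []
    for row in csv.reader(handle):
        if len(row) < 3:
            continue  # blank or malformed line
        try:
            score = float(row[2])
        except ValueError:
            continue  # header or non-numeric row
        pairs.append((row[0].strip(), row[1].strip(), score))
    return pairs


# Placeholder rows, invented for illustration; not taken from the datasets.
sample = io.StringIO("word1,word2,score\nold,new,1.5\nsmart,intelligent,9.0\n")
pairs = load_pairs(sample)
print(pairs)
```

For a downloaded file, the same function can be called on `open(path, encoding="utf-8", newline="")`, where `path` is whatever local name the file was saved under; the UTF-8 encoding is also an assumption worth verifying for the Russian data.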
Tool for Evaluating Multilingual SimLex999 and WordSim353: GITHUB
The translation and annotation process, as well as related research on the impact of judgment language, are described in the following paper. Please cite it if you use this data:
Separated by an Un-common Language: Towards Judgment Language Informed Vector Space Modeling. 2015. Ira Leviant, Roi Reichart. Preprint published on arXiv. arXiv:1508.00106.
*An earlier version of the paper appears with the name "Judgment Language Matters: Towards Judgment Language Informed Vector Space Modeling". Please cite the current version with the name "Separated by an Un-common Language: Towards Judgment Language Informed Vector Space Modeling".
Contact Ira Leviant (firstname.lastname@example.org) if you have any questions.
The well-known Skipgram (Word2Vec) model trained on Wikipedia corpora achieves Spearman correlations of (0.266, 0.354, 0.308, 0.260) with SimLex-999 human similarity scores in the corresponding languages (EN, DE, IT, RU).
The well-known Skipgram (Word2Vec) model trained on Wikipedia corpora achieves Spearman correlations of (0.652, 0.618, 0.614, 0.585) with WordSim353 human scores in the corresponding languages (EN, DE, IT, RU).
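For readers who want to compute a comparable number for their own model, here is a rough sketch of a Spearman-correlation evaluation in plain Python. It is not the evaluation tool linked above. It assumes the model score for a pair is the cosine similarity of the two word vectors and that pairs with an out-of-vocabulary word are skipped; both are common choices rather than details confirmed by this page. The names `evaluate`, `vectors`, and `pairs` are hypothetical, and the toy vectors and scores are invented.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2.0 + 1.0  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def evaluate(pairs, vectors):
    """Correlate human scores with model cosine similarities.

    Pairs containing a word missing from `vectors` are skipped (an assumption).
    """
    human, model = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:
            human.append(score)
            model.append(cosine(vectors[w1], vectors[w2]))
    return spearman(human, model)


# Toy data, invented for illustration only.
vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
pairs = [("a", "b", 9.0), ("a", "d", 5.0), ("a", "c", 1.0), ("a", "zzz", 4.0)]
rho = evaluate(pairs, vectors)
print(round(rho, 3))
```

How out-of-vocabulary pairs are handled (skipped, as here, or assigned a default score) can change the reported correlation, so it is worth stating the choice when comparing against the figures above.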
Please email email@example.com if you know of work outperforming these state-of-the-art scores.