# Evaluate Word Representations

Word vectors are often evaluated on benchmarks such as <a href="https://jlu.myweb.cs.uwindsor.ca/8380/WS353-Sim.txt">WS353-Sim.txt</a>. The data is a list of word pairs, each with a similarity score. For example, the first few lines of the file look like this:

```
$ more WS353-Sim.txt
tiger   cat     7.35
tiger   tiger   10.00
plane   car     5.77
train   car     6.31
```

The performance of vector representations is often evaluated using the correlation between the ranking implied by these scores and the ranking produced by the model. One example is the Spearman correlation.
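To make the rank comparison concrete, here is a minimal sketch of Spearman correlation in plain Python: rank both lists, then take the Pearson correlation of the ranks. The helper names (`average_ranks`, `pearson`, `spearman`) are illustrative and not part of any library. The five score pairs are the first five rows that the model in this notebook prints further down.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


# Benchmark scores and model similarities for:
# plane/car, train/car, television/radio, media/radio, wood/forest
human = [5.77, 6.31, 6.77, 7.42, 7.73]
model_scores = [0.845894, 0.8253301, 0.8854684, 0.78267133, 0.9491677]

rho = spearman(human, model_scores)
print(round(rho, 3))  # 0.3
```

In practice `scipy.stats.spearmanr` performs the same computation and also returns a p-value; the hand-written version is only meant to show what is being measured.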


In [93]:
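from gensim.models import Word2Vec

# Tokenize each line into lowercase, space-separated words.
with open('../data/reuters2.txt', 'r') as txtfile:
    sentences = [line.lower().strip().split(' ') for line in txtfile]
# Skip-gram model (sg=1). This is the gensim 3.x API; in gensim 4+, `iter` is `epochs`.
model = Word2Vec(sentences, min_count=2, sg=1, iter=5)
# In gensim 4+, use model.wv.key_to_index instead of model.wv.vocab.
words = list(model.wv.vocab)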


In [94]:
test='news'
print('words similar to \''+ test + '\':\t'+ str(model.wv.most_similar(test)))


words similar to 'news':	[('conference', 0.9213701486587524), ('press', 0.8843624591827393), ('reuters', 0.8795915246009827), ('denied', 0.8423165082931519), ('journal', 0.8308773040771484), ('bil', 0.8253540992736816), ('newspaper', 0.8204785585403442), ('richard', 0.8181889057159424), ('reporters', 0.8134483098983765), ('danforth', 0.8100142478942871)]


Read the test data 

In [95]:
data = []
with open("../data/ws/WS353-Sim.txt") as f:
    for line in f:
        x, y, sim = line.strip().lower().split()
        data.append(((x, y), sim))   
data

[(('tiger', 'cat'), '7.35'),
 (('tiger', 'tiger'), '10.00'),
 (('plane', 'car'), '5.77'),
 (('train', 'car'), '6.31'),
 (('television', 'radio'), '6.77'),
 (('media', 'radio'), '7.42'),
 (('bread', 'butter'), '6.19'),
 (('cucumber', 'potato'), '5.92'),
 (('doctor', 'nurse'), '7.00'),
 (('professor', 'doctor'), '6.62'),
 (('student', 'professor'), '6.81'),
 (('smart', 'stupid'), '5.81'),
 (('wood', 'forest'), '7.73'),
 (('money', 'cash'), '9.15'),
 (('king', 'queen'), '8.58'),
 (('king', 'rook'), '5.92'),
 (('bishop', 'rabbi'), '6.69'),
 (('fuck', 'sex'), '9.44'),
 (('football', 'soccer'), '9.03'),
 (('football', 'basketball'), '6.81'),
 (('football', 'tennis'), '6.63'),
 (('arafat', 'jackson'), '2.50'),
 (('physics', 'chemistry'), '7.35'),
 (('vodka', 'gin'), '8.46'),
 (('vodka', 'brandy'), '8.13'),
 (('drink', 'eat'), '6.87'),
 (('car', 'automobile'), '8.94'),
 (('gem', 'jewel'), '8.96'),
 (('journey', 'voyage'), '9.29'),
 (('boy', 'lad'), '8.83'),
 (('coast', 'shore'), '9.10'),
 (('a
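The cell above keeps each similarity as a string and converts it later. As a hedged alternative, the sketch below converts to `float` while reading and skips blank or malformed lines. The function name `read_similarity_pairs` is hypothetical, and an in-memory sample built from the first four lines of the file stands in for the real path.

```python
import io


def read_similarity_pairs(lines):
    """Parse 'word1 word2 score' lines into ((word1, word2), score) tuples."""
    pairs = []
    for line in lines:
        fields = line.strip().lower().split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        x, y, sim = fields
        pairs.append(((x, y), float(sim)))
    return pairs


# In-memory stand-in for the file; the blank line is there to show it is skipped.
sample = io.StringIO(
    "tiger\tcat\t7.35\n"
    "tiger\ttiger\t10.00\n"
    "\n"
    "plane\tcar\t5.77\n"
    "train\tcar\t6.31\n"
)
pairs = read_similarity_pairs(sample)
print(pairs[0])  # (('tiger', 'cat'), 7.35)
```

Because the function accepts any iterable of lines, an open file handle can be passed in the same way as the `StringIO` object.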

In [96]:

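import numpy as np
from scipy.stats import spearmanr

results = []
for (x, y), sim in data:
    # Score only pairs whose words are both in the model's vocabulary.
    if x in words and y in words:
        s = model.wv.similarity(x, y)  # cosine similarity of the two word vectors
        results.append((s, sim))
        print(x + "\t" + y + "\t" + str(s) + "\t" + str(sim))

# Separate model scores from benchmark scores (read as strings) and convert to floats.
actual, expected = zip(*results)
actual = np.array(actual, dtype=float)
expected = np.array(expected, dtype=float)

# Spearman rank correlation between model and benchmark similarities.
cor = spearmanr(actual, expected)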


plane	car	0.845894	5.77
train	car	0.8253301	6.31
television	radio	0.8854684	6.77
media	radio	0.78267133	7.42
wood	forest	0.9491677	7.73
money	cash	0.54670787	9.15
vodka	gin	0.9932268	8.46
car	automobile	0.86125314	8.94
coast	shore	0.79238176	9.10
food	fruit	0.70788664	7.52
money	dollar	0.812971	8.42
money	currency	0.83917785	9.04
skin	eye	0.9569119	6.22
japanese	american	0.32299042	6.50
century	year	0.40324146	7.59
announcement	news	0.735155	7.56
harvard	yale	0.9207133	8.13
travel	activity	0.7749698	5.00
type	kind	0.9445182	8.97
street	place	0.56938833	6.44
street	block	0.67979705	6.88
cell	phone	0.8618718	7.81
dividend	payment	0.7390964	7.63
profit	loss	0.8865849	7.63
dollar	yen	0.83618474	7.78
phone	equipment	0.5708454	7.13
liquid	water	0.81790745	7.89
marathon	sprint	0.89464813	7.47
man	governor	0.7621277	5.25
mexico	brazil	0.7504493	7.44
glass	metal	0.842183	5.56
aluminum	metal	0.9597205	7.83
journal	association	0.38928396	4.97
street	children	0.5339896	4.94
car	flight	0.7774404	4.

In [97]:
print(cor)

SpearmanrResult(correlation=0.36797308936169465, pvalue=0.0009180986798817254)
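Note that the correlation is computed only over pairs whose words are both in the model's vocabulary; the `tiger` pairs, for instance, do not appear in the printed results. A minimal sketch of how that coverage could be reported alongside the correlation follows. The function name `split_by_coverage` is hypothetical, and the small `vocab` set is a stand-in for the model's real vocabulary.

```python
def split_by_coverage(pairs, vocab):
    """Split benchmark pairs into those the model can score and those it cannot."""
    covered, skipped = [], []
    for (x, y), sim in pairs:
        if x in vocab and y in vocab:
            covered.append(((x, y), sim))
        else:
            skipped.append(((x, y), sim))
    return covered, skipped


# First four benchmark pairs from the file.
pairs = [
    (("tiger", "cat"), 7.35),
    (("tiger", "tiger"), 10.00),
    (("plane", "car"), 5.77),
    (("train", "car"), 6.31),
]
# Assumption: a tiny illustrative vocabulary, not the model's actual one.
vocab = {"plane", "car", "train"}

covered, skipped = split_by_coverage(pairs, vocab)
coverage = len(covered) / len(pairs)
print(f"covered {len(covered)} of {len(pairs)} pairs ({coverage:.0%})")
# covered 2 of 4 pairs (50%)
```

Using a `set` for the vocabulary keeps each membership test constant-time, which matters more as the vocabulary grows.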


## Other Evaluations

There are about 10 other evaluation data sets. Some focus on rare words, some on verbs. Which embedding method does each data set favor?