Word2Vec
The most intuitive way to understand word embeddings is that each word is represented as a dense, real-valued vector, so that semantically similar words end up close to each other in the vector space. This kind of distributed representation is the foundation for applying deep learning to text.

Quick Start
Python
I recommend the word2vec Python package, a thin wrapper around Google's original C implementation.
- Installation
Install it with pip install word2vec, then import it with import word2vec.
- Text file preprocessing
word2vec.word2phrase('/Users/drodriguez/Downloads/text8', '/Users/drodriguez/Downloads/text8-phrases', verbose=True)
[u'word2phrase', u'-train', u'/Users/drodriguez/Downloads/text8', u'-output', u'/Users/drodriguez/Downloads/text8-phrases', u'-min-count', u'5', u'-threshold', u'100', u'-debug', u'2']
Starting training using file /Users/drodriguez/Downloads/text8
Words processed: 17000K Vocab size: 4399K
Vocab size (unigrams + bigrams): 2419827
Words in train file: 17005206
Chinese Corpus Experiment
- Corpus
First prepare the data: use the full-web news dataset (SogouCA) recommended in several blog posts, about 2.1 GB. Download the package SogouCA.tar.gz from the FTP server:
wget ftp://ftp.labs.sogou.com/Data/SogouCA/SogouCA.tar.gz --ftp-user=hebin_hit@foxmail.com --ftp-password=4FqLSYdNcrDXvNDi -r
Unpack the archive:
gzip -d SogouCA.tar.gz
tar -xvf SogouCA.tar
Then merge the extracted txt files into SogouCA.txt, keep only the lines containing <content>, and convert the encoding to obtain the corpus corpus.txt, about 2.7 GB.
cat *.txt > SogouCA.txt
cat SogouCA.txt | iconv -f gbk -t utf-8 -c | grep "<content>" > corpus.txt
- Word segmentation
Segment corpus.txt with ANSJ to obtain the segmented file resultbig.txt, about 3.1 GB. In the segmentation tool's seg_tool directory, compile and then run it to produce resultbig.txt, which contains 426,221 distinct words and 572,308,385 tokens in total.
- Word vector training
nohup ./word2vec -train resultbig.txt -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1 &
- Analysis
./distance vectors.bin
./distance can be thought of as computing the distance between words: each word is a point in the vector space, and distance measures how close two points are.
After modifying demo-analogy.sh we get examples such as the following:
The capital of France is Paris and the capital of Britain is London, so vector("法国") - vector("巴黎") + vector("英国") --> vector("伦敦").
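As an illustration, here is a minimal numpy sketch of what ./distance and the analogy demo compute: cosine similarity over normalized word vectors. The toy vectors below are made up purely for illustration, not taken from a trained model.

```python
import numpy as np

# Toy 4-dimensional embeddings, invented for illustration only.
vectors = {
    "法国": np.array([0.9, 0.1, 0.3, 0.0]),
    "巴黎": np.array([0.8, 0.2, 0.9, 0.1]),
    "英国": np.array([0.9, 0.1, 0.1, 0.5]),
    "伦敦": np.array([0.8, 0.2, 0.7, 0.6]),
}

def normalize(v):
    return v / np.linalg.norm(v)

def cosine(a, b):
    # What ./distance reports: cosine similarity between two word vectors.
    return float(np.dot(normalize(a), normalize(b)))

# Analogy: vector("法国") - vector("巴黎") + vector("英国") ≈ vector("伦敦")
query = normalize(vectors["法国"]) - normalize(vectors["巴黎"]) + normalize(vectors["英国"])
best = max((w for w in vectors if w not in ("法国", "巴黎", "英国")),
           key=lambda w: cosine(query, vectors[w]))
print(best)  # -> 伦敦 (on this toy data)
```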
Cluster the words of the segmented corpus resultbig.txt and sort them by cluster:
nohup ./word2vec -train resultbig.txt -output classes.txt -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -classes 500 &
sort classes.txt -k 2 -n > classes_sorted_sogouca.txt
First derive sogouca_phrase.txt, a file containing both words and phrases, from the segmented corpus resultbig.txt, then train vector representations for the words and phrases in that file.
./word2phrase -train resultbig.txt -output sogouca_phrase.txt -threshold 500 -debug 2
./word2vec -train sogouca_phrase.txt -output vectors_sogouca_phrase.bin -cbow 0 -size 300 -window 10 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1
Wikipedia Experiment
Algorithms

CBOW

The operation from the input layer to the hidden layer is simply the summation of the context word vectors, which are then averaged over the number of context words; a sketch of this step follows.
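A minimal numpy sketch of this step, assuming syn0 is the input embedding matrix and the context words are given as vocabulary indices; the names here are illustrative, not the original C code.

```python
import numpy as np

vocab_size, dim = 10000, 200
# Input embedding matrix, initialized the way word2vec does (small uniform values).
syn0 = np.random.uniform(-0.5 / dim, 0.5 / dim, (vocab_size, dim))

def cbow_hidden(context_ids):
    """Input layer -> hidden layer in CBOW: sum the context vectors, then average."""
    neu1 = np.zeros(dim)
    for wid in context_ids:
        neu1 += syn0[wid]
    return neu1 / len(context_ids)

# Example: the words surrounding the centre word, as vocabulary indices.
hidden = cbow_hidden([12, 45, 7, 901])
```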

Skip-Gram

$$ \frac{1}{T}\sum_{t=1}^{T}\sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t) $$
The basic Skip-gram formulation defines $p(w_{t+j} \mid w_t)$ with the softmax function:
$$ p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)} $$
It is easy to see from this formula that computing $\nabla \log p(w_O \mid w_I)$ has a cost proportional to the vocabulary size $W$, which is often large ($10^5$ to $10^7$ words), so the full softmax is impractical for training.
$$ \frac{1}{T}\sum_{t=1}^{T}\sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t) = \frac{1}{T}\sum_{t=1}^{T}\sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t} \mid w_{t+j}) $$
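To make the cost concrete, here is a minimal numpy sketch of the full softmax above; the matrix names v_in and v_out are assumptions. Every single probability requires $W$ dot products against the output vectors, which is the term hierarchical softmax and negative sampling are designed to avoid.

```python
import numpy as np

W, d = 50000, 100                      # vocabulary size, embedding dimension
v_in = np.random.randn(W, d) * 0.01    # input vectors  v_w
v_out = np.random.randn(W, d) * 0.01   # output vectors v'_w

def softmax_prob(w_o, w_i):
    """p(w_O | w_I) with the full softmax: the denominator runs over all W words."""
    scores = v_out @ v_in[w_i]   # W dot products -> the expensive part
    scores -= scores.max()       # numerical stability
    exp = np.exp(scores)
    return exp[w_o] / exp.sum()

p = softmax_prob(w_o=42, w_i=7)
```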
Meanwhile, compared with the full softmax, hierarchical softmax only evaluates about $\log_2(W)$ inner nodes along a binary (Huffman) tree path for each word, and defines the probability as:
$$ p(w \mid w_I) = \prod_{j=1}^{L(w)-1} \sigma\!\left( [\![ n(w,j+1) = \mathrm{ch}(n(w,j)) ]\!] \cdot {v'_{n(w,j)}}^{\top} v_{w_I} \right) $$
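A minimal sketch of how this product is evaluated along a word's path in the tree; the path vectors and the ±1 code below are made-up stand-ins for the Huffman tree built over the vocabulary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_prob(v_wI, path_node_vecs, code):
    """p(w | w_I) = product over inner nodes of sigma(+/- v'_n . v_wI).

    path_node_vecs: vectors v'_{n(w,j)} of the inner nodes on the path root -> w
    code: +1 where the path goes to ch(n(w,j)), otherwise -1
    """
    p = 1.0
    for v_node, sign in zip(path_node_vecs, code):
        p *= sigmoid(sign * np.dot(v_node, v_wI))
    return p

d = 100
v_wI = np.random.randn(d) * 0.01
path = [np.random.randn(d) * 0.01 for _ in range(4)]  # a path of length 4 in the tree
code = [+1, -1, -1, +1]                               # assumed code for this word
print(hs_prob(v_wI, path, code))
```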
Tricks
Learning Phrases
Some words frequently occur together, and we treat such pairs as phrases. How do we measure this? With the following score:
$\mathrm{score}(w_i, w_j) = \frac{\mathrm{count}(w_i w_j) - \delta}{\mathrm{count}(w_i) \cdot \mathrm{count}(w_j)}$
Given two words, if the computed score exceeds a chosen threshold the pair is merged into a single phrase token; $\delta$ is a discounting coefficient that prevents phrases being formed from very infrequent words. A sketch of this scoring follows.
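A minimal sketch of this scoring over a toy token list; the delta value and the toy text are arbitrary choices for illustration. (The reference word2phrase tool additionally scales the score by the corpus size, which only rescales the threshold.)

```python
from collections import Counter

def phrase_scores(tokens, delta=5):
    """score(w_i, w_j) = (count(w_i w_j) - delta) / (count(w_i) * count(w_j))"""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        (a, b): (c - delta) / (unigrams[a] * unigrams[b])
        for (a, b), c in bigrams.items()
    }

tokens = "new york is big new york is busy new york never sleeps".split()
scores = phrase_scores(tokens, delta=1)
# Pairs whose score exceeds a chosen threshold get joined, e.g. new_york.
print(max(scores, key=scores.get))  # -> ('new', 'york') on this toy text
```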
Implementation
3. Negative Sampling
Parameter | Description | Notes
---|---|---
-size | Vector dimensionality | Higher dimensions are generally better, but not always
-window | Context window size | Typically around 10 for Skip-gram and around 5 for CBOW
-sample | Subsampling threshold for frequent words | Can improve both accuracy and speed on large datasets
-hs | Whether to use hierarchical softmax | Hierarchical softmax works better for infrequent words
-negative | Number of negative samples |
-min-count | Minimum frequency below which words are discarded |
-alpha | Initial learning rate |
-cbow | Use the CBOW model (1) instead of Skip-gram (0) |
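For reference, the same flags can be driven from Python by calling the compiled binary; this is only an illustrative sketch (the output name vectors_cbow.bin and the parameter values are arbitrary choices, not recommendations from the notes above).

```python
import subprocess

# The CLI flags map one-to-one to the table above.
subprocess.run([
    "./word2vec",
    "-train", "resultbig.txt",
    "-output", "vectors_cbow.bin",
    "-cbow", "1",        # CBOW instead of Skip-gram
    "-size", "200",      # vector dimensionality
    "-window", "5",      # context window
    "-sample", "1e-3",   # subsampling of frequent words
    "-hs", "0",          # no hierarchical softmax ...
    "-negative", "5",    # ... use 5 negative samples instead
    "-min-count", "5",   # discard words seen fewer than 5 times
    "-alpha", "0.025",   # initial learning rate
    "-threads", "12",
    "-binary", "1",
], check=True)
```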
Deeplearning4j
Python
%load_ext autoreload
%autoreload 2
word2vec
This notebook is equivalent to demo-word.sh, demo-analogy.sh, demo-phrases.sh and demo-classes.sh from Google.
Training
Download some data, for example: http://mattmahoney.net/dc/text8.zip
import word2vec
Run word2phrase to group similar words, for example "Los Angeles" to "Los_Angeles".
word2vec.word2phrase('/Users/drodriguez/Downloads/text8', '/Users/drodriguez/Downloads/text8-phrases', verbose=True)
[u'word2phrase', u'-train', u'/Users/drodriguez/Downloads/text8', u'-output', u'/Users/drodriguez/Downloads/text8-phrases', u'-min-count', u'5', u'-threshold', u'100', u'-debug', u'2']
Starting training using file /Users/drodriguez/Downloads/text8
Words processed: 17000K Vocab size: 4399K
Vocab size (unigrams + bigrams): 2419827
Words in train file: 17005206
This will create a text8-phrases file that we can use as better input for word2vec. Note that you could easily skip this previous step and use the original data as input for word2vec.
Train the model using the word2phrase output.
word2vec.word2vec('/Users/drodriguez/Downloads/text8-phrases', '/Users/drodriguez/Downloads/text8.bin', size=100, verbose=True)
Starting training using file /Users/drodriguez/Downloads/text8-phrases
Vocab size: 98331
Words in train file: 15857306
Alpha: 0.000002 Progress: 100.03% Words/thread/sec: 286.52k
That generated a text8.bin file containing the word vectors in a binary format.
Do the clustering of the vectors based on the trained model.
word2vec.word2clusters('/Users/drodriguez/Downloads/text8', '/Users/drodriguez/Downloads/text8-clusters.txt', 100, verbose=True)
Starting training using file /Users/drodriguez/Downloads/text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000002 Progress: 100.02% Words/thread/sec: 287.55k
That created a text8-clusters.txt file with the cluster for every word in the vocabulary.
Predictions
import word2vec
Import the word2vec binary file created above.
model = word2vec.load('/Users/drodriguez/Downloads/text8.bin')
We can take a look at the vocabulary as a numpy array
model.vocab
array([u'</s>', u'the', u'of', ..., u'dakotas', u'nias', u'burlesques'],
dtype='<U78')
Or take a look at the whole matrix
model.vectors.shape
(98331, 100)
model.vectors
array([[ 0.14333282, 0.15825513, -0.13715845, ..., 0.05456942,
0.10955409, 0.00693387],
[ 0.1220774, 0.04939618, 0.09545057, ..., -0.00804222,
-0.05441621, -0.10076696],
[ 0.16844609, 0.03734054, 0.22085373, ..., 0.05854521,
0.04685341, 0.02546694],
...,
[-0.06760896, 0.03737842, 0.09344187, ..., 0.14559349,
-0.11704484, -0.05246212],
[ 0.02228479, -0.07340827, 0.15247506, ..., 0.01872172,
-0.18154132, -0.06813737],
[ 0.02778879, -0.06457976, 0.07102411, ..., -0.00270281,
-0.0471223, -0.135444 ]])
We can retrieve the vector of individual words
model['dog'].shape
(100,)
model['dog'][:10]
array([ 0.05753701, 0.0585594, 0.11341395, 0.02016246, 0.11514406,
0.01246986, 0.00801256, 0.17529851, 0.02899276, 0.0203866 ])
We can do simple queries to retrieve words similar to "socks" based on cosine similarity:
indexes, metrics = model.cosine('socks')
indexes, metrics
(array([20002, 28915, 30711, 33874, 27482, 14631, 22992, 24195, 25857, 23705]),
array([ 0.8375354, 0.83590846, 0.82818749, 0.82533614, 0.82278399,
0.81476386, 0.8139092, 0.81253798, 0.8105933, 0.80850171]))
This returned a tuple with 2 items:
- numpy array with the indexes of the similar words in the vocabulary
- numpy array with cosine similarity to each word
It's possible to get the words for those indexes
model.vocab[indexes]
array([u'hairy', u'pumpkin', u'gravy', u'nosed', u'plum', u'winged',
u'bock', u'petals', u'biscuits', u'striped'],
dtype='<U78')
There is a helper function to create a combined response: a numpy record array
model.generate_response(indexes, metrics)
rec.array([(u'hairy', 0.8375353970603848), (u'pumpkin', 0.8359084628493809),
(u'gravy', 0.8281874915608026), (u'nosed', 0.8253361379785071),
(u'plum', 0.8227839904046932), (u'winged', 0.8147638561412592),
(u'bock', 0.8139092031538545), (u'petals', 0.8125379796045767),
(u'biscuits', 0.8105933044655644), (u'striped', 0.8085017054444408)],
dtype=[(u'word', '<U78'), (u'metric', '<f8')])
It's easy to make that numpy array a pure Python response:
model.generate_response(indexes, metrics).tolist()
[(u'hairy', 0.8375353970603848),
(u'pumpkin', 0.8359084628493809),
(u'gravy', 0.8281874915608026),
(u'nosed', 0.8253361379785071),
(u'plum', 0.8227839904046932),
(u'winged', 0.8147638561412592),
(u'bock', 0.8139092031538545),
(u'petals', 0.8125379796045767),
(u'biscuits', 0.8105933044655644),
(u'striped', 0.8085017054444408)]
Phrases
Since we trained the model with the output of word2phrase, we can ask for the similarity of "phrases":
indexes, metrics = model.cosine('los_angeles')
model.generate_response(indexes, metrics).tolist()
[(u'san_francisco', 0.886558000570455),
(u'san_diego', 0.8731961018831669),
(u'seattle', 0.8455603712285231),
(u'las_vegas', 0.8407843553947962),
(u'miami', 0.8341796009062884),
(u'detroit', 0.8235412519780195),
(u'cincinnati', 0.8199138493085706),
(u'st_louis', 0.8160655356728751),
(u'chicago', 0.8156786240847214),
(u'california', 0.8154244925085712)]
Analogies
It's possible to do more complex queries like analogies, such as: king - man + woman = queen
This method returns the same as cosine: the indexes of the words in the vocab and the metric.
indexes, metrics = model.analogy(pos=['king', 'woman'], neg=['man'], n=10)
indexes, metrics
(array([1087, 1145, 7523, 3141, 6768, 1335, 8419, 1826, 648, 1426]),
array([ 0.2917969, 0.27353295, 0.26877692, 0.26596514, 0.26487509,
0.26428581, 0.26315492, 0.26261258, 0.26136635, 0.26099078]))
model.generate_response(indexes, metrics).tolist()
[(u'queen', 0.2917968955611075),
(u'prince', 0.27353295205311695),
(u'empress', 0.2687769174818083),
(u'monarch', 0.2659651399832089),
(u'regent', 0.26487508713026797),
(u'wife', 0.2642858109968327),
(u'aragon', 0.2631549214361766),
(u'throne', 0.26261257728511833),
(u'emperor', 0.2613663460665488),
(u'bishop', 0.26099078142148696)]
Clusters
clusters = word2vec.load_clusters('/Users/drodriguez/Downloads/text8-clusters.txt')
We can get the cluster number for individual words
clusters['dog']
11
We can get all the words grouped in a specific cluster
clusters.get_words_on_cluster(90).shape
(221,)
clusters.get_words_on_cluster(90)[:10]
array(['along', 'together', 'associated', 'relationship', 'deal',
'combined', 'contact', 'connection', 'bond', 'respect'], dtype=object)
We can add the clusters to the word2vec model and generate a response that includes the clusters
model.clusters = clusters
indexes, metrics = model.analogy(pos=['paris', 'germany'], neg=['france'], n=10)
model.generate_response(indexes, metrics).tolist()
[(u'berlin', 0.32333651414395953, 20),
(u'munich', 0.28851564633559, 20),
(u'vienna', 0.2768927258877336, 12),
(u'leipzig', 0.2690537010929304, 91),
(u'moscow', 0.26531859560322785, 74),
(u'st_petersburg', 0.259534503067277, 61),
(u'prague', 0.25000637367753303, 72),
(u'dresden', 0.2495974800117785, 71),
(u'bonn', 0.24403155303236473, 8),
(u'frankfurt', 0.24199720792200027, 31)]