A Simple Tutorial: Processing an English Corpus with word2vec

Published: 2023-04-27

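word2vec is an open-source text-processing tool from Google that turns words into vectors, which can then serve as input to a neural network. See the official word2vec site.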

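The site provides an English corpus, text8, for us to learn with. Download: http://mattmahoney.net/dc/text8.zip
Many online tutorials show how to process Wikipedia data into a training corpus; try them if you are interested.
The text8 corpus is UTF-8 encoded, stores all of its data on a single line, and contains no punctuation. We can also build our own data in the same format.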
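As a rough sketch of how one might build data in this format, the following uses only the standard library to lowercase the text, replace everything outside a-z with spaces, and collapse the result onto one line. The function name to_text8_style and the sample string are hypothetical, and the real text8 preprocessing may differ in details (for example in how digits are handled).

```python
import re


def to_text8_style(raw: str) -> str:
    """Lowercase, keep only a-z, and join everything into one space-separated line."""
    lowered = raw.lower()
    # Replace every run of non-letter characters (punctuation, digits, newlines) with one space.
    letters_only = re.sub(r"[^a-z]+", " ", lowered)
    return letters_only.strip()


if __name__ == "__main__":
    # Made-up sample text, only for illustration.
    sample = "Hello, World!\nThis is word2vec: a 2013 tool."
    print(to_text8_style(sample))
```

The returned string can be written to a file and passed to word2vec.Text8Corpus in place of "text8".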

First install gensim, which ships word2vec as a submodule (gensim.models.word2vec).

pip3 install --upgrade gensim

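Train the model.

from gensim.models import word2vec
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = word2vec.Text8Corpus("text8")   # load the corpus
model = word2vec.Word2Vec(sentences, vector_size=200, window=5, min_count=5)
# min_count sets the minimum frequency (default 5): a word that occurs fewer times than this in the corpus is dropped
# vector_size is the dimensionality of the word vectors (the parameter was named size before gensim 4.0)
# window is the maximum distance between the current word and a context word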
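To make the meaning of min_count and window concrete, here is a small standard-library sketch. It is not gensim's implementation: gensim also subsamples frequent words and, by default, randomly shrinks the effective window, and neither is modelled here. The helper names prune_by_min_count and context_pairs are hypothetical.

```python
from collections import Counter


def prune_by_min_count(tokens, min_count):
    """Drop every token whose total frequency is below min_count."""
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= min_count]


def context_pairs(tokens, window):
    """Yield (center, context) pairs for every word within `window` positions."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]


if __name__ == "__main__":
    tokens = "the cat sat on the mat the end".split()
    # Only "the" occurs at least twice, so every other word is pruned.
    print(prune_by_min_count(tokens, min_count=2))
    print(list(context_pairs(["a", "b", "c"], window=1)))
```

A larger window pairs each word with more distant neighbours, and a larger min_count shrinks the vocabulary.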

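The full parameter list (this signature is from a gensim release before 4.0; gensim 4.0 renamed size to vector_size and iter to epochs):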

class gensim.models.word2vec.Word2Vec(sentences=None, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=<built-in function hash>, iter=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000, compute_loss=False)

The official Google site provides a pre-trained model, GoogleNews-vectors-negative300.bin, stored in the binary format of the original C tool. We can load and use it as follows.

# Load a model in the C format
from gensim.models.keyedvectors import KeyedVectors

model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
# Write the same vectors back out in the plain-text format
model.save_word2vec_format('GoogleNews-vectors-negative300.txt', binary=False)
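With binary=False, the saved file is plain text: a header line holding the number of words and the vector dimension, then one line per word with its components separated by spaces. The sketch below parses that layout with the standard library only. The function name parse_word2vec_text is hypothetical, and the two-word sample is made up rather than taken from the GoogleNews model.

```python
import io


def parse_word2vec_text(handle):
    """Read word2vec text format: a 'count dim' header, then 'word v1 ... v_dim' lines."""
    count, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        if not line.strip():
            continue  # tolerate blank lines
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(v) for v in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    if len(vectors) != count:
        raise ValueError(f"header promised {count} words, found {len(vectors)}")
    return vectors


if __name__ == "__main__":
    # Made-up sample in the same layout, not real model data.
    sample = io.StringIO("2 3\nking 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n")
    print(parse_word2vec_text(sample))
```

For a file as large as the GoogleNews vectors, KeyedVectors.load_word2vec_format remains the practical choice; the sketch only shows what the text file contains.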

A model that we trained ourselves in Python can be saved and loaded as follows.

# Save and load a model trained in Python
from gensim.models import word2vec

model.save("text8.model")
model = word2vec.Word2Vec.load("text8.model")

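Using the model

# Look up the vector for the word "word"
print(model.wv['word'])
print(model.wv['word'][0])

# List every word in the vocabulary (gensim 4.0+; older releases used model.wv.vocab)
print(model.wv.key_to_index.keys())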

# Compute the similarity of two words
y1 = model.wv.similarity("woman", "man")
print("Similarity between woman and man:", y1)

# List the words most related to a given word
y2 = model.wv.most_similar("good", topn=20)  # the 20 most similar words
print("Words most related to good:\n")
for item in y2:
    print(item[0], item[1])

# Solve analogies
print(' "boy" is to "father" as "girl" is to ...? \n')
y3 = model.wv.most_similar(positive=['girl', 'father'], negative=['boy'], topn=3)
for item in y3:
    print(item[0], item[1])

more_examples = ["he his she", "big bigger bad", "going went being"]
for example in more_examples:
    a, b, x = example.split()
    predicted = model.wv.most_similar(positive=[x, b], negative=[a])[0][0]
    print("'%s' is to '%s' as '%s' is to '%s'" % (a, b, x, predicted))

# Find the word that does not belong
y4 = model.wv.doesnt_match("breakfast cereal dinner lunch".split())
print("Odd one out:", y4)
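The similarity score above is the cosine of the angle between two word vectors, and an analogy query ranks words against a combination of the positive vectors minus the negative ones. The sketch below illustrates the idea with hand-made two-dimensional vectors. The toy values and the helper names cosine and analogy are assumptions for illustration; gensim normalizes and averages the vectors before ranking, which this simplified version does not do.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, a, b, x):
    """Answer 'a is to b as x is to ?' by ranking words against b - a + x."""
    target = [vb - va + vx for va, vb, vx in zip(vectors[a], vectors[b], vectors[x])]
    # The query words themselves are excluded from the candidates.
    candidates = (w for w in vectors if w not in (a, b, x))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


if __name__ == "__main__":
    # Hand-made toy vectors: axis 0 loosely encodes gender, axis 1 generation.
    toy = {
        "boy": [1.0, 0.0],
        "girl": [-1.0, 0.0],
        "father": [1.0, 1.0],
        "mother": [-1.0, 1.0],
        "uncle": [1.0, 0.9],
    }
    print(cosine(toy["boy"], toy["father"]))
    print(analogy(toy, "boy", "father", "girl"))
```

With a trained model the vectors have hundreds of dimensions, but the ranking principle is the same.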

Exporting the vocabulary

from gensim.models import word2vec
import logging

# Logging code taken from http://rare-technologies.com/word2vec-tutorial/
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

model = word2vec.Word2Vec.load("text8.model")

# index_to_key holds the vocabulary as a list of words (gensim 4.0+; older releases used model.wv.vocab.keys())
vocab = model.wv.index_to_key

with open("my_vocab", 'w', encoding="utf-8") as f:
    # Write one word per line.
    for word in vocab:
        f.write(word + '\n')
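Once the vocabulary is on disk with one word per line, it can be read back into a word-to-index mapping. The following is a minimal sketch using only the standard library. The function name load_vocab is hypothetical, and the demo writes a small stand-in file instead of relying on a real my_vocab.

```python
import os
import tempfile


def load_vocab(path):
    """Read one word per line and map each word to its line index."""
    with open(path, encoding="utf-8") as f:
        words = [line.rstrip("\n") for line in f if line.strip()]
    return {word: index for index, word in enumerate(words)}


if __name__ == "__main__":
    # Write a small stand-in file, since "my_vocab" may not exist yet.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "my_vocab")
        with open(path, "w", encoding="utf-8") as f:
            f.write("the\nof\nand\n")
        print(load_vocab(path))
```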