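Stop Memorizing by Rote! Train Your Own Word Vector Model Step by Step with PaddlePaddle 2.0 (Full Code Included)

Implementing SkipGram Word Vector Training from Scratch: A Hands-On PaddlePaddle 2.0 Guide

Word vectors have long been a basic but essential building block of natural language processing. Unlike the discrete symbolic representations of traditional NLP methods, word vectors capture semantic relationships between words in a continuous vector space. Imagine a computer that understands that "king" relates to "queen" the way "man" relates to "woman": that would bring us one step closer to real language understanding. This article walks through a complete SkipGram word vector model in PaddlePaddle 2.0, from data preparation to model evaluation.

1. Environment Setup and Data Acquisition

Before starting, make sure the development environment is configured correctly. PaddlePaddle is Baidu's open-source deep learning platform, and its 2.0 release brings notable improvements in usability and performance. Install it with:

```bash
pip install paddlepaddle==2.0.0 -i https://mirror.baidu.com/pypi/simple
```

Training word vectors requires a large text corpus. text8 is a cleaned English Wikipedia corpus that works well as a first exercise. Punctuation and rare characters have been removed and digits are spelled out as words, so the file contains only lowercase letters and spaces.

```python
import requests

def download_text8():
    url = "https://dataset.bj.bcebos.com/word2vec/text8.txt"
    response = requests.get(url)
    response.raise_for_status()  # fail early on a bad HTTP status
    with open("./text8.txt", "wb") as f:
        f.write(response.content)

download_text8()
```

Once the download finishes, take a quick look at the data:

```python
with open("./text8.txt", "r") as f:
    corpus = f.read().strip("\n")

print(f"Corpus length: {len(corpus)}")
print("First 500 characters:")
print(corpus[:500])
```

2. Data Preprocessing and Vocabulary Construction

The raw text has to be preprocessed before it can be used for training. Start with basic tokenization. Because text8 is already space-separated, splitting on whitespace is enough:

```python
def preprocess_text(corpus):
    corpus = corpus.lower().split()
    return corpus

processed_corpus = preprocess_text(corpus)
print(f"Number of tokens: {len(processed_corpus)}")
print("First 20 tokens:", processed_corpus[:20])
```

Next, build the vocabulary. This is the key step that maps words to the integer IDs the model can process. Words are sorted by frequency, so frequent words get small IDs. Note that the returned `word_freq` is keyed by word ID, not by the word string:

```python
from collections import defaultdict

def build_vocab(corpus):
    word_freq = defaultdict(int)
    for word in corpus:
        word_freq[word] += 1
    # Most frequent word first, so it receives ID 0
    sorted_words = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)
    word2id = {word: idx for idx, (word, _) in enumerate(sorted_words)}
    id2word = {idx: word for word, idx in word2id.items()}
    # Re-key the frequency table by word ID
    word_freq = {word2id[word]: freq for word, freq in word_freq.items()}
    return word2id, id2word, word_freq

word2id, id2word, word_freq = build_vocab(processed_corpus)
vocab_size = len(word2id)
print(f"Vocabulary size: {vocab_size}")
print("Top 10 most frequent words:")
for i in range(10):
    print(f"{id2word[i]}: {word_freq[i]} occurrences")
```

3. SkipGram Model Implementation

The core idea of SkipGram is to predict the context words from the center word. Compared with CBOW, SkipGram tends to handle rare words better. The key steps of a PaddlePaddle implementation follow.

3.1 Model Definition

The model keeps two embedding tables, one for center words and one for context words. It scores a (center, context) pair with the dot product of the two vectors and trains with binary cross-entropy on that score.

```python
import paddle
import paddle.nn as nn

class SkipGramModel(nn.Layer):
    def __init__(self, vocab_size, embedding_dim):
        super(SkipGramModel, self).__init__()
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        init_range = 0.5 / embedding_dim
        # Center-word embedding layer
        self.center_embeddings = nn.Embedding(
            vocab_size, embedding_dim,
            weight_attr=paddle.ParamAttr(
                initializer=nn.initializer.Uniform(-init_range, init_range)))
        # Context-word embedding layer
        self.context_embeddings = nn.Embedding(
            vocab_size, embedding_dim,
            weight_attr=paddle.ParamAttr(
                initializer=nn.initializer.Uniform(-init_range, init_range)))

    def forward(self, center_words, context_words, labels):
        # Look up the embedding vectors
        center_emb = self.center_embeddings(center_words)     # [batch_size, embed_dim]
        context_emb = self.context_embeddings(context_words)  # [batch_size, embed_dim]
        # Similarity score: dot product of each pair
        scores = paddle.sum(center_emb * context_emb, axis=-1)  # [batch_size]
        scores = paddle.reshape(scores, [-1])
        # Binary cross-entropy against the 0/1 labels
        loss = nn.functional.binary_cross_entropy_with_logits(scores, labels)
        return loss
```

3.2 Negative Sampling

A softmax over the full vocabulary is expensive. Negative sampling approximates it by drawing a small number of random negative words for each positive pair. The code below samples word IDs in proportion to their frequency raised to the power 0.75 and rejects the true context word. The cumulative weights are computed once, so each draw is a binary search rather than a pass over the whole vocabulary.

```python
import random
from itertools import accumulate

def build_sampling_table(word_freq):
    """Precompute candidate IDs and cumulative weights (frequency ** 0.75)."""
    ids = list(word_freq.keys())
    cum_weights = list(accumulate(freq ** 0.75 for freq in word_freq.values()))
    return ids, cum_weights

def get_negative_samples(context_id, sampling_table, negative_num):
    ids, cum_weights = sampling_table
    negatives = []
    while len(negatives) < negative_num:
        neg_id = random.choices(ids, cum_weights=cum_weights, k=1)[0]
        # Reject the true context word so it is never labeled as a negative
        if neg_id != context_id:
            negatives.append(neg_id)
    return negatives
```

4. Training Data Preparation and Model Training

4.1 Generating Training Samples

SkipGram needs (center word, context word) pairs generated from the corpus. Each positive pair gets label 1 and is followed by `negative_ratio` sampled negatives with label 0. All samples are stored as word IDs, so they match the ID-keyed `word_freq` table.

```python
def generate_training_data(corpus, word2id, word_freq, window_size=5, negative_ratio=5):
    ids = [word2id[word] for word in corpus]
    sampling_table = build_sampling_table(word_freq)
    training_data = []
    for i in range(window_size, len(ids) - window_size):
        center_id = ids[i]
        context_ids = ids[i - window_size:i] + ids[i + 1:i + window_size + 1]
        for context_id in context_ids:
            # Positive sample
            training_data.append((center_id, context_id, 1))
            # Negative samples for this pair
            for neg_id in get_negative_samples(context_id, sampling_table, negative_ratio):
                training_data.append((center_id, neg_id, 0))
    return training_data

train_data = generate_training_data(processed_corpus, word2id, word_freq)
print(f"Number of training samples: {len(train_data)}")
```

With the default settings every token produces 10 positive pairs and 50 negatives, so the list holds 60 entries per token. On the full text8 corpus that is more than a typical machine can hold in memory, so it is worth starting with a slice such as `processed_corpus[:100000]`.

4.2 Training Loop

Because the samples already contain word IDs, the training function builds tensors from them directly.

```python
def train_skipgram(model, train_data, epochs=5, batch_size=512, learning_rate=0.001):
    optimizer = paddle.optimizer.Adam(
        learning_rate=learning_rate, parameters=model.parameters())
    for epoch in range(epochs):
        random.shuffle(train_data)
        total_loss = 0.0
        num_batches = 0
        for batch_idx, i in enumerate(range(0, len(train_data), batch_size)):
            batch = train_data[i:i + batch_size]
            center_words = paddle.to_tensor([item[0] for item in batch], dtype="int64")
            context_words = paddle.to_tensor([item[1] for item in batch], dtype="int64")
            labels = paddle.to_tensor([item[2] for item in batch], dtype="float32")
            loss = model(center_words, context_words, labels)
            loss.backward()
            optimizer.step()
            optimizer.clear_grad()
            # In PaddlePaddle 2.0 the loss is a 1-element tensor
            loss_value = loss.numpy()[0]
            total_loss += loss_value
            num_batches = batch_idx + 1
            if batch_idx % 1000 == 0:
                print(f"Epoch {epoch + 1}, Batch {batch_idx}, Loss: {loss_value:.4f}")
        print(f"Epoch {epoch + 1} completed. Average Loss: {total_loss / max(num_batches, 1):.4f}")

# Initialize the model
embedding_dim = 200
skipgram = SkipGramModel(vocab_size, embedding_dim)

# Start training
train_skipgram(skipgram, train_data)
```

5. Evaluating and Using the Word Vectors

After training, extract the word vectors and evaluate them.

5.1 Extracting and Saving Word Vectors

The center-word embedding table is written as plain text: a header line with the vocabulary size and dimension, then one word and its vector per line.

```python
def save_word_vectors(model, id2word, filename):
    embeddings = model.center_embeddings.weight.numpy()
    with open(filename, "w", encoding="utf-8") as f:
        f.write(f"{len(id2word)} {embeddings.shape[1]}\n")
        for idx in range(len(id2word)):
            word = id2word[idx]
            vector = " ".join(map(str, embeddings[idx]))
            f.write(f"{word} {vector}\n")

save_word_vectors(skipgram, id2word, "word_vectors.txt")
```

5.2 Cosine Similarity

Cosine similarity is a common way to measure how similar two word vectors are:

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

def find_most_similar(word, embeddings, word2id, id2word, topn=5):
    if word not in word2id:
        print(f"{word} not in vocabulary")
        return
    word_vec = embeddings[word2id[word]]
    similarities = []
    for idx in range(len(id2word)):
        # Skip the query word itself
        if idx != word2id[word]:
            sim = cosine_similarity(word_vec, embeddings[idx])
            similarities.append((id2word[idx], sim))
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:topn]

# Load the word vectors
embeddings = skipgram.center_embeddings.weight.numpy()

# Look up similar words
test_words = ["king", "france", "computer", "water"]
for word in test_words:
    if word in word2id:
        similar_words = find_most_similar(word, embeddings, word2id, id2word)
        print(f"\nWords most similar to {word}:")
        for similar, score in similar_words:
            print(f"{similar}: {score:.4f}")
```

5.3 Visualizing the Word Vectors

To get a better feel for how the vectors are distributed, reduce them to two dimensions with PCA and plot them:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def visualize_word_vectors(words, embeddings, word2id):
    # Keep only the words that are in the vocabulary
    words = [word for word in words if word in word2id]
    vectors = [embeddings[word2id[word]] for word in words]
    pca = PCA(n_components=2)
    reduced = pca.fit_transform(vectors)
    plt.figure(figsize=(10, 8))
    for i, word in enumerate(words):
        plt.scatter(reduced[i, 0], reduced[i, 1])
        plt.annotate(word, (reduced[i, 0], reduced[i, 1]))
    plt.show()

sample_words = ["king", "queen", "man", "woman", "paris", "france", "london", "england"]
visualize_word_vectors(sample_words, embeddings, word2id)
```

6. Advanced Techniques and Optimization Tips

6.1 Dynamic Window Size

Standard SkipGram uses a fixed window size, but in real language some context words are closely related to the center word and others only loosely. Drawing a random window size for each center word, up to a maximum, gives nearby words more weight and can improve the model:

```python
def dynamic_window_size(max_window=5):
    return random.randint(1, max_window)
```

6.2 Adaptive Learning Rate

Lowering the learning rate as training progresses helps the model converge. For the scheduler to take effect, pass it to the optimizer as `learning_rate` and call `scheduler.step(avg_loss)` once per epoch:

```python
scheduler = paddle.optimizer.lr.ReduceOnPlateau(
    learning_rate=0.001, mode="min", factor=0.5, patience=2, verbose=True)
```

6.3 Combining Word Vectors

Trained word vectors can be added and subtracted to capture more complex semantic relationships:

```python
def word_vector_math(word1, word2, word3, embeddings, word2id):
    if word1 not in word2id or word2 not in word2id or word3 not in word2id:
        print("Some words not in vocabulary")
        return None
    vec = embeddings[word2id[word1]] - embeddings[word2id[word2]] + embeddings[word2id[word3]]
    return vec

# Example: king - man + woman ≈ queen
result_vector = word_vector_math("king", "man", "woman", embeddings, word2id)
if result_vector is not None:
    # NOTE: find_most_similar_vector is not defined in this article. It is assumed
    # to rank vocabulary words by cosine similarity to an arbitrary vector.
    similar_words = find_most_similar_vector(result_vector, embeddings, id2word)
    print("\nWords closest to king - man + woman:")
    for word, score in similar_words:
        print(f"{word}: {score:.4f}")
```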
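
The analogy snippet above calls `find_most_similar_vector`, which the article never defines. One way to fill that gap, offered as a sketch rather than the author's implementation, is to rank every row of the embedding matrix by cosine similarity to the query vector. The `exclude_ids` parameter is an assumption added here: the words used to build the query tend to rank near the top, so it helps to be able to skip them. The toy vectors at the bottom are made-up values for illustration, not trained embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def find_most_similar_vector(query_vec, embeddings, id2word, topn=5, exclude_ids=()):
    """Return the topn (word, score) pairs closest to query_vec.

    Hypothetical helper: the name and first three parameters match the call in
    the article; topn and exclude_ids are assumptions.
    """
    excluded = set(exclude_ids)
    scored = [
        (id2word[idx], cosine(query_vec, row))
        for idx, row in enumerate(embeddings)
        if idx not in excluded
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Toy data: hypothetical 2-D vectors, chosen so the analogy works exactly.
toy_id2word = {0: "king", 1: "queen", 2: "man", 3: "woman"}
toy_embeddings = [
    [1.0, 1.0],    # king
    [1.0, -1.0],   # queen
    [0.2, 1.0],    # man
    [0.2, -1.0],   # woman
]

# king - man + woman, computed element by element
toy_query = [k - m + w for k, m, w in
             zip(toy_embeddings[0], toy_embeddings[2], toy_embeddings[3])]
toy_result = find_most_similar_vector(
    toy_query, toy_embeddings, toy_id2word, topn=1, exclude_ids=(0, 2, 3))
print(toy_result)
```

The function accepts plain lists as well as the rows of a NumPy array, so it can be called as `find_most_similar_vector(result_vector, embeddings, id2word)` exactly as in the snippet above. It is written without dependencies for clarity. On a real vocabulary, a single NumPy matrix-vector product would be considerably faster than this Python loop.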