Stop Memorizing the Skip-gram Formulas! Implement a Word2Vec Model from Scratch in Python (Full Code Included)
Published: 2026/6/1 17:49:42
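Implementing Skip-gram from Scratch: Dissecting Word2Vec's Core Logic in Python

Word2Vec is without question a milestone algorithm in natural language processing. Many tutorials hand you the Skip-gram formulas, but once you sit down to implement them, those elegant symbols tend to turn into a tangled mess. This article builds a runnable Word2Vec model from scratch in Python and uses the code to show what is actually happening behind word vectors.

1. Environment Setup and Data Preprocessing

1.1 Basic Tooling

We need the following Python libraries to build the model:

```python
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
import random
```

1.2 Building a Toy Corpus

Consider this tiny corpus. It is small enough to debug by hand, yet it still shows the key behavior:

```python
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "i love natural language processing",
    "word embeddings capture semantic meaning",
]
```

1.3 Building the Vocabulary

List and dict comprehensions keep the vocabulary construction compact. Words are filtered by `min_count` before indices are assigned, so the indices stay contiguous and can be used directly as matrix rows:

```python
def build_vocab(corpus, min_count=1):
    words = [word for sentence in corpus for word in sentence.split()]
    word_counts = Counter(words)
    # Filter first, then enumerate, so indices have no gaps when min_count > 1
    kept = [word for word, count in word_counts.items() if count >= min_count]
    vocab = {word: idx for idx, word in enumerate(kept)}
    reverse_vocab = {idx: word for word, idx in vocab.items()}
    return vocab, reverse_vocab
```

2. Skip-gram Model Architecture

2.1 Why Two Matrices

The core of Skip-gram is a pair of weight matrices:

- Center-word matrix W, with shape [vocab_size, embedding_dim]
- Context-word matrix W' (`W_prime` in the code), with shape [embedding_dim, vocab_size]

A detail to watch at initialization:

```python
vocab, reverse_vocab = build_vocab(corpus)
embedding_dim = 100
vocab_size = len(vocab)

# Xavier/Glorot initialization
W = np.random.randn(vocab_size, embedding_dim) * np.sqrt(2. / (vocab_size + embedding_dim))
W_prime = np.random.randn(embedding_dim, vocab_size) * np.sqrt(2. / (embedding_dim + vocab_size))
```

2.2 Sliding-Window Generator

A generator yields (center, context) pairs lazily, one sentence at a time:

```python
def generate_training_data(corpus, window_size=2):
    for sentence in corpus:
        words = sentence.split()
        for i, center_word in enumerate(words):
            # Clamp the window to the sentence boundaries
            context_start = max(0, i - window_size)
            context_end = min(len(words), i + window_size + 1)
            for j in range(context_start, context_end):
                if j != i:
                    yield (center_word, words[j])
```

3. Negative Sampling

3.1 Computing the Sampling Distribution

The key to negative sampling is a sensible noise distribution. Raising the counts to the power 0.75 flattens the unigram distribution so that rare words are drawn more often than their raw frequency would allow:

```python
def get_negative_samples(word_counts, power=0.75):
    total = sum(count ** power for count in word_counts.values())
    sampling_prob = {word: (count ** power) / total
                     for word, count in word_counts.items()}
    return sampling_prob
```

3.2 Drawing Negative Samples

NumPy's random choice does the sampling in a single call. Note that this function does not exclude the true context word, and that `replace=False` requires `k` to be no larger than the vocabulary:

```python
def negative_sampling_batch(sampling_prob, vocab, k=5):
    words = list(vocab.keys())
    probs = np.array([sampling_prob[word] for word in words])
    # Renormalize in case vocab is a filtered subset of the counted words
    probs = probs / probs.sum()
    return np.random.choice(words, size=k, p=probs, replace=False)
```

4. Training Loop and Parameter Updates

4.1 Forward Pass

Look up the two vectors and compute the score:

```python
def forward_pass(center_word, context_word, W, W_prime, vocab):
    center_idx = vocab[center_word]
    context_idx = vocab[context_word]
    h = W[center_idx]                # center-word vector
    u = W_prime[:, context_idx]      # context-word vector
    # Score of the positive pair
    positive_score = np.dot(h, u)
    return h, u, positive_score
```

4.2 Loss Function

The negative-sampling loss rewards a high score for the positive pair and low scores for the sampled negatives. The `sigmoid` helper is defined here because the later functions rely on it:

```python
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def compute_loss(positive_score, negative_scores):
    pos_loss = -np.log(sigmoid(positive_score))
    neg_loss = -np.sum(np.log(sigmoid(-negative_scores)))
    return pos_loss + neg_loss
```

4.3 Parameter Update Rule

The gradients are computed by hand. For the loss above, the derivative with respect to the positive score is sigmoid(score) - 1, and with respect to each negative score it is sigmoid(score). The gradient for the center vector is accumulated from the context vectors before they are updated, so every gradient is taken at the same point:

```python
def update_parameters(center_word, context_word, negative_samples,
                      W, W_prime, vocab, learning_rate=0.01):
    center_idx = vocab[center_word]
    context_idx = vocab[context_word]
    negative_indices = [vocab[word] for word in negative_samples]
    h = W[center_idx].copy()
    grad_h = np.zeros_like(h)

    # Positive sample: dL/d(score) = sigmoid(score) - 1
    positive_grad = sigmoid(np.dot(h, W_prime[:, context_idx])) - 1
    grad_h += positive_grad * W_prime[:, context_idx]
    W_prime[:, context_idx] -= learning_rate * positive_grad * h

    # Negative samples: dL/d(score) = sigmoid(score)
    for neg_idx in negative_indices:
        negative_grad = sigmoid(np.dot(h, W_prime[:, neg_idx]))
        grad_h += negative_grad * W_prime[:, neg_idx]
        W_prime[:, neg_idx] -= learning_rate * negative_grad * h

    # Update the center-word vector
    W[center_idx] -= learning_rate * grad_h
    return W, W_prime
```

5. Visualizing and Debugging Training

5.1 Monitoring the Loss Curve

Record the loss per epoch and plot it once training finishes:

```python
def train_model(corpus, epochs=10, learning_rate=0.01):
    losses = []
    for epoch in range(epochs):
        epoch_loss = 0
        for center, context in generate_training_data(corpus):
            # training step...
            # (elided here; it is expected to set batch_loss for this pair)
            epoch_loss += batch_loss
        losses.append(epoch_loss)
        print(f"Epoch {epoch+1}, Loss: {epoch_loss:.4f}")
    plt.plot(losses)
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.title("Training Loss Curve")
    plt.show()
```

5.2 Visualizing the Word Vectors

Project the trained vectors to two dimensions with PCA. `top_n` is clamped to the number of rows in `W`, since the toy corpus has fewer than 20 distinct words:

```python
from sklearn.decomposition import PCA

def visualize_embeddings(W, vocab, reverse_vocab, top_n=20):
    top_n = min(top_n, len(W))  # avoid indexing past the vocabulary
    pca = PCA(n_components=2)
    reduced = pca.fit_transform(W[:top_n])
    plt.figure(figsize=(10, 8))
    for i in range(top_n):
        plt.scatter(reduced[i, 0], reduced[i, 1])
        plt.annotate(reverse_vocab[i], (reduced[i, 0], reduced[i, 1]))
    plt.title("Word Embeddings Visualization")
    plt.show()
```

6. Applying and Validating the Model

6.1 Finding Similar Words

Rank every other word by cosine similarity to the query:

```python
def most_similar(word, W, vocab, reverse_vocab, top_k=5):
    if word not in vocab:
        return []
    word_vec = W[vocab[word]]
    similarities = []
    for idx, vec in enumerate(W):
        if idx == vocab[word]:
            continue  # skip the query word itself
        cos_sim = np.dot(word_vec, vec) / (np.linalg.norm(word_vec) * np.linalg.norm(vec))
        similarities.append((reverse_vocab[idx], cos_sim))
    return sorted(similarities, key=lambda x: -x[1])[:top_k]
```

6.2 Analogy Test

This implements the classic king - man + woman test. `reverse_vocab` is passed in explicitly, and the three query words are excluded from the results. On the toy corpus above these words are out of vocabulary, so the function returns None; the test only becomes meaningful on a larger corpus.

```python
def word_analogy(a, b, c, W, vocab, reverse_vocab):
    for word in [a, b, c]:
        if word not in vocab:
            return None
    # b - a + c, e.g. king - man + woman
    vec = W[vocab[b]] - W[vocab[a]] + W[vocab[c]]
    exclude = {vocab[a], vocab[b], vocab[c]}
    similarities = []
    for idx, word_vec in enumerate(W):
        if idx in exclude:
            continue
        cos_sim = np.dot(vec, word_vec) / (np.linalg.norm(vec) * np.linalg.norm(word_vec))
        similarities.append((reverse_vocab[idx], cos_sim))
    return sorted(similarities, key=lambda x: -x[1])[:5]
```

7. Performance Tips

7.1 Mini-Batch Training

Group several training pairs into a mini-batch:

```python
def generate_batches(corpus, batch_size=32):
    batch = []
    for pair in generate_training_data(corpus):
        batch.append(pair)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch
```

7.2 Learning-Rate Decay

A stepwise exponential decay multiplies the learning rate by `decay_rate` once every `decay_steps` steps:

```python
class LearningRateScheduler:
    def __init__(self, initial_lr, decay_rate, decay_steps):
        self.lr = initial_lr
        self.decay_rate = decay_rate
        self.decay_steps = decay_steps
        self.steps = 0

    def step(self):
        self.steps += 1
        if self.steps % self.decay_steps == 0:
            self.lr *= self.decay_rate
        return self.lr
```

While implementing this, I found that setting embedding_dim between 100 and 300 worked best, and that k = 5 to 10 negative samples is already enough for a small corpus. An initial learning rate of 0.025 combined with exponential decay (a factor of 0.98 every 1000 steps) gave stable training.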
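The training step inside the loop in section 5.1 is elided in the article. The block below is one possible way to fill it in, written as a minimal self-contained sketch rather than the author's actual code. The function name `train_skipgram`, the small Gaussian initialization, the hyperparameter defaults (`embedding_dim=16`, `k=3`, `learning_rate=0.1`, `epochs=100`), drawing negatives with replacement, and dropping negatives that collide with the true context word are all assumptions made for illustration.

```python
import numpy as np
from collections import Counter


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train_skipgram(corpus, embedding_dim=16, window_size=2, k=3,
                   epochs=100, learning_rate=0.1, seed=0):
    """Hypothetical end-to-end trainer: returns (W, W_prime, vocab, losses)."""
    rng = np.random.default_rng(seed)
    sentences = [s.split() for s in corpus]
    counts = Counter(w for s in sentences for w in s)
    vocab = {w: i for i, w in enumerate(counts)}
    V = len(vocab)

    # Noise distribution: unigram counts raised to the 0.75 power (section 3.1)
    probs = np.array([counts[w] ** 0.75 for w in vocab], dtype=float)
    probs /= probs.sum()

    # Assumption: small Gaussian init instead of the Xavier init in section 2.1
    W = rng.normal(0.0, 0.1, (V, embedding_dim))        # center vectors
    W_prime = rng.normal(0.0, 0.1, (embedding_dim, V))  # context vectors

    losses = []
    for _ in range(epochs):
        epoch_loss = 0.0
        for words in sentences:
            for i, center in enumerate(words):
                start = max(0, i - window_size)
                end = min(len(words), i + window_size + 1)
                for j in range(start, end):
                    if j == i:
                        continue
                    c, o = vocab[center], vocab[words[j]]
                    # Draw k negatives (with replacement), drop any equal to o
                    neg = rng.choice(V, size=k, p=probs)
                    neg = neg[neg != o]

                    h = W[c].copy()
                    pos_score = h @ W_prime[:, o]
                    neg_scores = h @ W_prime[:, neg]
                    epoch_loss += float(-np.log(sigmoid(pos_score))
                                        - np.sum(np.log(sigmoid(-neg_scores))))

                    # Gradients with respect to the scores
                    g_pos = sigmoid(pos_score) - 1.0
                    g_neg = sigmoid(neg_scores)

                    # Gradient for h uses the context vectors before their update
                    grad_h = g_pos * W_prime[:, o] + W_prime[:, neg] @ g_neg

                    W_prime[:, o] -= learning_rate * g_pos * h
                    for n, g in zip(neg, g_neg):
                        W_prime[:, n] -= learning_rate * g * h
                    W[c] -= learning_rate * grad_h
        losses.append(epoch_loss)
    return W, W_prime, vocab, losses


corpus = [
    "the quick brown fox jumps over the lazy dog",
    "i love natural language processing",
    "word embeddings capture semantic meaning",
]
W, W_prime, vocab, losses = train_skipgram(corpus)
print(f"first epoch loss: {losses[0]:.2f}, last epoch loss: {losses[-1]:.2f}")
```

Because the negatives are resampled at every step, the per-epoch loss is noisy. Compare the overall trend of the curve rather than individual epochs.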
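Hand-written gradients like those in section 4.3 are easy to get subtly wrong, so a finite-difference check is a cheap safeguard. The sketch below compares the analytic gradient of the negative-sampling loss with respect to the center vector against a central-difference estimate. The helper names (`sgns_loss`, `analytic_grad_h`, `numeric_grad_h`) and the random test vectors are hypothetical and exist only for this check.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sgns_loss(h, u_pos, U_neg):
    """Negative-sampling loss for a center vector h, a positive context
    vector u_pos, and negative context vectors stored as columns of U_neg."""
    return -np.log(sigmoid(h @ u_pos)) - np.sum(np.log(sigmoid(-(h @ U_neg))))


def analytic_grad_h(h, u_pos, U_neg):
    """Gradient of sgns_loss with respect to h:
    (sigmoid(h.u_pos) - 1) * u_pos + sum_n sigmoid(h.u_n) * u_n."""
    return (sigmoid(h @ u_pos) - 1.0) * u_pos + U_neg @ sigmoid(h @ U_neg)


def numeric_grad_h(h, u_pos, U_neg, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(h)
    for i in range(h.size):
        step = np.zeros_like(h)
        step[i] = eps
        grad[i] = (sgns_loss(h + step, u_pos, U_neg)
                   - sgns_loss(h - step, u_pos, U_neg)) / (2 * eps)
    return grad


# Random vectors purely for the check (assumed sizes: dim 8, 5 negatives)
rng = np.random.default_rng(0)
h = rng.normal(size=8)
u_pos = rng.normal(size=8)
U_neg = rng.normal(size=(8, 5))

max_diff = np.max(np.abs(analytic_grad_h(h, u_pos, U_neg)
                         - numeric_grad_h(h, u_pos, U_neg)))
print("max abs difference:", max_diff)
```

If the positive term is written as `sigmoid(-score) - 1` instead of `sigmoid(score) - 1`, the two gradients no longer agree, which is the kind of error this check is meant to expose.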