Stop Memorizing the Skip-gram Formulas! Implement a Word2Vec Model from Scratch in Python (Full Code Included)

Published: 2026/6/1 17:49:42

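Implementing Skip-gram from Scratch: Unpacking the Core Logic of Word2Vec in Python

In natural language processing, Word2Vec is undoubtedly a milestone algorithm. Many tutorials hand you the Skip-gram formulas, but once you sit down to implement them, those elegant symbols tend to turn into a tangled mess. This article walks through building a runnable Word2Vec model from scratch in Python, using code to reveal what is going on behind word vectors.

1. Environment Setup and Data Preprocessing

1.1 Basic tooling

We need the following Python libraries to build the model:

```python
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
import random
```

1.2 Building a toy corpus

Consider this tiny corpus. It is small enough to debug easily, yet still shows the key characteristics:

```python
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "i love natural language processing",
    "word embeddings capture semantic meaning",
]
```

1.3 Vocabulary construction

List and dict comprehensions build the vocabulary compactly. Words are filtered by frequency before indices are assigned, so the indices stay contiguous even when min_count is greater than 1:

```python
def build_vocab(corpus, min_count=1):
    words = [word for sentence in corpus for word in sentence.split()]
    word_counts = Counter(words)
    # Filter first, then enumerate, so indices have no gaps
    kept = [word for word, count in word_counts.items() if count >= min_count]
    vocab = {word: idx for idx, word in enumerate(kept)}
    reverse_vocab = {idx: word for word, idx in vocab.items()}
    return vocab, reverse_vocab
```

2. Skip-gram Model Architecture

2.1 The two-matrix design

The core of Skip-gram is two weight matrices:

- the center-word matrix W, with shape [vocab_size, embedding_dim]
- the context-word matrix W' (W_prime in the code), with shape [embedding_dim, vocab_size]

Initialization details worth noting:

```python
vocab, reverse_vocab = build_vocab(corpus)

embedding_dim = 100
vocab_size = len(vocab)

# Xavier/Glorot initialization
W = np.random.randn(vocab_size, embedding_dim) * np.sqrt(2. / (vocab_size + embedding_dim))
W_prime = np.random.randn(embedding_dim, vocab_size) * np.sqrt(2. / (embedding_dim + vocab_size))
```

2.2 Sliding-window generator

A generator that yields (center word, context word) pairs from a context window:

```python
def generate_training_data(corpus, window_size=2):
    for sentence in corpus:
        words = sentence.split()
        for i, center_word in enumerate(words):
            # Clamp the window to the sentence boundaries
            context_start = max(0, i - window_size)
            context_end = min(len(words), i + window_size + 1)
            for j in range(context_start, context_end):
                if j != i:
                    yield (center_word, words[j])
```

3. Negative Sampling

3.1 Computing the sampling distribution

The key to negative sampling is a sensible sampling distribution. Here word_counts is a Counter over the corpus tokens, like the one built inside build_vocab, and each count is raised to the power 0.75 before normalizing:

```python
def get_negative_samples(word_counts, power=0.75):
    total = sum(count**power for count in word_counts.values())
    sampling_prob = {word: (count**power) / total
                     for word, count in word_counts.items()}
    return sampling_prob
```

3.2 Drawing negative samples

NumPy's random choice draws all k samples in one call:

```python
def negative_sampling_batch(sampling_prob, vocab, k=5):
    words = list(vocab.keys())
    probs = np.array([sampling_prob[word] for word in words])
    probs = probs / probs.sum()  # renormalize in case vocab is a filtered subset
    # replace=False requires k to be no larger than the vocabulary size
    return np.random.choice(words, size=k, p=probs, replace=False)
```

4. Training Loop and Parameter Updates

4.1 Forward pass

Vector lookup and score computation:

```python
def forward_pass(center_word, context_word, W, W_prime, vocab):
    center_idx = vocab[center_word]
    context_idx = vocab[context_word]
    h = W[center_idx]               # center-word vector
    u = W_prime[:, context_idx]     # context-word vector
    positive_score = np.dot(h, u)   # score of the positive pair
    return h, u, positive_score
```

4.2 Loss function

The negative-sampling version of the loss. The sigmoid helper it relies on is defined here as well:

```python
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def compute_loss(positive_score, negative_scores):
    negative_scores = np.asarray(negative_scores)
    pos_loss = -np.log(sigmoid(positive_score))
    neg_loss = -np.sum(np.log(sigmoid(-negative_scores)))
    return pos_loss + neg_loss
```

4.3 Parameter update rule

Gradients and updates are computed by hand. Differentiating the loss above with respect to a score s gives sigmoid(s) - 1 for the positive pair and sigmoid(s) for each negative sample. The gradient for the center vector is accumulated before W_prime is modified, so that it is based on the same parameter values as the forward pass:

```python
def update_parameters(center_word, context_word, negative_samples,
                      W, W_prime, vocab, learning_rate=0.01):
    center_idx = vocab[center_word]
    context_idx = vocab[context_word]
    negative_indices = [vocab[word] for word in negative_samples]
    h = W[center_idx].copy()

    # Gradients of the loss with respect to each score
    positive_grad = sigmoid(np.dot(h, W_prime[:, context_idx])) - 1
    negative_grads = [sigmoid(np.dot(h, W_prime[:, neg_idx]))
                      for neg_idx in negative_indices]

    # Center-vector gradient, computed before W_prime changes
    center_grad = positive_grad * W_prime[:, context_idx]
    for neg_idx, grad in zip(negative_indices, negative_grads):
        center_grad = center_grad + grad * W_prime[:, neg_idx]

    # Update the context-side vectors
    W_prime[:, context_idx] -= learning_rate * positive_grad * h
    for neg_idx, grad in zip(negative_indices, negative_grads):
        W_prime[:, neg_idx] -= learning_rate * grad * h

    # Update the center-word vector
    W[center_idx] -= learning_rate * center_grad
    return W, W_prime
```

5. Visualizing and Debugging Training

5.1 Monitoring the loss curve

Record the loss as training proceeds:

```python
def train_model(corpus, epochs=10, learning_rate=0.01):
    losses = []
    for epoch in range(epochs):
        epoch_loss = 0
        for center, context in generate_training_data(corpus):
            # Training step...
            epoch_loss += batch_loss
        losses.append(epoch_loss)
        print(f"Epoch {epoch+1}, Loss: {epoch_loss:.4f}")
    plt.plot(losses)
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.title("Training Loss Curve")
    plt.show()
```

5.2 Visualizing word vectors

Use PCA to project the trained vectors down to two dimensions:

```python
from sklearn.decomposition import PCA

def visualize_embeddings(W, vocab, reverse_vocab, top_n=20):
    top_n = min(top_n, len(W))  # the toy corpus has fewer than 20 words
    pca = PCA(n_components=2)
    reduced = pca.fit_transform(W[:top_n])
    plt.figure(figsize=(10, 8))
    for i in range(top_n):
        plt.scatter(reduced[i, 0], reduced[i, 1])
        plt.annotate(reverse_vocab[i], (reduced[i, 0], reduced[i, 1]))
    plt.title("Word Embeddings Visualization")
    plt.show()
```

6. Applying and Validating the Model

6.1 Finding similar words

Cosine similarity between the query word and every other word:

```python
def most_similar(word, W, vocab, reverse_vocab, top_k=5):
    if word not in vocab:
        return []
    word_vec = W[vocab[word]]
    similarities = []
    for idx, vec in enumerate(W):
        if idx == vocab[word]:
            continue  # skip the query word itself
        cos_sim = np.dot(word_vec, vec) / (np.linalg.norm(word_vec) * np.linalg.norm(vec))
        similarities.append((reverse_vocab[idx], cos_sim))
    return sorted(similarities, key=lambda x: -x[1])[:top_k]
```

6.2 Analogy test

The classic king - man + woman test. The function computes b - a + c, so that query corresponds to a="man", b="king", c="woman". The three query words are excluded from the candidates, since they would otherwise tend to rank first:

```python
def word_analogy(a, b, c, W, vocab, reverse_vocab):
    for word in [a, b, c]:
        if word not in vocab:
            return None
    vec = W[vocab[b]] - W[vocab[a]] + W[vocab[c]]
    query_indices = {vocab[a], vocab[b], vocab[c]}
    similarities = []
    for idx, word_vec in enumerate(W):
        if idx in query_indices:
            continue
        cos_sim = np.dot(vec, word_vec) / (np.linalg.norm(vec) * np.linalg.norm(word_vec))
        similarities.append((reverse_vocab[idx], cos_sim))
    return sorted(similarities, key=lambda x: -x[1])[:5]
```

7. Performance Tips

7.1 Mini-batch training

Group multiple samples into mini-batches:

```python
def generate_batches(corpus, batch_size=32):
    batch = []
    for pair in generate_training_data(corpus):
        batch.append(pair)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch
```

7.2 Learning-rate decay

An exponentially decaying learning rate:

```python
class LearningRateScheduler:
    def __init__(self, initial_lr, decay_rate, decay_steps):
        self.lr = initial_lr
        self.decay_rate = decay_rate
        self.decay_steps = decay_steps
        self.steps = 0

    def step(self):
        self.steps += 1
        if self.steps % self.decay_steps == 0:
            self.lr *= self.decay_rate
        return self.lr
```

While implementing this, I found that embedding_dim values between 100 and 300 worked best, and that k = 5 to 10 negative samples are already enough for a small corpus. An initial learning rate of 0.025 combined with exponential decay (multiplying by 0.98 every 1000 steps) gave stable training.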
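
The training step in section 5.1 is left elided, so as a rough illustration of how the pieces from sections 2 through 4 and the hyperparameters mentioned above could fit together, here is a minimal, self-contained sketch. It is one possible arrangement, not necessarily the loop the article had in mind: the function name train_skipgram, the small embedding_dim of 16, the fixed random seed, drawing negatives with replacement, and dropping negatives that collide with the positive context word are all assumptions made for this example.

```python
import numpy as np
from collections import Counter

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_skipgram(corpus, embedding_dim=16, window_size=2, k=5, epochs=100,
                   initial_lr=0.025, decay_rate=0.98, decay_steps=1000, seed=0):
    """Hypothetical end-to-end loop: Skip-gram with negative sampling."""
    rng = np.random.default_rng(seed)
    sentences = [s.split() for s in corpus]
    counts = Counter(w for s in sentences for w in s)
    vocab = {w: i for i, w in enumerate(counts)}
    V = len(vocab)

    # Unigram counts raised to the 0.75 power, then normalized
    probs = np.array([counts[w] for w in vocab], dtype=float) ** 0.75
    probs /= probs.sum()

    # Glorot-style initialization for both matrices
    scale = np.sqrt(2.0 / (V + embedding_dim))
    W = rng.normal(0.0, scale, (V, embedding_dim))
    W_prime = rng.normal(0.0, scale, (embedding_dim, V))

    # (center index, context index) pairs from a sliding window
    pairs = [(vocab[s[i]], vocab[s[j]])
             for s in sentences
             for i in range(len(s))
             for j in range(max(0, i - window_size), min(len(s), i + window_size + 1))
             if j != i]

    lr, step, losses = initial_lr, 0, []
    for _ in range(epochs):
        epoch_loss = 0.0
        for c, o in pairs:
            # Assumption: sample with replacement, drop collisions with the positive
            neg = rng.choice(V, size=k, p=probs)
            neg = neg[neg != o]

            h = W[c].copy()
            pos_score = h @ W_prime[:, o]
            neg_scores = h @ W_prime[:, neg]
            epoch_loss += (-np.log(sigmoid(pos_score))
                           - np.sum(np.log(sigmoid(-neg_scores))))

            # Gradients of the loss with respect to the scores
            g_pos = sigmoid(pos_score) - 1.0
            g_neg = sigmoid(neg_scores)

            # Center-vector gradient uses W_prime before it is updated
            grad_h = g_pos * W_prime[:, o] + W_prime[:, neg] @ g_neg

            W_prime[:, o] -= lr * g_pos * h
            for idx, g in zip(neg, g_neg):  # loop so repeated indices all apply
                W_prime[:, idx] -= lr * g * h
            W[c] -= lr * grad_h

            # Exponential decay every decay_steps updates
            step += 1
            if step % decay_steps == 0:
                lr *= decay_rate
        losses.append(epoch_loss)
    return W, vocab, losses

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "i love natural language processing",
    "word embeddings capture semantic meaning",
]
W, vocab, losses = train_skipgram(corpus)
print(f"first epoch loss {losses[0]:.3f}, last epoch loss {losses[-1]:.3f}")
```

On a corpus this small the summed loss should fall over the epochs, but the resulting vectors are only useful for checking that the mechanics work. Meaningful similarities or analogies would need far more text.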
