Xiaohongshu Data Collection in 3 Steps: A Hands-On Guide to Automated Analysis with Python

Published: 2026/6/27 18:30:57

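[Free download link] xhs: a request wrapper built on the Xiaohongshu Web client. https://reajason.github.io/xhs/ Project address: https://gitcode.com/gh_mirrors/xh/xhs

In today's data-driven business environment, Xiaohongshu, one of China's leading lifestyle-sharing platforms, produces a huge volume of user-generated content every day. For data analysts, market researchers, and content creators, collecting this data efficiently and compliantly has become an important problem. xhs is a Python library that wraps the requests made by the Xiaohongshu Web client and gives developers a stable, easy-to-use API for data collection.

Key points:
- xhs handles the technical hurdles of collecting Xiaohongshu data
- It provides a complete API wrapper, so you do not need to deal with low-level network requests
- It supports multiple login methods and data retrieval functions

How do you get started with Xiaohongshu data collection?

Environment setup and installation

xhs is published on PyPI, so a single command installs it:

    # Use case: install xhs quickly and start collecting data
    pip install xhs

If you need the latest features, install directly from the GitCode repository:

    # Use case: get the latest development version
    git clone https://gitcode.com/gh_mirrors/xh/xhs
    cd xhs
    python setup.py install

Result: once installation finishes, you can import the xhs module in your Python environment and start collecting data.

Core dependencies and configuration

xhs relies on a few key libraries for its full feature set:

    # Use case: set up a complete development environment
    pip install playwright requests pycryptodome
    playwright install

Key points:
- Playwright simulates browser behavior
- Requests handles HTTP requests
- PyCryptodome provides cryptographic functions
- A complete signature verification mechanism secures the requests

What are the core features of xhs?

xhs wraps a broad set of Xiaohongshu APIs. The main functional modules are:

Module               Main functions                                 Typical scenarios
User authentication  QR-code login, phone verification-code login  Automated account management
Content collection   Note details, search, user information         Competitor analysis, content monitoring
Interaction          Likes, favorites, comments                     Community operations, user engagement
Data management      File upload, content publishing                Automated content publishing

Secure login system

xhs supports two authentication methods, QR-code login and phone verification-code login:

    # Use case: log in with a QR code to obtain credentials
    import time

    import qrcode
    from xhs import XhsClient


    def sign(uri, data=None, a1="", web_session=""):
        # Signing function implementation goes here
        pass


    xhs_client = XhsClient(sign=sign)
    qr_res = xhs_client.get_qrcode()
    qr_id = qr_res["qr_id"]
    qr_code = qr_res["code"]

    # Render the QR code in the terminal for the user to scan
    qr = qrcode.QRCode(
        version=1,
        error_correction=qrcode.ERROR_CORRECT_L,
        box_size=50,
        border=1,
    )
    qr.add_data(qr_res["url"])
    qr.make()
    qr.print_ascii()

    # Poll the login status until the scan is confirmed
    while True:
        check_qrcode = xhs_client.check_qrcode(qr_id, qr_code)
        if check_qrcode["code_status"] == 2:
            print("Login successful")
            break
        time.sleep(1)  # avoid hammering the status endpoint

Result: with QR-code login, users obtain access without exposing their account password, which suits long-running collection tasks.

Data collection features

The core collection functions cover the main data dimensions of Xiaohongshu:

    # Use case: fetch note details and user information
    from xhs import XhsClient

    # Initialize the client; sign is the signing function defined earlier
    cookie = "your_cookie_here"
    xhs_client = XhsClient(cookie, sign=sign)

    # Fetch note details
    # Field names are illustrative; inspect the returned dict for the exact keys
    note_info = xhs_client.get_note_by_id("63db8819000000001a01ead1")
    print(f"Note title: {note_info.get('title', '')}")
    print(f"Likes: {note_info.get('likes', 0)}")
    print(f"Favorites: {note_info.get('collected', 0)}")

    # Search for related notes
    search_results = xhs_client.get_note_by_keyword(
        keyword="Python编程",
        page=1,
        page_size=20,
    )
    print(f"Found {len(search_results.get('items', []))} related notes")

    # Fetch user information
    user_info = xhs_client.get_user_info("user_id_here")
    print(f"Nickname: {user_info.get('nickname', '')}")
    print(f"Followers: {user_info.get('fans', 0)}")

Common issue: if you run into request rate limits, increase the interval between requests so that you do not trigger the anti-crawling mechanism.

How do you apply xhs in real projects?

Market trend analysis in practice

Data analysts can use xhs to pull content trends for a specific domain quickly:

    # Use case: analyze content trends in the beauty industry
    import time

    from xhs import SearchNoteType, SearchSortType


    def analyze_beauty_trends(client, keywords, pages=3):
        """Collect the newest notes for each keyword and summarize engagement."""
        trends_data = {}
        for keyword in keywords:
            print(f"Analyzing keyword: {keyword}")
            all_notes = []

            # Walk through result pages, newest notes first
            for page in range(1, pages + 1):
                results = client.get_note_by_keyword(
                    keyword=keyword,
                    page=page,
                    page_size=50,
                    sort=SearchSortType.LATEST,
                    note_type=SearchNoteType.ALL,
                )
                if results and "items" in results:
                    all_notes.extend(results["items"])
                time.sleep(1)  # avoid sending requests too frequently

            # Aggregate engagement figures
            total_likes = sum(note.get("likes", 0) for note in all_notes)
            avg_likes = total_likes / len(all_notes) if all_notes else 0
            trends_data[keyword] = {
                "total_notes": len(all_notes),
                "total_likes": total_likes,
                "avg_likes": avg_likes,
                "sample_notes": all_notes[:5],  # keep the first 5 as a sample
            }
        return trends_data


    # Run the analysis; xhs_client is the client created earlier
    keywords = ["口红推荐", "粉底液", "眼影教程", "护肤步骤"]
    trends = analyze_beauty_trends(xhs_client, keywords, pages=3)

    # Print the results
    for keyword, data in trends.items():
        print(f"\nKeyword: {keyword}")
        print(f"  Related notes: {data['total_notes']}")
        print(f"  Total likes: {data['total_likes']}")
        print(f"  Average likes: {data['avg_likes']:.1f}")

Result: this example collects beauty-related content with xhs and compares the popularity and engagement of different keywords, providing data to support market decisions.

Competitor account monitoring system

For operations teams, tracking what competitor accounts publish is essential:

    # Use case: monitor the content published by competitor accounts
    import time

    import schedule
    from xhs import XhsClient


    class CompetitorMonitor:
        def __init__(self, competitor_ids):
            self.competitor_ids = competitor_ids
            # sign is the signing function defined earlier
            self.xhs_client = XhsClient(cookie="your_cookie", sign=sign)
            self.history_data = {}  # user_id -> set of note ids already seen

        def monitor_competitor(self, user_id):
            """Check one competitor account for a newly published note."""
            try:
                # Fetch the user's most recent notes
                user_notes = self.xhs_client.get_user_notes(user_id, cursor="")
                if not user_notes or not user_notes.get("notes"):
                    return None

                latest_note = user_notes["notes"][0]
                note_id = latest_note.get("id")

                # Report the note only the first time it is seen
                seen = self.history_data.setdefault(user_id, set())
                if note_id in seen:
                    return None
                seen.add(note_id)

                analysis = self.analyze_note(latest_note)
                self.send_alert(user_id, latest_note, analysis)
                return analysis
            except Exception as e:
                print(f"Error while monitoring user {user_id}: {e}")
                return None

        def analyze_note(self, note):
            """Extract the fields of interest from a note."""
            return {
                "title": note.get("title", "Untitled"),
                "likes": note.get("likes", 0),
                "collected": note.get("collected", 0),
                "comments": note.get("comments", 0),
                "publish_time": note.get("time", ""),
                "content_type": "video" if note.get("type") == "video" else "image-text",
            }

        def send_alert(self, user_id, note, analysis):
            """Print a monitoring alert."""
            print(f"\n⚠️ Competitor account {user_id} published new content")
            print(f"  Title: {analysis['title']}")
            print(f"  Type: {analysis['content_type']}")
            print(f"  Likes: {analysis['likes']}")
            print(f"  Favorites: {analysis['collected']}")
            print(f"  Published: {analysis['publish_time']}")

        def start_monitoring(self, interval_minutes=30):
            """Start scheduled monitoring."""
            print(f"Monitoring {len(self.competitor_ids)} competitor accounts...")

            def monitor_all():
                for user_id in self.competitor_ids:
                    self.monitor_competitor(user_id)
                    time.sleep(2)  # avoid sending requests too frequently

            # Run once immediately, then on a schedule
            monitor_all()
            schedule.every(interval_minutes).minutes.do(monitor_all)
            while True:
                schedule.run_pending()
                time.sleep(1)


    # Usage example
    if __name__ == "__main__":
        competitor_ids = ["user_id_1", "user_id_2", "user_id_3"]
        monitor = CompetitorMonitor(competitor_ids)
        monitor.start_monitoring(interval_minutes=60)

Common issue: for long-running monitoring tasks, implement a retry mechanism for exceptions and refresh the cookie periodically to keep the session valid.

What advanced techniques improve collection efficiency?

1. Request optimization and caching strategy

For frequently accessed data, a cache can noticeably improve efficiency:

    # Use case: data collection with caching
    import time

    from xhs import XhsClient


    class CachedXhsClient:
        def __init__(self, cookie, sign_func):
            self.client = XhsClient(cookie, sign=sign_func)
            self.cache = {}
            self.cache_ttl = 3600  # cache entries for 1 hour

        def get_note_cached(self, note_id):
            """Fetch a note, serving it from the cache while the entry is fresh."""
            cache_key = f"note_{note_id}"
            if cache_key in self.cache:
                cached_data, timestamp = self.cache[cache_key]
                if time.time() - timestamp < self.cache_ttl:
                    return cached_data

            # Fetch from the API and update the cache
            note_data = self.client.get_note_by_id(note_id)
            self.cache[cache_key] = (note_data, time.time())
            return note_data

        def batch_get_notes(self, note_ids):
            """Fetch several notes, recording None for any that fail."""
            results = []
            for note_id in note_ids:
                try:
                    note_data = self.get_note_cached(note_id)
                    results.append(note_data)
                    time.sleep(0.5)  # throttle the request rate
                except Exception as e:
                    print(f"Failed to fetch note {note_id}: {e}")
                    results.append(None)
            return results

2. Error handling and retry mechanism

Stable data collection needs thorough error handling:

    # Use case: robust error handling
    import time

    from requests.exceptions import ConnectionError, Timeout
    # Class names per xhs/exception.py; confirm against your installed version
    from xhs.exception import DataFetchError, IPBlockError


    def safe_request(func, *args, max_retries=3, **kwargs):
        """Call func with retries; return None if every attempt fails."""
        for attempt in range(max_retries):
            try:
                return func(*args, **kwargs)
            except (ConnectionError, Timeout, IPBlockError) as e:
                print(f"Network error, retry {attempt + 1}: {e}")
                time.sleep(2 ** attempt)  # exponential backoff
            except DataFetchError as e:
                print(f"Xiaohongshu API error: {e}")
                break
            except Exception as e:
                print(f"Unknown error: {e}")
                time.sleep(1)
        return None


    # Usage example
    result = safe_request(
        xhs_client.get_note_by_id,
        "63db8819000000001a01ead1",
        max_retries=3,
    )

3. Data cleaning and normalization

Collected data has to be cleaned before it can be analyzed:

    # Use case: data cleaning and normalization
    from datetime import datetime


    def clean_note_data(raw_note):
        """Clean and normalize one raw note dict."""
        cleaned = {}

        # Basic information
        cleaned["id"] = raw_note.get("id", "")
        cleaned["title"] = raw_note.get("title", "").strip()
        cleaned["desc"] = raw_note.get("desc", "").strip()

        # Numeric fields
        cleaned["likes"] = int(raw_note.get("likes", 0))
        cleaned["collected"] = int(raw_note.get("collected", 0))
        cleaned["comments"] = int(raw_note.get("comments", 0))
        cleaned["shared"] = int(raw_note.get("shared", 0))

        # Timestamp handling
        timestamp = raw_note.get("time", 0)
        if timestamp:
            cleaned["publish_time"] = datetime.fromtimestamp(timestamp).strftime(
                "%Y-%m-%d %H:%M:%S"
            )
        else:
            cleaned["publish_time"] = ""

        # User information
        user_info = raw_note.get("user", {})
        cleaned["user_id"] = user_info.get("user_id", "")
        cleaned["nickname"] = user_info.get("nickname", "")
        cleaned["avatar"] = user_info.get("avatar", "")

        # Tag extraction
        cleaned["tags"] = [tag.get("name", "") for tag in raw_note.get("tag_list", [])]

        # Interaction rate: all interactions relative to likes, as a percentage
        total_interaction = cleaned["likes"] + cleaned["collected"] + cleaned["comments"]
        cleaned["interaction_rate"] = total_interaction / max(cleaned["likes"], 1) * 100
        return cleaned

4. Performance monitoring and logging

A solid monitoring setup helps you find and fix problems:

    # Use case: performance monitoring and logging
    import logging
    from datetime import datetime


    class XhsMonitor:
        def __init__(self):
            self.logger = logging.getLogger("xhs_monitor")
            self.logger.setLevel(logging.INFO)

            # Add a file handler with one log file per day
            file_handler = logging.FileHandler(
                f"xhs_monitor_{datetime.now().strftime('%Y%m%d')}.log"
            )
            formatter = logging.Formatter(
                "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
            )
            file_handler.setFormatter(formatter)
            self.logger.addHandler(file_handler)

            self.metrics = {
                "total_requests": 0,
                "successful_requests": 0,
                "failed_requests": 0,
                "avg_response_time": 0,
            }

        def record_request(self, endpoint, success=True, response_time=0):
            """Record the metrics of one request."""
            self.metrics["total_requests"] += 1
            if success:
                self.metrics["successful_requests"] += 1
                self.logger.info(
                    f"Request succeeded - endpoint: {endpoint}, "
                    f"response time: {response_time:.2f}s"
                )
            else:
                self.metrics["failed_requests"] += 1
                self.logger.error(f"Request failed - endpoint: {endpoint}")

            # Update the running average response time
            total = self.metrics["total_requests"]
            total_time = self.metrics["avg_response_time"] * (total - 1)
            self.metrics["avg_response_time"] = (total_time + response_time) / total

        def get_report(self):
            """Return a summary report of the collected metrics."""
            total = self.metrics["total_requests"]
            success_rate = (
                self.metrics["successful_requests"] / total * 100 if total > 0 else 0
            )
            return {
                "total_requests": total,
                "success_rate": f"{success_rate:.1f}%",
                "avg_response_time": f"{self.metrics['avg_response_time']:.2f}s",
                "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
            }

Key points:
- Caching reduces repeated requests
- Retrying on errors improves stability
- Data cleaning ensures quality
- Monitoring keeps the system running

How do you avoid common technical problems?

Request rate control strategy

Xiaohongshu enforces strict anti-crawling measures, so sensible request rate control is essential:

    # Use case: adaptive request rate control
    import random
    import time


    class RateLimiter:
        def __init__(self, base_delay=1.0, max_delay=5.0):
            self.base_delay = base_delay
            self.max_delay = max_delay
            self.last_request_time = 0

        def wait_if_needed(self):
            """Sleep if the previous request was less than base_delay ago."""
            current_time = time.time()
            elapsed = current_time - self.last_request_time
            if elapsed < self.base_delay:
                # Add a little random jitter, capped at max_delay
                sleep_time = self.base_delay - elapsed + random.uniform(0, 0.5)
                sleep_time = min(sleep_time, self.max_delay)
                time.sleep(sleep_time)
            self.last_request_time = time.time()

        def adaptive_delay(self, success_rate):
            """Adjust the delay according to the recent success rate."""
            if success_rate < 0.8:  # success rate below 80%: slow down
                self.base_delay = min(self.base_delay * 1.5, self.max_delay)
            elif success_rate > 0.95:  # success rate above 95%: speed up slightly
                self.base_delay = max(self.base_delay * 0.9, 0.5)

Session management and cookie refresh

Keeping the session valid is the key to long-running jobs:

    # Use case: session management and cookie refresh
    import time

    from xhs import XhsClient


    class SessionManager:
        def __init__(self, sign_func):
            self.sign_func = sign_func
            self.current_cookie = None
            self.cookie_expiry = None
            self.session_duration = 3600  # treat a session as valid for 1 hour

        def get_valid_cookie(self):
            """Return a valid cookie, refreshing it if it has expired."""
            if (
                self.current_cookie
                and self.cookie_expiry
                and time.time() < self.cookie_expiry
            ):
                return self.current_cookie

            # Obtain a new cookie
            new_cookie = self.refresh_cookie()
            if new_cookie:
                self.current_cookie = new_cookie
                self.cookie_expiry = time.time() + self.session_duration
            return self.current_cookie

        def refresh_cookie(self):
            """Refresh the cookie."""
            try:
                # Use QR-code login to obtain a new cookie
                xhs_client = XhsClient(sign=self.sign_func)
                qr_res = xhs_client.get_qrcode()
                # The user has to scan the QR code here; a real application
                # may need another way to obtain the cookie
                print("Please scan the QR code to log in...")
                # Placeholder for the cookie obtained after login
                return "new_cookie_string_here"
            except Exception as e:
                print(f"Failed to refresh cookie: {e}")
                return None

Key points:
- Control the request rate to avoid being blocked
- Manage session state sensibly
- Implement a smart retry mechanism
- Monitor the success rate and adjust the strategy dynamically

Start exploring Xiaohongshu data

xhs provides a capable and flexible toolchain for collecting Xiaohongshu data. Whether you are doing market research, analyzing competitors, or building a data-driven product, it can help you obtain the data you need efficiently.

Recommended learning resources

Official documentation:
- Basic tutorial: docs/basic.rst, a quick-start guide
- Crawling techniques: docs/crawl.rst, advanced crawling techniques
- API reference: docs/source/xhs.rst, the complete API documentation

Example code:
- Basic usage: example/basic_usage.py, a demonstration of basic features
- Signature verification: example/basic_sign_usage.py, a signature verification example
- Server deployment: example/basic_sign_server.py, a server-side deployment

Test cases: the test files under the tests directory help you understand edge cases and exception handling, and are a useful resource for learning best practices.

Best practice recommendations

- Follow the platform rules: keep the request rate reasonable and avoid putting excessive load on Xiaohongshu's servers
- Use data responsibly: collect only publicly accessible data and respect user privacy
- Handle errors thoroughly: implement complete exception handling and retry mechanisms
- Monitor performance: build a monitoring setup so that problems are found and fixed promptly
- Keep the code maintainable: use a modular design that is easy to extend and maintain

Remember that technology is only a means; the real value lies in the insight you create from the data. Within the rules, use xhs sensibly and start exploring.

- Get started now: pip install xhs
- Learn more: read the example code in the example directory
- Having problems: refer to the exception definitions in xhs/exception.py

Happy collecting, and may your data bring plenty of insight.

Author's statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.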
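The caching strategy described in the article can be tried without any network access. The sketch below is a minimal time-to-live cache, assuming the data source is passed in as a plain function. `TTLCache`, `fake_fetch`, and the `clock` parameter are hypothetical names introduced for illustration and are not part of xhs; in real use the fetch function might be something like a client's `get_note_by_id`.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Cache fetch(key) results for ttl seconds (illustrative helper, not part of xhs)."""

    def __init__(
        self,
        fetch: Callable[[str], Any],
        ttl: float = 3600.0,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[Any, float]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]  # entry is still fresh
        value = self._fetch(key)  # missing or expired: fetch again
        self._store[key] = (value, now)
        return value


# Stand-in for a real API call; records every call it receives
calls = []


def fake_fetch(note_id):
    calls.append(note_id)
    return {"id": note_id}


now = [0.0]  # fake clock, advanced manually
cache = TTLCache(fake_fetch, ttl=10.0, clock=lambda: now[0])
cache.get("a")  # first access: fetched
cache.get("a")  # within the TTL: served from the cache
now[0] = 11.0
cache.get("a")  # past the TTL: fetched again
print(len(calls))
```

Passing the clock in as a parameter is a design choice made here so that expiry can be checked deterministically; a production version would simply keep the default `time.time`.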
