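Python Coroutines in Practice: Asynchronously Crawling the Complete Novel 《鬼神传》

I. Project Background

In crawler development, fetching a large number of novel chapters synchronously is slow, because most of the time is spent waiting on network IO. This article uses Python coroutines and asynchronous IO, combining `aiohttp`, `asyncio`, and `aiofiles`, to download chapters concurrently and cut the total download time substantially.

II. Technology Choices

Asynchronous HTTP requests: aiohttp
Asynchronous file writes: aiofiles
HTML parsing: lxml
Coroutine scheduling: asyncio
Synchronous page request (table of contents): requests

III. Complete Code

```python
import requests
from lxml import etree
import time
import asyncio
import aiohttp
import aiofiles
import os

BASE_URL = "https://www.zanghaihua.org"


# Fetch the links and titles of all chapters (synchronous)
def get_every_chapter_url(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    }
    resp = requests.get(url, headers=headers)
    resp.encoding = "utf-8"
    tree = etree.HTML(resp.text)
    a_list = tree.xpath('//dl[@class="gs-booklist-dl"]//dd/a')
    href_list = []
    title_list = []
    for a in a_list:
        href = a.xpath("./@href")[0]
        title = a.xpath("./text()")[0]
        full_url = BASE_URL + href  # chapter links on the page are site-relative
        href_list.append(full_url)
        title_list.append(title)
    print(f"Fetched {len(href_list)} chapter links")
    return href_list, title_list


# Download a single chapter
async def download_one(session, url, title):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept-Encoding": "gzip, deflate",
    }
    try:
        # Total timeout of 20 seconds per request
        timeout = aiohttp.ClientTimeout(total=20)
        async with session.get(url, headers=headers, timeout=timeout) as resp:
            page_text = await resp.text(encoding="utf-8", errors="ignore")
        tree = etree.HTML(page_text)
        content_list = tree.xpath('//div[@class="gs-article-text"]//p//text()')
        content = "\n".join(text.strip() for text in content_list if text.strip())
        # Create the output folder if it does not exist yet
        os.makedirs("./鬼神传", exist_ok=True)
        async with aiofiles.open(f"./鬼神传/{title}.txt", "w", encoding="utf-8") as f:
            await f.write(title + "\n\n" + content)
        print(f"Saved: {title}")
    except Exception as e:
        # A failure in one chapter must not stop the others
        print(f"Download failed: {title}: {e}")


# Download all chapters concurrently
async def download(href_list, title_list):
    async with aiohttp.ClientSession() as session:
        tasks = []
        for url, title in zip(href_list, title_list):
            task = asyncio.create_task(download_one(session, url, title))
            tasks.append(task)
        await asyncio.gather(*tasks)


# Entry point
def main():
    start = time.time()
    book_url = "https://www.zanghaihua.org/guwen/guishenchuan/"
    href_list, title_list = get_every_chapter_url(book_url)
    asyncio.run(download(href_list, title_list))
    end = time.time()
    print(f"《鬼神传》 download complete, total time: {end - start:.2f} s")


if __name__ == "__main__":
    main()
```

IV. Code Walkthrough

1. Fetching the table of contents (synchronous)

```python
def get_every_chapter_url(url):
    # Request headers that mimic a browser
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    }
    resp = requests.get(url, headers=headers)
    resp.encoding = "utf-8"
    tree = etree.HTML(resp.text)
    # XPath extracts every chapter link and title
    a_list = tree.xpath('//dl[@class="gs-booklist-dl"]//dd/a')
```

The table of contents is fetched synchronously with `requests`. It takes only one request, so doing it synchronously has no meaningful performance cost.
XPath locates each chapter's `a` tag, from which the href and title are extracted.

2. Downloading a single chapter asynchronously

```python
async def download_one(session, url, title):
    # Excerpt: headers and timeout are defined as in the complete code
    async with session.get(url, headers=headers, timeout=timeout) as resp:
        page_text = await resp.text(encoding="utf-8", errors="ignore")
    tree = etree.HTML(page_text)
    content_list = tree.xpath('//div[@class="gs-article-text"]//p//text()')
```

`async`/`await` makes the request non-blocking: while one chapter waits on the network, the event loop runs other downloads.
`session.get` reuses connections from the shared session, which improves efficiency.
The `try`/`except` in the complete code ensures that one failed chapter does not affect the rest.

3. Downloading in bulk

```python
async def download(href_list, title_list):
    async with aiohttp.ClientSession() as session:
        tasks = [
            asyncio.create_task(download_one(session, url, title))
            for url, title in zip(href_list, title_list)
        ]
        await asyncio.gather(*tasks)
```

A list of tasks is created and `gather` runs them concurrently.
Many chapters are downloaded at the same time, which can be roughly 10 to 50 times faster than sequential requests, depending on chapter count and network latency.

4. The entry point

```python
def main():
    # Excerpt: book_url is defined as in the complete code
    start = time.time()
    href_list, title_list = get_every_chapter_url(book_url)
    asyncio.run(download(href_list, title_list))
    end = time.time()
    print(f"《鬼神传》 download complete, total time: {end - start:.2f} s")
```

The timer gives a direct measure of how long the asynchronous crawl takes.

V. Results

A few dozen chapters finish in under 10 seconds, whereas a synchronous crawl of the same chapters takes more than 30 seconds.

VI. Key Optimizations

1. Connection reuse: `aiohttp.ClientSession` reduces TCP handshake overhead.
2. Asynchronous file writes: `aiofiles` keeps disk IO from blocking the event loop.
3. Exception handling: a failed chapter does not interrupt the overall job.
4. Encoding tolerance: `errors="ignore"` prevents crashes on bytes that cannot be decoded.
5. Automatic directory creation: the `./鬼神传` folder is created if it does not exist.

VII. Notes

1. Respect the site's robots.txt and do not use this for commercial purposes.
2. Choose a reasonable concurrency level so the server is not put under pressure.
3. Delays and proxy IPs can be added to reduce risk further.
4. This code is intended only for learning Python coroutines and crawling techniques.

VIII. Summary

An asynchronous crawler built on `asyncio`, `aiohttp`, and `aiofiles` removes the IO blocking that slows down a synchronous crawler, and it is very efficient for bulk downloads of novels, images, and web pages. With this technique, large-volume, high-concurrency data collection becomes manageable.

This is an original technical article; reproduction is prohibited.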
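
The note about choosing a reasonable concurrency level can be illustrated with a semaphore. The article's `download` function starts every chapter at once, so the sketch below shows one possible way to cap the number of coroutines running at the same time. The names `MAX_CONCURRENCY`, `run_limited`, `fake_download`, and `demo` are hypothetical and not part of the original code, and the limit of 5 is an arbitrary assumption. The download is simulated with `asyncio.sleep` so the example runs without network access.

```python
import asyncio

MAX_CONCURRENCY = 5  # hypothetical limit; tune it for the target site


async def run_limited(jobs, limit=MAX_CONCURRENCY):
    """Run zero-argument coroutine functions with at most `limit` active at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job):
        # Each job must acquire the semaphore before it starts
        async with semaphore:
            return await job()

    # gather returns results in the same order as the input jobs
    return await asyncio.gather(*(guarded(job) for job in jobs))


async def demo():
    active = 0  # jobs currently running
    peak = 0    # highest number of jobs observed running together

    async def fake_download(i):
        # Stand-in for download_one: sleeps instead of making a request
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1
        return i

    jobs = [lambda i=i: fake_download(i) for i in range(20)]
    results = await run_limited(jobs, limit=5)
    return results, peak


if __name__ == "__main__":
    results, peak = asyncio.run(demo())
    print(len(results), "jobs finished; peak concurrency:", peak)
```

To apply the same idea to the crawler, each job could be something like `lambda u=url, t=title: download_one(session, u, t)`. Separately, `aiohttp.TCPConnector(limit=...)` caps the number of simultaneous connections at the session level; passing a smaller limit there is another option.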