In this article, you will learn:
- The main reasons why web scraping can be slow
- Various techniques to speed up web scraping
- How to optimize a sample Python scraping script for faster data retrieval
Let's dive in!
Why Your Scraping Process Is Slow
Let's look at some of the key reasons that slow down your web scraping process.
Reason #1: Slow Server
One of the most significant factors affecting scraping speed is server response time. When you send a request to a website, the server has to process it and respond. If the server responds slowly, your requests take longer to complete. A slow server can be caused by heavy traffic, limited resources, or network latency.
Unfortunately, there is little you can do to speed up the target server, as that is beyond your control, unless the slowdown comes from you overwhelming it with too many requests. In that case, you can add random delays between your requests to spread them over a longer time window, as in the sketch below.
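As a minimal, hedged sketch of that idea (the 1 to 3 second range is an arbitrary assumption; tune it to the target site):
import random
import time

import requests

urls = ["http://quotes.toscrape.com/", "https://quotes.toscrape.com/page/2/"]

for url in urls:
    response = requests.get(url)
    # process the response here...

    # pause for a random 1 to 3 seconds before the next request
    # (the range is an arbitrary assumption; adjust it to the target site)
    time.sleep(random.uniform(1, 3))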
Reason #2: Slow CPU Processing
The processing speed of your CPU is critical to how fast your scraping script can run. When the script runs sequentially, the CPU has to handle each operation one at a time, which takes longer. This is especially noticeable when the script involves complex computations or data transformations.
In addition, parsing HTML takes time and can noticeably slow down your scraping process. To learn more, read our article on HTML web scraping.
Reason #3: Limited I/O Operations
Input/output (I/O) operations can easily become the bottleneck of a scraper, especially when the target site spans multiple pages. If your script has to wait for every external resource to respond before moving on, the delays add up considerably.
Sending a request, waiting for the server's response, processing it, and only then moving on to the next request is not an efficient scraping pattern.
Other Reasons
Other causes of a slow scraping script include:
- Inefficient code: Poorly written scraping logic slows down the whole process. Avoid inefficient data structures, unnecessary loops, and excessive logging.
- Rate limiting: If the target site restricts the number of requests allowed in a given time window, your automated scraper will be slowed down accordingly. The solution? Proxy services (see the sketch right after this list).
- CAPTCHAs and other anti-scraping measures: CAPTCHAs and anti-bot measures require human intervention and disrupt your scraping process. You can read about other anti-scraping techniques here.
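As a minimal sketch of the proxy idea mentioned above (the proxy address and credentials below are placeholders, not a real endpoint):
import requests

# placeholder proxy endpoint: replace it with your provider's address and credentials
proxy_url = "http://username:password@proxy.example.com:8080"
proxies = {"http": proxy_url, "https": proxy_url}

# route the request through the proxy instead of your own IP
response = requests.get("http://quotes.toscrape.com/", proxies=proxies)
print(response.status_code)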
Techniques to Speed Up Web Scraping
This section presents several common approaches to speeding up the web scraping process. We will start from a basic Python scraping script and show the effect of applying different optimizations to it.
Note: The approaches explored here apply to any programming language or technology. Python is used simply because it is straightforward and one of the best programming languages for web scraping.
Here is the initial version of the Python scraping script:
import requests
from bs4 import BeautifulSoup
import csv
import time

def scrape_quotes_to_scrape():
    # array with the URLs of the pages to scrape
    urls = [
        "http://quotes.toscrape.com/",
        "https://quotes.toscrape.com/page/2/",
        "https://quotes.toscrape.com/page/3/",
        "https://quotes.toscrape.com/page/4/",
        "https://quotes.toscrape.com/page/5/",
        "https://quotes.toscrape.com/page/6/",
        "https://quotes.toscrape.com/page/7/",
        "https://quotes.toscrape.com/page/8/",
        "https://quotes.toscrape.com/page/9/",
        "https://quotes.toscrape.com/page/10/"
    ]

    # where to store the scraped data
    quotes = []

    # scrape the pages sequentially
    for url in urls:
        print(f"Scraping page: '{url}'")

        # send a GET request to get the page HTML
        response = requests.get(url)

        # parse the page HTML using BeautifulSoup
        soup = BeautifulSoup(response.content, "html.parser")

        # select all quote elements on the page
        quote_html_elements = soup.select(".quote")

        # iterate over the quote elements and scrape their content
        for quote_html_element in quote_html_elements:
            # extract the text of the quote
            text = quote_html_element.select_one(".text").get_text()
            # extract the author of the quote
            author = quote_html_element.select_one(".author").get_text()
            # extract tags associated with the quote
            tags = [tag.get_text() for tag in quote_html_element.select(".tag")]

            # populate a new quote object and add it to the list
            quote = {
                "text": text,
                "author": author,
                "tags": ", ".join(tags)
            }
            quotes.append(quote)

        print(f"Page '{url}' scraped successfully\n")

    print("Exporting scraped data to CSV")

    # export the scraped quotes to a CSV file
    with open("quotes.csv", "w", newline="", encoding="utf-8") as csvfile:
        fieldnames = ["text", "author", "tags"]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(quotes)

    print("Quotes exported to CSV\n")

# measure execution time
start_time = time.time()
scrape_quotes_to_scrape()
end_time = time.time()
execution_time = end_time - start_time
print(f"Execution time: {execution_time:.2f} seconds")
The scraping script above targets 10 pagination URLs on the Quotes to Scrape website. For each page, the script:
- Sends a GET request with requests to retrieve the page HTML.
- Parses the HTML content with BeautifulSoup.
- Extracts the text, author, and tags of every quote element on the page.
- Stores the scraped data as a list of dictionaries.
Finally, it exports all the scraped data to a file named quotes.csv.
To run the script, first install the required libraries:
pip install requests beautifulsoup4
The call to the scrape_quotes_to_scrape() function is wrapped with time.time() to measure the execution time of the entire operation. On our machine, the initial script took around 4.61 seconds to complete.
After running the script, a quotes.csv file will appear in your project folder, and the output log will look something like this:
Scraping page: 'http://quotes.toscrape.com/'
Page 'http://quotes.toscrape.com/' scraped successfully
Scraping page: 'https://quotes.toscrape.com/page/2/'
Page 'https://quotes.toscrape.com/page/2/' scraped successfully
Scraping page: 'https://quotes.toscrape.com/page/3/'
Page 'https://quotes.toscrape.com/page/3/' scraped successfully
Scraping page: 'https://quotes.toscrape.com/page/4/'
Page 'https://quotes.toscrape.com/page/4/' scraped successfully
Scraping page: 'https://quotes.toscrape.com/page/5/'
Page 'https://quotes.toscrape.com/page/5/' scraped successfully
Scraping page: 'https://quotes.toscrape.com/page/6/'
Page 'https://quotes.toscrape.com/page/6/' scraped successfully
Scraping page: 'https://quotes.toscrape.com/page/7/'
Page 'https://quotes.toscrape.com/page/7/' scraped successfully
Scraping page: 'https://quotes.toscrape.com/page/8/'
Page 'https://quotes.toscrape.com/page/8/' scraped successfully
Scraping page: 'https://quotes.toscrape.com/page/9/'
Page 'https://quotes.toscrape.com/page/9/' scraped successfully
Scraping page: 'https://quotes.toscrape.com/page/10/'
Page 'https://quotes.toscrape.com/page/10/' scraped successfully
Exporting scraped data to CSV
Quotes exported to CSV
Execution time: 4.63 seconds
As you can see, the script scraped each pagination page of the Quotes to Scrape site sequentially, one after the other. As you are about to see, a few optimizations can dramatically change how fast this process runs.
Now, let's see how to speed up the scraper!
1. Use a Faster HTML Parsing Library
Data parsing takes time and resources, and different HTML parsers approach the job differently. Some focus on rich features and self-descriptive APIs, while others prioritize performance. For details, see our guide to the best HTML parsers.
In Python, Beautiful Soup is the most popular HTML parsing library, but it is not necessarily the fastest. Check out some benchmarks to learn more.
In fact, Beautiful Soup is just a wrapper around different underlying parsers, which you choose via the second argument of its constructor:
soup = BeautifulSoup(response.content, "html.parser")
By default, Beautiful Soup is typically used with html.parser, the parser built into the Python standard library. However, if you are after performance, consider lxml: thanks to its C-based implementation, it is one of the fastest HTML parsers available in Python.
To install lxml, run:
pip install lxml
Once installed, use it in Beautiful Soup like this:
soup = BeautifulSoup(response.content, "lxml")
Now, run your Python scraping script again. This time the output will be something like:
# omitted for brevity...
Execution time: 4.35 seconds
The execution time dropped from 4.61 seconds to 4.35 seconds. That change may not look like much, but how much this optimization buys you in practice depends heavily on the size and complexity of the HTML pages you parse and on how you select elements.
In this example, the target site has simple, shallow pages. Even so, a roughly 6% speed improvement from a one-line change is nothing to dismiss!
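If you want to verify the gain on your own target pages, a rough, hedged micro-benchmark might look like this (the repetition count of 200 is arbitrary, and absolute timings depend on your machine):
import time

import requests
from bs4 import BeautifulSoup

html = requests.get("http://quotes.toscrape.com/").content

for parser in ["html.parser", "lxml"]:
    start = time.time()
    # parse the same document repeatedly so the difference becomes measurable
    for _ in range(200):
        BeautifulSoup(html, parser)
    print(f"{parser}: {time.time() - start:.2f} seconds")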
👍 Pros:
- Easy to enable in Beautiful Soup
👎 Cons:
- Relatively small improvement
- Noticeable mainly on pages with a complex DOM structure
- Faster HTML parsing libraries may come with more complex APIs
2. Implement Multiprocessing Scraping
Multiprocessing is a form of parallel execution in which a program spawns multiple processes, each of which can run independently on a CPU core. Different tasks are handled at the same time instead of one after the other.
This approach is particularly beneficial for I/O-bound operations such as web scraping, where the main bottleneck is usually the time spent waiting for web servers to respond. With multiprocessing, you can request several pages at once, reducing the overall scraping time.
Turning your script into a multiprocessing one requires some significant changes to the execution logic. The steps below show how to convert the Python scraping script from sequential to multiprocessing.
First, import Pool and cpu_count from Python's multiprocessing module:
from multiprocessing import Pool, cpu_count
Pool manages a pool of worker processes, while cpu_count tells you how many CPU cores are available for parallel processing.
Next, encapsulate the logic for scraping a single URL in a dedicated function:
def scrape_page(url):
    print(f"Scraping page: '{url}'")

    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    quote_html_elements = soup.select(".quote")
    quotes = []

    for quote_html_element in quote_html_elements:
        text = quote_html_element.select_one(".text").get_text()
        author = quote_html_element.select_one(".author").get_text()
        tags = [tag.get_text() for tag in quote_html_element.select(".tag")]
        quotes.append({
            "text": text,
            "author": author,
            "tags": ", ".join(tags)
        })

    print(f"Page '{url}' scraped successfully\n")

    return quotes
The function above will be called in separate processes, each running sequentially on its own core.
Then, replace the sequential scraping loop with the multiprocessing logic:
def scrape_quotes():
    urls = [
        "http://quotes.toscrape.com/",
        "https://quotes.toscrape.com/page/2/",
        "https://quotes.toscrape.com/page/3/",
        "https://quotes.toscrape.com/page/4/",
        "https://quotes.toscrape.com/page/5/",
        "https://quotes.toscrape.com/page/6/",
        "https://quotes.toscrape.com/page/7/",
        "https://quotes.toscrape.com/page/8/",
        "https://quotes.toscrape.com/page/9/",
        "https://quotes.toscrape.com/page/10/"
    ]

    # create a pool of processes
    with Pool(processes=cpu_count()) as pool:
        results = pool.map(scrape_page, urls)

    # flatten the results list
    quotes = [quote for sublist in results for quote in sublist]

    print("Exporting scraped data to CSV")

    with open("quotes_multiprocessing.csv", "w", newline="", encoding="utf-8") as csvfile:
        fieldnames = ["text", "author", "tags"]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(quotes)

    print("Quotes exported to CSV\n")
Finally, measure the script's execution time with time.time():
if __name__ == "__main__":
    start_time = time.time()

    scrape_quotes()

    end_time = time.time()
    execution_time = end_time - start_time
    print(f"Execution time: {execution_time:.2f} seconds")
Note that the if __name__ == "__main__": guard is required to prevent the code from running when the module is imported, which could otherwise cause processes to be spawned repeatedly or other unexpected behavior, especially on Windows.
Putting it all together:
from multiprocessing import Pool, cpu_count
import requests
from bs4 import BeautifulSoup
import csv
import time

def scrape_page(url):
    print(f"Scraping page: '{url}'")

    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    quote_html_elements = soup.select(".quote")
    quotes = []

    for quote_html_element in quote_html_elements:
        text = quote_html_element.select_one(".text").get_text()
        author = quote_html_element.select_one(".author").get_text()
        tags = [tag.get_text() for tag in quote_html_element.select(".tag")]
        quotes.append({
            "text": text,
            "author": author,
            "tags": ", ".join(tags)
        })

    print(f"Page '{url}' scraped successfully\n")

    return quotes

def scrape_quotes():
    urls = [
        "http://quotes.toscrape.com/",
        "https://quotes.toscrape.com/page/2/",
        "https://quotes.toscrape.com/page/3/",
        "https://quotes.toscrape.com/page/4/",
        "https://quotes.toscrape.com/page/5/",
        "https://quotes.toscrape.com/page/6/",
        "https://quotes.toscrape.com/page/7/",
        "https://quotes.toscrape.com/page/8/",
        "https://quotes.toscrape.com/page/9/",
        "https://quotes.toscrape.com/page/10/"
    ]

    # create a pool of processes
    with Pool(processes=cpu_count()) as pool:
        results = pool.map(scrape_page, urls)

    # flatten the results list
    quotes = [quote for sublist in results for quote in sublist]

    print("Exporting scraped data to CSV")

    with open("quotes_multiprocessing.csv", "w", newline="", encoding="utf-8") as csvfile:
        fieldnames = ["text", "author", "tags"]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(quotes)

    print("Quotes exported to CSV\n")

if __name__ == "__main__":
    start_time = time.time()

    scrape_quotes()

    end_time = time.time()
    execution_time = end_time - start_time
    print(f"Execution time: {execution_time:.2f} seconds")
Run the script again. This time it will print something like:
Scraping page: 'http://quotes.toscrape.com/'
Scraping page: 'https://quotes.toscrape.com/page/2/'
Scraping page: 'https://quotes.toscrape.com/page/3/'
Scraping page: 'https://quotes.toscrape.com/page/4/'
Scraping page: 'https://quotes.toscrape.com/page/5/'
Scraping page: 'https://quotes.toscrape.com/page/6/'
Scraping page: 'https://quotes.toscrape.com/page/7/'
Scraping page: 'https://quotes.toscrape.com/page/8/'
Page 'http://quotes.toscrape.com/' scraped successfully
Scraping page: 'https://quotes.toscrape.com/page/9/'
Page 'https://quotes.toscrape.com/page/3/' scraped successfully
Scraping page: 'https://quotes.toscrape.com/page/10/'
Page 'https://quotes.toscrape.com/page/4/' scraped successfully
Page 'https://quotes.toscrape.com/page/5/' scraped successfully
Page 'https://quotes.toscrape.com/page/6/' scraped successfully
Page 'https://quotes.toscrape.com/page/7/' scraped successfully
Page 'https://quotes.toscrape.com/page/2/' scraped successfully
Page 'https://quotes.toscrape.com/page/8/' scraped successfully
Page 'https://quotes.toscrape.com/page/9/' scraped successfully
Page 'https://quotes.toscrape.com/page/10/' scraped successfully
Exporting scraped data to CSV
Quotes exported to CSV
Execution time: 1.87 seconds
可以看到,页面的爬取顺序不再是严格按顺序执行。此时脚本会同时爬取多个页面,最大并行数量取决于 CPU 核心(这里是 8 个)。
并行处理使脚本的执行时间从大约 4.61 秒降至 1.87 秒,提升率接近 145%,效果非常显著!
👍 Pros:
- Significantly faster scraping execution
- Multiprocessing is natively supported by most programming languages
👎 Cons:
- Limited by the number of CPU cores on your machine
- Pages are no longer scraped in the order of the original list
- Requires substantial changes to the code
3. Implement Multithreading Scraping
Multithreading is a concurrency technique that runs multiple threads within a single process, letting your script perform several tasks at once, each handled by its own thread.
It is similar to multiprocessing, but multithreading does not require multiple CPU cores: several threads can run on a single core and share the same memory space. In Python, threads still speed up scraping because the interpreter releases the GIL while waiting on network I/O. See our article on concurrency vs. parallelism for a deeper look at these concepts.
To convert the script from sequential to multithreaded, the overall approach is similar to multiprocessing.
In the example below, we will use the ThreadPoolExecutor provided by Python's concurrent.futures module, which you can import as follows:
from concurrent.futures import ThreadPoolExecutor
ThreadPoolExecutor provides a high-level interface for managing a pool of threads and running them concurrently.
As before, wrap the single-URL scraping logic in a dedicated function and then call it concurrently through the ThreadPoolExecutor:
quotes = []

# create a thread pool with up to 10 workers
with ThreadPoolExecutor(max_workers=10) as executor:
    # use map to apply the scrape_page function to each URL
    results = executor.map(scrape_page, urls)

    # combine the results from all threads
    for result in results:
        quotes.extend(result)
If max_workers is not specified or is None, it defaults to min(32, os.cpu_count() + 4) on recent Python versions (Python 3.5 through 3.7 used five times the number of processors). Since there are only 10 pages here, 10 workers are plenty. Keep in mind that starting too many threads can bog down your system and even hurt performance.
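As a rough, hedged sketch of one heuristic for picking the worker count (it mirrors the executor's own default ceiling and assumes the urls list and scrape_page function defined above):
import os
from concurrent.futures import ThreadPoolExecutor

# never start more workers than there are URLs, and cap the total;
# the min(32, cpu_count + 4) ceiling mirrors the executor's own default
max_workers = min(32, (os.cpu_count() or 1) + 4, len(urls))

with ThreadPoolExecutor(max_workers=max_workers) as executor:
    results = executor.map(scrape_page, urls)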
Here is the complete scraping script:
from concurrent.futures import ThreadPoolExecutor
import requests
from bs4 import BeautifulSoup
import csv
import time

def scrape_page(url):
    print(f"Scraping page: '{url}'")

    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    quote_html_elements = soup.select(".quote")
    quotes = []

    for quote_html_element in quote_html_elements:
        text = quote_html_element.select_one(".text").get_text()
        author = quote_html_element.select_one(".author").get_text()
        tags = [tag.get_text() for tag in quote_html_element.select(".tag")]
        quotes.append({
            "text": text,
            "author": author,
            "tags": ", ".join(tags)
        })

    print(f"Page '{url}' scraped successfully\n")

    return quotes

def scrape_quotes():
    urls = [
        "http://quotes.toscrape.com/",
        "https://quotes.toscrape.com/page/2/",
        "https://quotes.toscrape.com/page/3/",
        "https://quotes.toscrape.com/page/4/",
        "https://quotes.toscrape.com/page/5/",
        "https://quotes.toscrape.com/page/6/",
        "https://quotes.toscrape.com/page/7/",
        "https://quotes.toscrape.com/page/8/",
        "https://quotes.toscrape.com/page/9/",
        "https://quotes.toscrape.com/page/10/"
    ]

    # where to store the scraped data
    quotes = []

    # create a thread pool with up to 10 workers
    with ThreadPoolExecutor(max_workers=10) as executor:
        # use map to apply the scrape_page function to each URL
        results = executor.map(scrape_page, urls)

        # combine the results from all threads
        for result in results:
            quotes.extend(result)

    print("Exporting scraped data to CSV")

    with open("quotes_multiprocessing.csv", "w", newline="", encoding="utf-8") as csvfile:
        fieldnames = ["text", "author", "tags"]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(quotes)

    print("Quotes exported to CSV\n")

if __name__ == "__main__":
    start_time = time.time()

    scrape_quotes()

    end_time = time.time()
    execution_time = end_time - start_time
    print(f"Execution time: {execution_time:.2f} seconds")
After running it, you will see logs like these:
Scraping page: 'http://quotes.toscrape.com/'
Scraping page: 'https://quotes.toscrape.com/page/2/'
Scraping page: 'https://quotes.toscrape.com/page/3/'
Scraping page: 'https://quotes.toscrape.com/page/4/'
Scraping page: 'https://quotes.toscrape.com/page/5/'
Scraping page: 'https://quotes.toscrape.com/page/6/'
Scraping page: 'https://quotes.toscrape.com/page/7/'
Scraping page: 'https://quotes.toscrape.com/page/8/'
Scraping page: 'https://quotes.toscrape.com/page/9/'
Scraping page: 'https://quotes.toscrape.com/page/10/'
Page 'http://quotes.toscrape.com/' scraped successfully
Page 'https://quotes.toscrape.com/page/6/' scraped successfully
Page 'https://quotes.toscrape.com/page/7/' scraped successfully
Page 'https://quotes.toscrape.com/page/10/' scraped successfully
Page 'https://quotes.toscrape.com/page/8/' scraped successfully
Page 'https://quotes.toscrape.com/page/5/' scraped successfully
Page 'https://quotes.toscrape.com/page/9/' scraped successfully
Page 'https://quotes.toscrape.com/page/4/' scraped successfully
Page 'https://quotes.toscrape.com/page/3/' scraped successfully
Page 'https://quotes.toscrape.com/page/2/' scraped successfully
Exporting scraped data to CSV
Quotes exported to CSV
Execution time: 0.52 seconds
As with multiprocessing, the pages are no longer processed sequentially. This time, performance is even better than with multiprocessing, because 10 threads were started on our machine, more than the number of CPU cores (8).
The run time dropped from 4.61 seconds to 0.52 seconds, making the script almost 9x faster. Impressive!
👍 Pros:
- Major speed-up of the scraping process
- Multithreading is natively supported by most programming languages
👎 Cons:
- Finding the right number of threads is not trivial
- Pages are no longer scraped in the order of the original list
- Requires substantial changes to the code
4. Use Async/Await for Scraping
Asynchronous programming is a modern paradigm that lets you write non-blocking code, handling concurrent operations without explicitly managing threads or processes.
In the traditional synchronous model, each operation runs only after the previous one finishes, which is inefficient for I/O-bound work such as web scraping. With asynchronous programming, you can start several I/O operations at once and let the script keep doing useful work while waiting for them to complete.
In Python, asynchronous scraping typically relies on asyncio from the standard library, which achieves single-threaded concurrency through coroutines and the async and await keywords.
However, popular HTTP libraries such as requests do not support async natively. Instead, you need AIOHTTP, an HTTP client designed to work with asyncio. This combination lets you send multiple HTTP requests at the same time without blocking the script.
First, install AIOHTTP with:
pip install aiohttp
Then, import asyncio and aiohttp:
import asyncio
import aiohttp
As in the previous approaches, wrap the logic for scraping a single URL in an asynchronous function:
async def scrape_url(session, url):
    async with session.get(url) as response:
        print(f"Scraping page: '{url}'")
        html_content = await response.text()
        soup = BeautifulSoup(html_content, "html.parser")
        # scraping logic...
Note the use of await to retrieve the HTML content of the page.
To run this function concurrently, create an AIOHTTP session and use asyncio.gather to execute multiple scraping tasks at the same time:
# executing the scraping tasks concurrently
async with aiohttp.ClientSession() as session:
    tasks = [scrape_url(session, url) for url in urls]
    results = await asyncio.gather(*tasks)

# flatten the results list
quotes = [quote for sublist in results for quote in sublist]
Finally, launch your asynchronous main scraping function with asyncio.run():
if __name__ == "__main__":
    start_time = time.time()

    asyncio.run(scrape_quotes())

    end_time = time.time()
    execution_time = end_time - start_time
    print(f"Execution time: {execution_time:.2f} seconds")
The complete asynchronous Python scraping script looks like this:
import asyncio
import aiohttp
from bs4 import BeautifulSoup
import csv
import time

async def scrape_url(session, url):
    async with session.get(url) as response:
        print(f"Scraping page: '{url}'")

        html_content = await response.text()
        soup = BeautifulSoup(html_content, "html.parser")

        quote_html_elements = soup.select(".quote")
        quotes = []

        for quote_html_element in quote_html_elements:
            text = quote_html_element.select_one(".text").get_text()
            author = quote_html_element.select_one(".author").get_text()
            tags = [tag.get_text() for tag in quote_html_element.select(".tag")]
            quotes.append({
                "text": text,
                "author": author,
                "tags": ", ".join(tags)
            })

        print(f"Page '{url}' scraped successfully\n")

        return quotes

async def scrape_quotes():
    urls = [
        "http://quotes.toscrape.com/",
        "https://quotes.toscrape.com/page/2/",
        "https://quotes.toscrape.com/page/3/",
        "https://quotes.toscrape.com/page/4/",
        "https://quotes.toscrape.com/page/5/",
        "https://quotes.toscrape.com/page/6/",
        "https://quotes.toscrape.com/page/7/",
        "https://quotes.toscrape.com/page/8/",
        "https://quotes.toscrape.com/page/9/",
        "https://quotes.toscrape.com/page/10/"
    ]

    # executing the scraping tasks concurrently
    async with aiohttp.ClientSession() as session:
        tasks = [scrape_url(session, url) for url in urls]
        results = await asyncio.gather(*tasks)

    # flatten the results list
    quotes = [quote for sublist in results for quote in sublist]

    print("Exporting scraped data to CSV")

    with open("quotes_multiprocessing.csv", "w", newline="", encoding="utf-8") as csvfile:
        fieldnames = ["text", "author", "tags"]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(quotes)

    print("Quotes exported to CSV\n")

if __name__ == "__main__":
    start_time = time.time()

    asyncio.run(scrape_quotes())

    end_time = time.time()
    execution_time = end_time - start_time
    print(f"Execution time: {execution_time:.2f} seconds")
After execution, the log will look something like:
Scraping page: 'http://quotes.toscrape.com/'
Page 'http://quotes.toscrape.com/' scraped successfully
Scraping page: 'https://quotes.toscrape.com/page/3/'
Scraping page: 'https://quotes.toscrape.com/page/7/'
Scraping page: 'https://quotes.toscrape.com/page/9/'
Scraping page: 'https://quotes.toscrape.com/page/6/'
Scraping page: 'https://quotes.toscrape.com/page/8/'
Scraping page: 'https://quotes.toscrape.com/page/10/'
Page 'https://quotes.toscrape.com/page/3/' scraped successfully
Scraping page: 'https://quotes.toscrape.com/page/5/'
Scraping page: 'https://quotes.toscrape.com/page/4/'
Page 'https://quotes.toscrape.com/page/7/' scraped successfully
Page 'https://quotes.toscrape.com/page/9/' scraped successfully
Page 'https://quotes.toscrape.com/page/6/' scraped successfully
Scraping page: 'https://quotes.toscrape.com/page/2/'
Page 'https://quotes.toscrape.com/page/10/' scraped successfully
Page 'https://quotes.toscrape.com/page/5/' scraped successfully
Page 'https://quotes.toscrape.com/page/4/' scraped successfully
Page 'https://quotes.toscrape.com/page/8/' scraped successfully
Page 'https://quotes.toscrape.com/page/2/' scraped successfully
Exporting scraped data to CSV
Quotes exported to CSV
Execution time: 0.51 seconds
As you can see, the execution speed is comparable to multithreading, with the advantage that you do not have to manage threads yourself.
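One thing the script above does not address is capping concurrency when you have far more than 10 URLs. A minimal, hedged sketch using asyncio.Semaphore, reusing the scrape_url coroutine defined above (the limit of 10 is an arbitrary assumption):
import asyncio

async def scrape_url_limited(semaphore, session, url):
    # wait for a free slot before sending the request; the slot is freed automatically afterwards
    async with semaphore:
        return await scrape_url(session, url)

# inside scrape_quotes(), before building the task list:
# semaphore = asyncio.Semaphore(10)  # at most 10 requests in flight (arbitrary limit)
# tasks = [scrape_url_limited(semaphore, session, url) for url in urls]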
👍 Pros:
- Large speed gains
- Async logic is the widely recommended approach in modern programming
- No need to manage threads or processes manually
👎 Cons:
- Steeper learning curve
- Pages are no longer scraped in the order of the original list
- Requires dedicated async libraries
5. Other Techniques and Approaches to Speed Up Scraping
Other ways to speed up scraping include:
- Optimizing the request rate: Tune the interval between requests to find the best balance between speed and avoiding rate limits or bans.
- Using rotating proxies: Rotating proxies give you multiple IP addresses, lowering the risk of bans and enabling faster scraping (see the sketch after this list). Learn more in our guide to the best rotating proxies.
- Distributed parallel scraping: Spread scraping tasks across multiple servers or cloud machines.
- Minimizing JavaScript rendering: Avoid launching headless browsers whenever possible and prefer HTTP clients plus HTML parsing libraries. Browsers are resource-hungry and usually much slower than parsing HTML directly.
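As a hedged sketch of the rotating-proxy idea from the list above (the proxy addresses are placeholders; real providers usually expose a single rotating endpoint instead):
import itertools

import requests

# placeholder proxy endpoints: replace them with real ones from your provider
proxy_pool = itertools.cycle([
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
    "http://proxy3.example.com:8000",
])

def get_with_rotating_proxy(url):
    # pick the next proxy in the cycle for every request
    proxy = next(proxy_pool)
    return requests.get(url, proxies={"http": proxy, "https": proxy})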
Conclusion
This guide showed several ways to speed up web scraping. We analyzed the main reasons scraping scripts become slow and explored various optimizations on a sample Python script. With only minor changes to the scraping logic, execution time improved by as much as roughly 8x.
Of course, hand-optimizing your scraping logic matters, but choosing the right tools is just as important. Things get more complicated when the target site relies heavily on dynamic rendering and requires browser automation tools, since browsers tend to be slow and resource-intensive.
To tackle these challenges, try Scraping Browser, a fully cloud-hosted scraping solution that integrates seamlessly with popular browser automation tools such as Puppeteer, Selenium, and Playwright. It comes with automatic CAPTCHA solving and is backed by a proxy network of over 72+ million residential IPs, scaling effortlessly to meet any scraping need.
Sign up now and start your free trial.