How to Speed Up Web Scraping: The Complete Guide

Optimize your web scraping process with advanced techniques for faster data retrieval.
12 min read

In this article, you will learn:

  • The main reasons why web scrapers are slow
  • Several techniques for speeding up web scraping
  • How to optimize a sample Python scraping script for faster data retrieval

Let's dive in!

Why Your Scraping Process Is Slow

Let's look at some of the key reasons why your web scraper may be slow.

Reason #1: Slow Server Response

One of the most significant factors affecting scraping speed is the server's response time. When you send a request to a website, the server has to process it and reply. If the server responds slowly, your requests take longer to complete. Slow servers are typically caused by heavy traffic, limited resources, or network latency.

Unfortunately, there is little you can do to directly speed up the target server. That is out of your control, unless the server is struggling because you are overwhelming it with requests. In that case, add random delays between your requests to spread them over a longer time window.
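
For instance, a minimal sketch of that idea with requests (the 1 to 3 second range is an arbitrary choice):

import random
import time

import requests

urls = [
    "http://quotes.toscrape.com/",
    "https://quotes.toscrape.com/page/2/",
]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)

    # pause for a random interval before the next request
    # so the target server is not flooded
    time.sleep(random.uniform(1, 3))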

Reason #2: Slow CPU Processing

CPU processing speed is critical to how fast your scraping script runs. When the script executes sequentially, the CPU has to handle one operation at a time, which takes longer. This is especially noticeable when the script involves complex calculations or data transformations.

In addition, HTML parsing takes time and can noticeably slow down your scraping process. To learn more, read our article on HTML web scraping.

Reason #3: Limited I/O Operations

Input/output (I/O) operations can easily become a scraping bottleneck, especially when the target site spans multiple pages. If your script has to wait for every external resource to respond before moving on, the delays add up considerably.

Sending a request, waiting for the server to respond, processing the response, and only then moving on to the next request is not an efficient scraping pattern.

Other Reasons

Other factors that can slow down a scraping script include:

  • Inefficient code: Poorly written scraping logic slows down the whole process. Avoid inefficient data structures, unnecessary loops, and excessive logging.
  • Rate limiting: If the target site caps the number of requests allowed in a given time window, your automated scraper will be throttled. The solution? Proxy services! (See the sketch after this list.)
  • CAPTCHAs and other anti-scraping measures: CAPTCHAs and anti-bot systems require human intervention and disrupt your scraping process. Read more about other anti-scraping techniques here.
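
As a rough illustration of the proxy idea from the list above, here is a minimal sketch of routing a request through a proxy with requests. The proxy URL and credentials are placeholders, not a real endpoint:

import requests

# hypothetical proxy endpoint: swap in your own provider's details
proxies = {
    "http": "http://user:pass@proxy.example.com:8000",
    "https": "http://user:pass@proxy.example.com:8000",
}

response = requests.get("http://quotes.toscrape.com/", proxies=proxies)
print(response.status_code)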

Techniques to Speed Up Web Scraping

This section covers several common approaches to speeding up the web scraping process. We will start with a basic Python scraping script and show the effect of applying different optimizations to it.

Note: the techniques explored here apply to any programming language or technology. Python is used simply because it is straightforward and one of the best programming languages for web scraping.

Here is the initial version of the Python scraping script:

import requests
from bs4 import BeautifulSoup
import csv
import time

def scrape_quotes_to_scrape():
    # list of the URLs of the pages to scrape
    urls = [
        "http://quotes.toscrape.com/",
        "https://quotes.toscrape.com/page/2/",
        "https://quotes.toscrape.com/page/3/",
        "https://quotes.toscrape.com/page/4/",
        "https://quotes.toscrape.com/page/5/",
        "https://quotes.toscrape.com/page/6/",
        "https://quotes.toscrape.com/page/7/",
        "https://quotes.toscrape.com/page/8/",
        "https://quotes.toscrape.com/page/9/",
        "https://quotes.toscrape.com/page/10/"
    ]

    # where to store the scraped data
    quotes = []

    # scrape the pages sequentially
    for url in urls:
        print(f"Scraping page: '{url}'")

        # send a GET request to get the page HTML
        response = requests.get(url)
        # parse the page HTML using BeautifulSoup
        soup = BeautifulSoup(response.content, "html.parser")

        # select all quote elements on the page
        quote_html_elements = soup.select(".quote")

        # iterate over the quote elements and scrape their content
        for quote_html_element in quote_html_elements:
            # extract the text of the quote
            text = quote_html_element.select_one(".text").get_text()
            # extract the author of the quote
            author = quote_html_element.select_one(".author").get_text()
            # extract tags associated with the quote
            tags = [tag.get_text() for tag in quote_html_element.select(".tag")]

            # populate a new quote object and add it to the list
            quote = {
                "text": text,
                "author": author,
                "tags": ", ".join(tags)
            }
            quotes.append(quote)

        print(f"Page '{url}' scraped successfully\n")

    print("Exporting scraped data to CSV")

    # export the scraped quotes to a CSV file
    with open("quotes.csv", "w", newline="", encoding="utf-8") as csvfile:
        fieldnames = ["text", "author", "tags"]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

        writer.writeheader()
        writer.writerows(quotes)

    print("Quotes exported to CSV\n")

# measure execution time
start_time = time.time()
scrape_quotes_to_scrape()
end_time = time.time()

execution_time = end_time - start_time
print(f"Execution time: {execution_time:.2f} seconds")

The scraper above targets the 10 pagination URLs of the Quotes to Scrape website. For each page, it performs the following steps:

  1. Sends a GET request with requests to retrieve the page HTML.
  2. Parses the HTML content with BeautifulSoup.
  3. Extracts the text, author, and tags from each quote element on the page.
  4. Stores the scraped data as a list of dictionaries.

Finally, it exports all the scraped data to a file named quotes.csv.

To run the script, first install the required libraries:

pip install requests beautifulsoup4

The call to scrape_quotes_to_scrape() is wrapped with time.time() to measure the execution time of the whole operation. On our machine, the initial script took roughly 4.6 seconds to complete.

After running the script, a quotes.csv file appears in the project folder, and the output looks something like this:

Scraping page: 'http://quotes.toscrape.com/'
Page 'http://quotes.toscrape.com/' scraped successfully

Scraping page: 'https://quotes.toscrape.com/page/2/'
Page 'https://quotes.toscrape.com/page/2/' scraped successfully

Scraping page: 'https://quotes.toscrape.com/page/3/'
Page 'https://quotes.toscrape.com/page/3/' scraped successfully

Scraping page: 'https://quotes.toscrape.com/page/4/'
Page 'https://quotes.toscrape.com/page/4/' scraped successfully

Scraping page: 'https://quotes.toscrape.com/page/5/'
Page 'https://quotes.toscrape.com/page/5/' scraped successfully

Scraping page: 'https://quotes.toscrape.com/page/6/'
Page 'https://quotes.toscrape.com/page/6/' scraped successfully

Scraping page: 'https://quotes.toscrape.com/page/7/'
Page 'https://quotes.toscrape.com/page/7/' scraped successfully

Scraping page: 'https://quotes.toscrape.com/page/8/'
Page 'https://quotes.toscrape.com/page/8/' scraped successfully

Scraping page: 'https://quotes.toscrape.com/page/9/'
Page 'https://quotes.toscrape.com/page/9/' scraped successfully

Scraping page: 'https://quotes.toscrape.com/page/10/'
Page 'https://quotes.toscrape.com/page/10/' scraped successfully

Exporting scraped data to CSV
Quotes exported to CSV

Execution time: 4.63 seconds

As you can see, the script scraped each pagination page of the Quotes to Scrape site one after the other. As you are about to see, a few optimizations can dramatically change how fast this process runs.

Now, let's see how to speed up the scraper!

1. Use a Faster HTML Parsing Library

Parsing data takes time and resources, and different HTML parsers go about it in different ways. Some focus on rich, self-describing APIs, while others prioritize performance. For details, see our guide on the best HTML parsers.

In Python, Beautiful Soup is the most popular HTML parsing library, but it is not necessarily the fastest. Check out some benchmarks to learn more.

In fact, Beautiful Soup is just a wrapper around different underlying parsers, which you select via the second argument of its constructor:

soup = BeautifulSoup(response.content, "html.parser")

Beautiful Soup is usually paired with html.parser, the parser built into the Python standard library. However, if performance is what you are after, consider lxml. Thanks to its C-based implementation, it is one of the fastest HTML parsers available in Python.

To install lxml, run:

pip install lxml

Once installed, use it in Beautiful Soup like this:

soup = BeautifulSoup(response.content, "lxml")

Now, run your Python scraping script again, and the output will be something like:

# omitted for brevity...

Execution time: 4.35 seconds

The execution time dropped from 4.61 to 4.35 seconds. That change may not look like much, but how much this optimization buys you in practice depends heavily on the size and complexity of the HTML pages you parse and on how you select elements.

In this example, the target site has simple, shallow pages. Even so, getting a roughly 6% speed boost from such a tiny change is nothing to sneeze at!
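
If you want to verify the gain on your own pages, a quick and admittedly unscientific micro-benchmark of the two parsers might look like this:

import time

import requests
from bs4 import BeautifulSoup

# download one page once, then time how long each parser takes on it
html = requests.get("http://quotes.toscrape.com/").content

for parser in ["html.parser", "lxml"]:
    start = time.time()
    for _ in range(200):
        BeautifulSoup(html, parser)
    print(f"{parser}: {time.time() - start:.2f} seconds")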

👍 Pros

  • Easy to implement in Beautiful Soup

👎 Cons

  • Relatively small improvement
  • More noticeable only on pages with complex DOM structures
  • Some faster HTML parsing libraries may have more complex APIs

2. Implement Multiprocessing Scraping

Multiprocessing is a form of parallel execution in which a program spawns multiple processes, each of which can run independently on its own CPU core. Different tasks are handled at the same time instead of one after another.

This approach is particularly beneficial for I/O-bound operations like web scraping, where the main bottleneck is usually the time spent waiting for web servers to respond. With multiprocessing, you can request multiple pages at once and reduce the overall scraping time.

Converting your script to multiprocessing requires some significant changes to its execution logic. The steps below show how to migrate the Python scraper from sequential to multiprocessing execution.

First, import Pool and cpu_count from Python's multiprocessing module:

from multiprocessing import Pool, cpu_count

Pool manages a pool of worker processes, while cpu_count tells you how many CPU cores are available for parallel processing.

Next, wrap the logic for scraping a single URL in a dedicated function:

def scrape_page(url):
    print(f"Scraping page: '{url}'")

    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    quote_html_elements = soup.select(".quote")

    quotes = []
    for quote_html_element in quote_html_elements:
        text = quote_html_element.select_one(".text").get_text()
        author = quote_html_element.select_one(".author").get_text()
        tags = [tag.get_text() for tag in quote_html_element.select(".tag")]
        quotes.append({
            "text": text,
            "author": author,
            "tags": ", ".join(tags)
        })

    print(f"Page '{url}' scraped successfully\n")

    return quotes

The function above will be called in separate processes, each running it on its own core.

Then, replace the original sequential scraping logic with its multiprocessing counterpart:

def scrape_quotes():
    urls = [
        "http://quotes.toscrape.com/",
        "https://quotes.toscrape.com/page/2/",
        "https://quotes.toscrape.com/page/3/",
        "https://quotes.toscrape.com/page/4/",
        "https://quotes.toscrape.com/page/5/",
        "https://quotes.toscrape.com/page/6/",
        "https://quotes.toscrape.com/page/7/",
        "https://quotes.toscrape.com/page/8/",
        "https://quotes.toscrape.com/page/9/",
        "https://quotes.toscrape.com/page/10/"
    ]

    # create a pool of processes
    with Pool(processes=cpu_count()) as pool:
        results = pool.map(scrape_page, urls)

    # flatten the results list
    quotes = [quote for sublist in results for quote in sublist]

    print("Exporting scraped data to CSV")

    with open("quotes_multiprocessing.csv", "w", newline="", encoding="utf-8") as csvfile:
        fieldnames = ["text", "author", "tags"]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(quotes)

    print("Quotes exported to CSV\n")

Finally, measure the script's execution time with time.time():

if __name__ == "__main__":
    start_time = time.time()
    scrape_quotes()
    end_time = time.time()

    execution_time = end_time - start_time
    print(f"Execution time: {execution_time:.2f} seconds")

Note that the if __name__ == "__main__": guard is required to prevent the code from running again when the module is imported, which would otherwise spawn duplicate processes and cause other unpredictable behavior, especially on Windows.

Putting it all together:

from multiprocessing import Pool, cpu_count
import requests
from bs4 import BeautifulSoup
import csv
import time

def scrape_page(url):
    print(f"Scraping page: '{url}'")

    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    quote_html_elements = soup.select(".quote")

    quotes = []
    for quote_html_element in quote_html_elements:
        text = quote_html_element.select_one(".text").get_text()
        author = quote_html_element.select_one(".author").get_text()
        tags = [tag.get_text() for tag in quote_html_element.select(".tag")]
        quotes.append({
            "text": text,
            "author": author,
            "tags": ", ".join(tags)
        })

    print(f"Page '{url}' scraped successfully\n")

    return quotes

def scrape_quotes():
    urls = [
        "http://quotes.toscrape.com/",
        "https://quotes.toscrape.com/page/2/",
        "https://quotes.toscrape.com/page/3/",
        "https://quotes.toscrape.com/page/4/",
        "https://quotes.toscrape.com/page/5/",
        "https://quotes.toscrape.com/page/6/",
        "https://quotes.toscrape.com/page/7/",
        "https://quotes.toscrape.com/page/8/",
        "https://quotes.toscrape.com/page/9/",
        "https://quotes.toscrape.com/page/10/"
    ]

    # create a pool of processes
    with Pool(processes=cpu_count()) as pool:
        results = pool.map(scrape_page, urls)

    # flatten the results list
    quotes = [quote for sublist in results for quote in sublist]

    print("Exporting scraped data to CSV")

    with open("quotes_multiprocessing.csv", "w", newline="", encoding="utf-8") as csvfile:
        fieldnames = ["text", "author", "tags"]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(quotes)

    print("Quotes exported to CSV\n")

if __name__ == "__main__":
    start_time = time.time()
    scrape_quotes()
    end_time = time.time()

    execution_time = end_time - start_time
    print(f"Execution time: {execution_time:.2f} seconds")

Run the script again, and this time the output will look something like this:

Scraping page: 'http://quotes.toscrape.com/'
Scraping page: 'https://quotes.toscrape.com/page/2/'
Scraping page: 'https://quotes.toscrape.com/page/3/'
Scraping page: 'https://quotes.toscrape.com/page/4/'
Scraping page: 'https://quotes.toscrape.com/page/5/'
Scraping page: 'https://quotes.toscrape.com/page/6/'
Scraping page: 'https://quotes.toscrape.com/page/7/'
Scraping page: 'https://quotes.toscrape.com/page/8/'
Page 'http://quotes.toscrape.com/' scraped successfully

Scraping page: 'https://quotes.toscrape.com/page/9/'
Page 'https://quotes.toscrape.com/page/3/' scraped successfully

Scraping page: 'https://quotes.toscrape.com/page/10/'
Page 'https://quotes.toscrape.com/page/4/' scraped successfully

Page 'https://quotes.toscrape.com/page/5/' scraped successfully

Page 'https://quotes.toscrape.com/page/6/' scraped successfully

Page 'https://quotes.toscrape.com/page/7/' scraped successfully

Page 'https://quotes.toscrape.com/page/2/' scraped successfully

Page 'https://quotes.toscrape.com/page/8/' scraped successfully

Page 'https://quotes.toscrape.com/page/9/' scraped successfully

Page 'https://quotes.toscrape.com/page/10/' scraped successfully

Exporting scraped data to CSV
Quotes exported to CSV

Execution time: 1.87 seconds

As you can see, the pages are no longer scraped in strict sequential order. The script now fetches several pages at the same time, with the degree of parallelism limited by the number of CPU cores (8 in this case).

Parallel processing cut the script's execution time from roughly 4.61 to 1.87 seconds, an improvement of around 145%. That is a significant result!

👍 Pros

  • Significantly faster scraping execution
  • Multiprocessing is natively supported by most programming languages

👎 Cons

  • Limited by the number of CPU cores on your machine
  • Pages are no longer scraped in the original list order
  • Requires significant code changes

3. Implement Multithreading Scraping

Multithreading is a concurrency technique that runs multiple threads within a single process, allowing your script to perform several tasks at once, each handled by its own thread.

It is similar to multiprocessing, but multithreading does not necessarily require multiple CPU cores. Several threads can run on a single core and share the same memory space. For a deeper dive into these concepts, see our article on concurrency vs. parallelism.

Converting the script from sequential to multithreaded execution follows much the same approach as multiprocessing.

The example below uses ThreadPoolExecutor from Python's concurrent.futures module, which you can import like this:

from concurrent.futures import ThreadPoolExecutor

ThreadPoolExecutor provides a high-level interface for managing a pool of threads and running them concurrently.

As before, wrap the single-URL scraping logic in a dedicated function, then call it concurrently through ThreadPoolExecutor:

quotes = []

# create a thread pool with up to 10 workers
with ThreadPoolExecutor(max_workers=10) as executor:
    # use map to apply the scrape_page function to each URL
    results = executor.map(scrape_page, urls)

# combine the results from all threads
for result in results:
    quotes.extend(result)

If max_workers is None or not specified, the default depends on your Python version: older releases used five times the number of processors, while Python 3.8+ uses min(32, os.cpu_count() + 4). Since there are only 10 pages here, setting it to 10 is plenty. Keep in mind that spawning too many threads can bog down your system and end up being counterproductive.

The complete scraping script looks like this:

from concurrent.futures import ThreadPoolExecutor
import requests
from bs4 import BeautifulSoup
import csv
import time

def scrape_page(url):
    print(f"Scraping page: '{url}'")

    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    quote_html_elements = soup.select(".quote")

    quotes = []
    for quote_html_element in quote_html_elements:
        text = quote_html_element.select_one(".text").get_text()
        author = quote_html_element.select_one(".author").get_text()
        tags = [tag.get_text() for tag in quote_html_element.select(".tag")]
        quotes.append({
            "text": text,
            "author": author,
            "tags": ", ".join(tags)
        })

    print(f"Page '{url}' scraped successfully\n")

    return quotes

def scrape_quotes():
    urls = [
        "http://quotes.toscrape.com/",
        "https://quotes.toscrape.com/page/2/",
        "https://quotes.toscrape.com/page/3/",
        "https://quotes.toscrape.com/page/4/",
        "https://quotes.toscrape.com/page/5/",
        "https://quotes.toscrape.com/page/6/",
        "https://quotes.toscrape.com/page/7/",
        "https://quotes.toscrape.com/page/8/",
        "https://quotes.toscrape.com/page/9/",
        "https://quotes.toscrape.com/page/10/"
    ]
    
    # where to store the scraped data
    quotes = []

    # create a thread pool with up to 10 workers
    with ThreadPoolExecutor(max_workers=10) as executor:
        # use map to apply the scrape_page function to each URL
        results = executor.map(scrape_page, urls)

    # combine the results from all threads
    for result in results:
        quotes.extend(result)

    print("Exporting scraped data to CSV")

    with open("quotes_multiprocessing.csv", "w", newline="", encoding="utf-8") as csvfile:
        fieldnames = ["text", "author", "tags"]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(quotes)

    print("Quotes exported to CSV\n")

if __name__ == "__main__":
    start_time = time.time()
    scrape_quotes()
    end_time = time.time()

    execution_time = end_time - start_time
    print(f"Execution time: {execution_time:.2f} seconds")

Run it, and you will see logs like these:

Scraping page: 'http://quotes.toscrape.com/'
Scraping page: 'https://quotes.toscrape.com/page/2/'
Scraping page: 'https://quotes.toscrape.com/page/3/'
Scraping page: 'https://quotes.toscrape.com/page/4/'
Scraping page: 'https://quotes.toscrape.com/page/5/'
Scraping page: 'https://quotes.toscrape.com/page/6/'
Scraping page: 'https://quotes.toscrape.com/page/7/'
Scraping page: 'https://quotes.toscrape.com/page/8/'
Scraping page: 'https://quotes.toscrape.com/page/9/'
Scraping page: 'https://quotes.toscrape.com/page/10/'
Page 'http://quotes.toscrape.com/' scraped successfully

Page 'https://quotes.toscrape.com/page/6/' scraped successfully

Page 'https://quotes.toscrape.com/page/7/' scraped successfully

Page 'https://quotes.toscrape.com/page/10/' scraped successfully

Page 'https://quotes.toscrape.com/page/8/' scraped successfully

Page 'https://quotes.toscrape.com/page/5/' scraped successfully

Page 'https://quotes.toscrape.com/page/9/' scraped successfully

Page 'https://quotes.toscrape.com/page/4/' scraped successfully

Page 'https://quotes.toscrape.com/page/3/' scraped successfully

Page 'https://quotes.toscrape.com/page/2/' scraped successfully

Exporting scraped data to CSV
Quotes exported to CSV

Execution time: 0.52 seconds

As with multiprocessing, the pages are no longer processed in sequential order. This time, performance is even better than multiprocessing, since the 10 threads exceed the number of CPU cores on our machine (8).

Execution time dropped from 4.61 to 0.52 seconds, a speedup of nearly 9x. Impressive!

👍 Pros

  • Major improvement in scraping speed
  • Multithreading is natively supported by most languages

👎 Cons

  • Finding the right number of threads is not trivial
  • Pages are no longer scraped in list order
  • Requires significant code changes

4. Use Async/Await Scraping

Asynchronous programming is a modern paradigm that lets you write non-blocking code. Developers can handle concurrent operations without explicitly managing threads or processes.

In the traditional synchronous model, each operation runs only after the previous one finishes, which is inefficient for I/O-bound work such as web scraping. With asynchronous programming, you can start several I/O operations at once and let the script keep doing useful work while it waits for them to complete.

In Python, asynchronous scraping is usually built on asyncio from the standard library. It achieves single-threaded concurrency through coroutines, using the async and await keywords.
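
Before applying this to scraping, here is a minimal, scraping-free sketch of how asyncio runs several waits concurrently on a single thread:

import asyncio
import time

async def fake_download(name):
    # simulate a 1-second I/O wait without blocking the event loop
    await asyncio.sleep(1)
    return f"{name} done"

async def main():
    # start all three "downloads" at once and wait for them together
    results = await asyncio.gather(*(fake_download(f"task {i}") for i in range(3)))
    print(results)

start = time.time()
asyncio.run(main())
print(f"Total time: {time.time() - start:.2f} seconds")  # about 1 second, not 3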

However, popular HTTP libraries such as requests do not natively support async. You need a client designed for asynchronous use, like AIOHTTP, paired with asyncio. This combination lets you send multiple HTTP requests at the same time without blocking the script.

First, install AIOHTTP with:

pip install aiohttp

Then, import asyncio and aiohttp:

import asyncio
import aiohttp

As in the previous approaches, wrap the single-URL scraping logic in an async function:

async def scrape_url(session, url):
    async with session.get(url) as response:
        print(f"Scraping page: '{url}'")

        html_content = await response.text()
        soup = BeautifulSoup(html_content, "html.parser")
        # scraping logic...

Note the use of await to retrieve the page's HTML content.

To run this function concurrently, create an AIOHTTP session and use asyncio.gather to execute the scraping tasks at the same time:

# executing the scraping tasks concurrently
async with aiohttp.ClientSession() as session:
    tasks = [scrape_url(session, url) for url in urls]
    results = await asyncio.gather(*tasks)

# flatten the results list
quotes = [quote for sublist in results for quote in sublist]

Finally, launch your async main scraping function with asyncio.run():

if __name__ == "__main__":
    start_time = time.time()
    asyncio.run(scrape_quotes())
    end_time = time.time()

    execution_time = end_time - start_time
    print(f"Execution time: {execution_time:.2f} seconds")

The complete async Python scraping script is shown below:

import asyncio
import aiohttp
from bs4 import BeautifulSoup
import csv
import time

async def scrape_url(session, url):
    async with session.get(url) as response:
        print(f"Scraping page: '{url}'")

        html_content = await response.text()
        soup = BeautifulSoup(html_content, "html.parser")
        quote_html_elements = soup.select(".quote")

        quotes = []
        for quote_html_element in quote_html_elements:
            text = quote_html_element.select_one(".text").get_text()
            author = quote_html_element.select_one(".author").get_text()
            tags = [tag.get_text() for tag in quote_html_element.select(".tag")]
            quotes.append({
                "text": text,
                "author": author,
                "tags": ", ".join(tags)
            })

        print(f"Page '{url}' scraped successfully\n")

        return quotes

async def scrape_quotes():
    urls = [
        "http://quotes.toscrape.com/",
        "https://quotes.toscrape.com/page/2/",
        "https://quotes.toscrape.com/page/3/",
        "https://quotes.toscrape.com/page/4/",
        "https://quotes.toscrape.com/page/5/",
        "https://quotes.toscrape.com/page/6/",
        "https://quotes.toscrape.com/page/7/",
        "https://quotes.toscrape.com/page/8/",
        "https://quotes.toscrape.com/page/9/",
        "https://quotes.toscrape.com/page/10/"
    ]

    # executing the scraping tasks concurrently
    async with aiohttp.ClientSession() as session:
        tasks = [scrape_url(session, url) for url in urls]
        results = await asyncio.gather(*tasks)

    # flatten the results list
    quotes = [quote for sublist in results for quote in sublist]

    print("Exporting scraped data to CSV")

    with open("quotes_multiprocessing.csv", "w", newline="", encoding="utf-8") as csvfile:
        fieldnames = ["text", "author", "tags"]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(quotes)

    print("Quotes exported to CSV\n")

if __name__ == "__main__":
    start_time = time.time()
    asyncio.run(scrape_quotes())
    end_time = time.time()

    execution_time = end_time - start_time
    print(f"Execution time: {execution_time:.2f} seconds")

Run it, and the logs will look something like this:

Scraping page: 'http://quotes.toscrape.com/'
Page 'http://quotes.toscrape.com/' scraped successfully                                                                

Scraping page: 'https://quotes.toscrape.com/page/3/'
Scraping page: 'https://quotes.toscrape.com/page/7/'
Scraping page: 'https://quotes.toscrape.com/page/9/'
Scraping page: 'https://quotes.toscrape.com/page/6/'
Scraping page: 'https://quotes.toscrape.com/page/8/'
Scraping page: 'https://quotes.toscrape.com/page/10/'
Page 'https://quotes.toscrape.com/page/3/' scraped successfully

Scraping page: 'https://quotes.toscrape.com/page/5/'
Scraping page: 'https://quotes.toscrape.com/page/4/'
Page 'https://quotes.toscrape.com/page/7/' scraped successfully

Page 'https://quotes.toscrape.com/page/9/' scraped successfully

Page 'https://quotes.toscrape.com/page/6/' scraped successfully

Scraping page: 'https://quotes.toscrape.com/page/2/'
Page 'https://quotes.toscrape.com/page/10/' scraped successfully

Page 'https://quotes.toscrape.com/page/5/' scraped successfully

Page 'https://quotes.toscrape.com/page/4/' scraped successfully

Page 'https://quotes.toscrape.com/page/8/' scraped successfully

Page 'https://quotes.toscrape.com/page/2/' scraped successfully

Exporting scraped data to CSV
Quotes exported to CSV

Execution time: 0.51 seconds

As you can see, the execution speed is comparable to multithreading, with the advantage that you do not have to manage threads yourself.

👍 Pros

  • Delivers a major speed boost
  • Async logic is widely recommended in modern programming
  • No manual thread or process management

👎 Cons

  • Steeper learning curve
  • Pages are no longer processed in the original list order
  • Requires dedicated async libraries

5. Other Tips and Tricks to Speed Up Scraping

Other ways to speed up your scraper include:

  • Optimizing the request rate: Tune the pacing of your requests to find the sweet spot between speed and avoiding rate limits or bans (see the sketch after this list).
  • Using rotating proxies: Rotate through multiple IP addresses to lower the risk of bans and scrape faster. Learn more about the best rotating proxies.
  • Distributing the scraping in parallel: Spread scraping tasks across multiple servers or cloud machines.
  • Reducing JavaScript rendering: Avoid spinning up headless browsers where you can, and prefer an HTTP client plus an HTML parsing library. Browsers are resource-hungry and usually much slower than parsing HTML directly.
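
For the first point, one common pattern with the async approach shown earlier is to cap concurrency with a semaphore, which keeps most of the speed gains while avoiding hammering the server. This is only a sketch, and the limit of 5 is arbitrary:

import asyncio

import aiohttp

async def fetch(session, semaphore, url):
    # the semaphore caps how many requests are in flight at the same time
    async with semaphore:
        async with session.get(url) as response:
            return await response.text()

async def main():
    semaphore = asyncio.Semaphore(5)  # arbitrary limit of 5 concurrent requests
    urls = [f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 11)]
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, semaphore, url) for url in urls))
    print(f"Downloaded {len(pages)} pages")

asyncio.run(main())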

Summary

This guide showed several ways to speed up web scraping. We dug into the main reasons scraping scripts slow down and, using a sample Python script, walked through various optimizations. With only small changes to the scraping logic, execution time improved by almost 9x.

Of course, hand-optimizing your scraping logic matters, but choosing the right tools matters just as much. Things get more complicated when the target site relies heavily on dynamic rendering and you need browser automation tools, since browsers tend to be slow and resource-hungry.

To tackle these challenges, try Scraping Browser, a fully managed cloud scraping solution that integrates seamlessly with popular browser automation tools such as Puppeteer, Selenium, and Playwright. It comes with automatic CAPTCHA solving and is backed by a proxy network of 72+ million residential IPs, so it scales to handle any scraping workload!

Sign up now and start your free trial.