How to Speed Up Web Scraping: The Complete Guide

Optimize your web scraping process with advanced techniques for faster data retrieval.
12 min read

In this article, you will learn:

  • The main reasons why web scrapers are slow
  • Several techniques for speeding up web scraping
  • How to optimize a sample Python scraping script for faster data retrieval

Let's dive in!

Why Your Scraping Process Is Slow

Let's look at some of the key reasons why your web scraper may be slow.

Reason #1: Slow Server Response

One of the most significant factors affecting scraping speed is the server's response time. When you send a request to a website, the server has to process it and reply. If the server responds slowly, your requests take longer to complete. Slow servers are typically caused by heavy traffic, limited resources, or network latency.

Unfortunately, there is little you can do to directly speed up the target server. That is out of your control, unless the server is struggling because you are overwhelming it with requests. In that case, add random delays between your requests to spread them over a longer time window.
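
For instance, a minimal sketch of that idea with requests (the 1 to 3 second range is an arbitrary choice):

import random
import time

import requests

urls = [
    "http://quotes.toscrape.com/",
    "https://quotes.toscrape.com/page/2/",
]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)

    # pause for a random interval before the next request
    # so the target server is not flooded
    time.sleep(random.uniform(1, 3))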

Reason #2: Slow CPU Processing

CPU processing speed is critical to how fast your scraping script runs. When the script executes sequentially, the CPU has to handle one operation at a time, which takes longer. This is especially noticeable when the script involves complex calculations or data transformations.

In addition, HTML parsing takes time and can noticeably slow down your scraping process. To learn more, read our article on HTML web scraping.

Reason #3: Limited I/O Operations

Input/output (I/O) operations can easily become a scraping bottleneck, especially when the target site spans multiple pages. If your script has to wait for every external resource to respond before moving on, the delays add up considerably.

Sending a request, waiting for the server to respond, processing the response, and only then moving on to the next request is not an efficient scraping pattern.

Other Reasons

Other factors that can slow down a scraping script include:

  • Inefficient code: Poorly written scraping logic slows down the whole process. Avoid inefficient data structures, unnecessary loops, and excessive logging.
  • Rate limiting: If the target site caps the number of requests allowed in a given time window, your automated scraper will be throttled. The solution? Proxy services! (See the sketch after this list.)
  • CAPTCHAs and other anti-scraping measures: CAPTCHAs and anti-bot systems require human intervention and disrupt your scraping process. Read more about other anti-scraping techniques here.
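
As a rough illustration of the proxy idea from the list above, here is a minimal sketch of routing a request through a proxy with requests. The proxy URL and credentials are placeholders, not a real endpoint:

import requests

# hypothetical proxy endpoint: swap in your own provider's details
proxies = {
    "http": "http://user:pass@proxy.example.com:8000",
    "https": "http://user:pass@proxy.example.com:8000",
}

response = requests.get("http://quotes.toscrape.com/", proxies=proxies)
print(response.status_code)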

Techniques to Speed Up Web Scraping

This section covers several common approaches to speeding up the web scraping process. We will start with a basic Python scraping script and show the effect of applying different optimizations to it.

Note: the techniques explored here apply to any programming language or technology. Python is used simply because it is straightforward and one of the best programming languages for web scraping.

Here is the initial version of the Python scraping script:

import requests
from bs4 import BeautifulSoup
import csv
import time

def scrape_quotes_to_scrape():
    # list of the URLs of the pages to scrape
    urls = [
        "http://quotes.toscrape.com/",
        "https://quotes.toscrape.com/page/2/",
        "https://quotes.toscrape.com/page/3/",
        "https://quotes.toscrape.com/page/4/",
        "https://quotes.toscrape.com/page/5/",
        "https://quotes.toscrape.com/page/6/",
        "https://quotes.toscrape.com/page/7/",
        "https://quotes.toscrape.com/page/8/",
        "https://quotes.toscrape.com/page/9/",
        "https://quotes.toscrape.com/page/10/"
    ]

    # where to store the scraped data
    quotes = []

    # scrape the pages sequentially
    for url in urls:
        print(f"Scraping page: '{url}'")

        # send a GET request to get the page HTML
        response = requests.get(url)
        # parse the page HTML using BeautifulSoup
        soup = BeautifulSoup(response.content, "html.parser")

        # select all quote elements on the page
        quote_html_elements = soup.select(".quote")

        # iterate over the quote elements and scrape their content
        for quote_html_element in quote_html_elements:
            # extract the text of the quote
            text = quote_html_element.select_one(".text").get_text()
            # extract the author of the quote
            author = quote_html_element.select_one(".author").get_text()
            # extract tags associated with the quote
            tags = [tag.get_text() for tag in quote_html_element.select(".tag")]

            # populate a new quote object and add it to the list
            quote = {
                "text": text,
                "author": author,
                "tags": ", ".join(tags)
            }
            quotes.append(quote)

        print(f"Page '{url}' scraped successfully\n")

    print("Exporting scraped data to CSV")

    # export the scraped quotes to a CSV file
    with open("quotes.csv", "w", newline="", encoding="utf-8") as csvfile:
        fieldnames = ["text", "author", "tags"]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

        writer.writeheader()
        writer.writerows(quotes)

    print("Quotes exported to CSV\n")

# measure execution time
start_time = time.time()
scrape_quotes_to_scrape()
end_time = time.time()

execution_time = end_time - start_time
print(f"Execution time: {execution_time:.2f} seconds")

The scraper above targets the 10 pagination URLs of the Quotes to Scrape website. For each page, it performs the following steps:

  1. Sends a GET request with requests to retrieve the page HTML.
  2. Parses the HTML content with BeautifulSoup.
  3. Extracts the text, author, and tags from each quote element on the page.
  4. Stores the scraped data as a list of dictionaries.

Finally, it exports all the scraped data to a file named quotes.csv.

To run the script, first install the required libraries:

pip install requests beautifulsoup4

The call to scrape_quotes_to_scrape() is wrapped with time.time() to measure the execution time of the whole operation. On our machine, the initial script took roughly 4.6 seconds to complete.

After running the script, a quotes.csv file appears in the project folder, and the output looks something like this:

Scraping page: 'http://quotes.toscrape.com/'
Page 'http://quotes.toscrape.com/' scraped successfully

Scraping page: 'https://quotes.toscrape.com/page/2/'
Page 'https://quotes.toscrape.com/page/2/' scraped successfully

Scraping page: 'https://quotes.toscrape.com/page/3/'
Page 'https://quotes.toscrape.com/page/3/' scraped successfully

Scraping page: 'https://quotes.toscrape.com/page/4/'
Page 'https://quotes.toscrape.com/page/4/' scraped successfully

Scraping page: 'https://quotes.toscrape.com/page/5/'
Page 'https://quotes.toscrape.com/page/5/' scraped successfully

Scraping page: 'https://quotes.toscrape.com/page/6/'
Page 'https://quotes.toscrape.com/page/6/' scraped successfully

Scraping page: 'https://quotes.toscrape.com/page/7/'
Page 'https://quotes.toscrape.com/page/7/' scraped successfully

Scraping page: 'https://quotes.toscrape.com/page/8/'
Page 'https://quotes.toscrape.com/page/8/' scraped successfully

Scraping page: 'https://quotes.toscrape.com/page/9/'
Page 'https://quotes.toscrape.com/page/9/' scraped successfully

Scraping page: 'https://quotes.toscrape.com/page/10/'
Page 'https://quotes.toscrape.com/page/10/' scraped successfully

Exporting scraped data to CSV
Quotes exported to CSV

Execution time: 4.63 seconds

As you can see, the script scraped each pagination page of the Quotes to Scrape site one after the other. As you are about to see, a few optimizations can dramatically change how fast this process runs.

Now, let's see how to speed up the scraper!

1. Use a Faster HTML Parsing Library

Parsing data takes time and resources, and different HTML parsers go about it in different ways. Some focus on rich, self-describing APIs, while others prioritize performance. For details, see our guide on the best HTML parsers.

In Python, Beautiful Soup is the most popular HTML parsing library, but it is not necessarily the fastest. Check out some benchmarks to learn more.

In fact, Beautiful Soup is just a wrapper around different underlying parsers, which you select via the second argument of its constructor:

soup = BeautifulSoup(response.content, "html.parser")

Beautiful Soup is usually paired with html.parser, the parser built into the Python standard library. However, if performance is what you are after, consider lxml. Thanks to its C-based implementation, it is one of the fastest HTML parsers available in Python.

To install lxml, run:

pip install lxml

Once installed, use it in Beautiful Soup like this:

soup = BeautifulSoup(response.content, "lxml")

Now, run your Python scraping script again, and the output will be something like:

# omitted for brevity...

Execution time: 4.35 seconds

The execution time dropped from 4.61 to 4.35 seconds. That change may not look like much, but how much this optimization buys you in practice depends heavily on the size and complexity of the HTML pages you parse and on how you select elements.

In this example, the target site has simple, shallow pages. Even so, getting a roughly 6% speed boost from such a tiny change is nothing to sneeze at!
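
If you want to verify the gain on your own pages, a quick and admittedly unscientific micro-benchmark of the two parsers might look like this:

import time

import requests
from bs4 import BeautifulSoup

# download one page once, then time how long each parser takes on it
html = requests.get("http://quotes.toscrape.com/").content

for parser in ["html.parser", "lxml"]:
    start = time.time()
    for _ in range(200):
        BeautifulSoup(html, parser)
    print(f"{parser}: {time.time() - start:.2f} seconds")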

👍 Pros

  • Easy to implement in Beautiful Soup

👎 Cons

  • Relatively small improvement
  • More noticeable only on pages with complex DOM structures
  • Some faster HTML parsing libraries may have more complex APIs

2. Implement Multiprocessing Scraping

Multiprocessing is a form of parallel execution in which a program spawns multiple processes, each of which can run independently on its own CPU core. Different tasks are handled at the same time instead of one after another.

This approach is particularly beneficial for I/O-bound operations like web scraping, where the main bottleneck is usually the time spent waiting for web servers to respond. With multiprocessing, you can request multiple pages at once and reduce the overall scraping time.

Converting your script to multiprocessing requires some significant changes to its execution logic. The steps below show how to migrate the Python scraper from sequential to multiprocessing execution.

First, import Pool and cpu_count from Python's multiprocessing module:

from multiprocessing import Pool, cpu_count

Pool manages a pool of worker processes, while cpu_count tells you how many CPU cores are available for parallel processing.

Next, wrap the logic for scraping a single URL in a dedicated function:

def scrape_page(url):
    print(f"Scraping page: '{url}'")

    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    quote_html_elements = soup.select(".quote")

    quotes = []
    for quote_html_element in quote_html_elements:
        text = quote_html_element.select_one(".text").get_text()
        author = quote_html_element.select_one(".author").get_text()
        tags = [tag.get_text() for tag in quote_html_element.select(".tag")]
        quotes.append({
            "text": text,
            "author": author,
            "tags": ", ".join(tags)
        })

    print(f"Page '{url}' scraped successfully\n")

    return quotes

The function above will be called in separate processes, each running it on its own core.

Then, replace the original sequential scraping logic with its multiprocessing counterpart:

def scrape_quotes():
    urls = [
        "http://quotes.toscrape.com/",
        "https://quotes.toscrape.com/page/2/",
        "https://quotes.toscrape.com/page/3/",
        "https://quotes.toscrape.com/page/4/",
        "https://quotes.toscrape.com/page/5/",
        "https://quotes.toscrape.com/page/6/",
        "https://quotes.toscrape.com/page/7/",
        "https://quotes.toscrape.com/page/8/",
        "https://quotes.toscrape.com/page/9/",
        "https://quotes.toscrape.com/page/10/"
    ]

    # create a pool of processes
    with Pool(processes=cpu_count()) as pool:
        results = pool.map(scrape_page, urls)

    # flatten the results list
    quotes = [quote for sublist in results for quote in sublist]

    print("Exporting scraped data to CSV")

    with open("quotes_multiprocessing.csv", "w", newline="", encoding="utf-8") as csvfile:
        fieldnames = ["text", "author", "tags"]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(quotes)

    print("Quotes exported to CSV\n")

Finally, measure the script's execution time with time.time():

if __name__ == "__main__":
    start_time = time.time()
    scrape_quotes()
    end_time = time.time()

    execution_time = end_time - start_time
    print(f"Execution time: {execution_time:.2f} seconds")

Note that the if __name__ == "__main__": guard is required to prevent the code from running again when the module is imported, which would otherwise spawn duplicate processes and cause other unpredictable behavior, especially on Windows.

Putting it all together:

from multiprocessing import Pool, cpu_count
import requests
from bs4 import BeautifulSoup
import csv
import time

def scrape_page(url):
    print(f"Scraping page: '{url}'")

    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    quote_html_elements = soup.select(".quote")

    quotes = []
    for quote_html_element in quote_html_elements:
        text = quote_html_element.select_one(".text").get_text()
        author = quote_html_element.select_one(".author").get_text()
        tags = [tag.get_text() for tag in quote_html_element.select(".tag")]
        quotes.append({
            "text": text,
            "author": author,
            "tags": ", ".join(tags)
        })

    print(f"Page '{url}' scraped successfully\n")

    return quotes

def scrape_quotes():
    urls = [
        "http://quotes.toscrape.com/",
        "https://quotes.toscrape.com/page/2/",
        "https://quotes.toscrape.com/page/3/",
        "https://quotes.toscrape.com/page/4/",
        "https://quotes.toscrape.com/page/5/",
        "https://quotes.toscrape.com/page/6/",
        "https://quotes.toscrape.com/page/7/",
        "https://quotes.toscrape.com/page/8/",
        "https://quotes.toscrape.com/page/9/",
        "https://quotes.toscrape.com/page/10/"
    ]

    # create a pool of processes
    with Pool(processes=cpu_count()) as pool:
        results = pool.map(scrape_page, urls)

    # flatten the results list
    quotes = [quote for sublist in results for quote in sublist]

    print("Exporting scraped data to CSV")

    with open("quotes_multiprocessing.csv", "w", newline="", encoding="utf-8") as csvfile:
        fieldnames = ["text", "author", "tags"]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(quotes)

    print("Quotes exported to CSV\n")

if __name__ == "__main__":
    start_time = time.time()
    scrape_quotes()
    end_time = time.time()

    execution_time = end_time - start_time
    print(f"Execution time: {execution_time:.2f} seconds")

Run the script again, and this time the output will look something like this:

Scraping page: 'http://quotes.toscrape.com/'
Scraping page: 'https://quotes.toscrape.com/page/2/'
Scraping page: 'https://quotes.toscrape.com/page/3/'
Scraping page: 'https://quotes.toscrape.com/page/4/'
Scraping page: 'https://quotes.toscrape.com/page/5/'
Scraping page: 'https://quotes.toscrape.com/page/6/'
Scraping page: 'https://quotes.toscrape.com/page/7/'
Scraping page: 'https://quotes.toscrape.com/page/8/'
Page 'http://quotes.toscrape.com/' scraped successfully

Scraping page: 'https://quotes.toscrape.com/page/9/'
Page 'https://quotes.toscrape.com/page/3/' scraped successfully

Scraping page: 'https://quotes.toscrape.com/page/10/'
Page 'https://quotes.toscrape.com/page/4/' scraped successfully

Page 'https://quotes.toscrape.com/page/5/' scraped successfully

Page 'https://quotes.toscrape.com/page/6/' scraped successfully

Page 'https://quotes.toscrape.com/page/7/' scraped successfully

Page 'https://quotes.toscrape.com/page/2/' scraped successfully

Page 'https://quotes.toscrape.com/page/8/' scraped successfully

Page 'https://quotes.toscrape.com/page/9/' scraped successfully

Page 'https://quotes.toscrape.com/page/10/' scraped successfully

Exporting scraped data to CSV
Quotes exported to CSV

Execution time: 1.87 seconds

As you can see, the pages are no longer scraped in strict sequential order. The script now fetches several pages at the same time, with the degree of parallelism limited by the number of CPU cores (8 in this case).

Parallel processing cut the script's execution time from roughly 4.61 to 1.87 seconds, an improvement of around 145%. That is a significant result!

👍 Pros

  • Significantly faster scraping execution
  • Multiprocessing is natively supported by most programming languages

👎 Cons

  • Limited by the number of CPU cores on your machine
  • Pages are no longer scraped in the original list order
  • Requires significant code changes

3. Implement Multithreading Scraping

Multithreading is a concurrency technique that runs multiple threads within a single process, allowing your script to perform several tasks at once, each handled by its own thread.

It is similar to multiprocessing, but multithreading does not necessarily require multiple CPU cores. Several threads can run on a single core and share the same memory space. For a deeper dive into these concepts, see our article on concurrency vs. parallelism.

Converting the script from sequential to multithreaded execution follows much the same approach as multiprocessing.

The example below uses ThreadPoolExecutor from Python's concurrent.futures module, which you can import like this:

from concurrent.futures import ThreadPoolExecutor

ThreadPoolExecutor provides a high-level interface for managing a pool of threads and running them concurrently.

As before, wrap the single-URL scraping logic in a dedicated function, then call it concurrently through ThreadPoolExecutor:

quotes = []

# create a thread pool with up to 10 workers
with ThreadPoolExecutor(max_workers=10) as executor:
    # use map to apply the scrape_page function to each URL
    results = executor.map(scrape_page, urls)

# combine the results from all threads
for result in results:
    quotes.extend(result)

If max_workers is None or not specified, the default depends on your Python version: older releases used five times the number of processors, while Python 3.8+ uses min(32, os.cpu_count() + 4). Since there are only 10 pages here, setting it to 10 is plenty. Keep in mind that spawning too many threads can bog down your system and end up being counterproductive.

The complete scraping script looks like this:

from concurrent.futures import ThreadPoolExecutor
import requests
from bs4 import BeautifulSoup
import csv
import time

def scrape_page(url):
    print(f"Scraping page: '{url}'")

    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    quote_html_elements = soup.select(".quote")

    quotes = []
    for quote_html_element in quote_html_elements:
        text = quote_html_element.select_one(".text").get_text()
        author = quote_html_element.select_one(".author").get_text()
        tags = [tag.get_text() for tag in quote_html_element.select(".tag")]
        quotes.append({
            "text": text,
            "author": author,
            "tags": ", ".join(tags)
        })

    print(f"Page '{url}' scraped successfully\n")

    return quotes

def scrape_quotes():
    urls = [
        "http://quotes.toscrape.com/",
        "https://quotes.toscrape.com/page/2/",
        "https://quotes.toscrape.com/page/3/",
        "https://quotes.toscrape.com/page/4/",
        "https://quotes.toscrape.com/page/5/",
        "https://quotes.toscrape.com/page/6/",
        "https://quotes.toscrape.com/page/7/",
        "https://quotes.toscrape.com/page/8/",
        "https://quotes.toscrape.com/page/9/",
        "https://quotes.toscrape.com/page/10/"
    ]
    
    # where to store the scraped data
    quotes = []

    # create a thread pool with up to 10 workers
    with ThreadPoolExecutor(max_workers=10) as executor:
        # use map to apply the scrape_page function to each URL
        results = executor.map(scrape_page, urls)

    # combine the results from all threads
    for result in results:
        quotes.extend(result)

    print("Exporting scraped data to CSV")

    with open("quotes_multiprocessing.csv", "w", newline="", encoding="utf-8") as csvfile:
        fieldnames = ["text", "author", "tags"]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(quotes)

    print("Quotes exported to CSV\n")

if __name__ == "__main__":
    start_time = time.time()
    scrape_quotes()
    end_time = time.time()

    execution_time = end_time - start_time
    print(f"Execution time: {execution_time:.2f} seconds")

Run it, and you will see logs like these:

Scraping page: 'http://quotes.toscrape.com/'
Scraping page: 'https://quotes.toscrape.com/page/2/'
Scraping page: 'https://quotes.toscrape.com/page/3/'
Scraping page: 'https://quotes.toscrape.com/page/4/'
Scraping page: 'https://quotes.toscrape.com/page/5/'
Scraping page: 'https://quotes.toscrape.com/page/6/'
Scraping page: 'https://quotes.toscrape.com/page/7/'
Scraping page: 'https://quotes.toscrape.com/page/8/'
Scraping page: 'https://quotes.toscrape.com/page/9/'
Scraping page: 'https://quotes.toscrape.com/page/10/'
Page 'http://quotes.toscrape.com/' scraped successfully

Page 'https://quotes.toscrape.com/page/6/' scraped successfully

Page 'https://quotes.toscrape.com/page/7/' scraped successfully

Page 'https://quotes.toscrape.com/page/10/' scraped successfully

Page 'https://quotes.toscrape.com/page/8/' scraped successfully

Page 'https://quotes.toscrape.com/page/5/' scraped successfully

Page 'https://quotes.toscrape.com/page/9/' scraped successfully

Page 'https://quotes.toscrape.com/page/4/' scraped successfully

Page 'https://quotes.toscrape.com/page/3/' scraped successfully

Page 'https://quotes.toscrape.com/page/2/' scraped successfully

Exporting scraped data to CSV
Quotes exported to CSV

Execution time: 0.52 seconds

As with multiprocessing, the pages are no longer processed in sequential order. This time, performance is even better than multiprocessing, since the 10 threads exceed the number of CPU cores on our machine (8).

Execution time dropped from 4.61 to 0.52 seconds, a speedup of nearly 9x. Impressive!

👍 Pros

  • Major improvement in scraping speed
  • Multithreading is natively supported by most languages

👎 Cons

  • Finding the right number of threads is not trivial
  • Pages are no longer scraped in list order
  • Requires significant code changes

4. Use Async/Await Scraping

Asynchronous programming is a modern paradigm that lets you write non-blocking code. Developers can handle concurrent operations without explicitly managing threads or processes.

In the traditional synchronous model, each operation runs only after the previous one finishes, which is inefficient for I/O-bound work such as web scraping. With asynchronous programming, you can start several I/O operations at once and let the script keep doing useful work while it waits for them to complete.

In Python, asynchronous scraping is usually built on asyncio from the standard library. It achieves single-threaded concurrency through coroutines, using the async and await keywords.
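
Before applying this to scraping, here is a minimal, scraping-free sketch of how asyncio runs several waits concurrently on a single thread:

import asyncio
import time

async def fake_download(name):
    # simulate a 1-second I/O wait without blocking the event loop
    await asyncio.sleep(1)
    return f"{name} done"

async def main():
    # start all three "downloads" at once and wait for them together
    results = await asyncio.gather(*(fake_download(f"task {i}") for i in range(3)))
    print(results)

start = time.time()
asyncio.run(main())
print(f"Total time: {time.time() - start:.2f} seconds")  # about 1 second, not 3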

However, popular HTTP libraries such as requests do not natively support async. You need a client designed for asynchronous use, like AIOHTTP, paired with asyncio. This combination lets you send multiple HTTP requests at the same time without blocking the script.

First, install AIOHTTP with:

pip install aiohttp

Then, import asyncio and aiohttp:

import asyncio
import aiohttp

As in the previous approaches, wrap the single-URL scraping logic in an async function:

async def scrape_url(session, url):
    async with session.get(url) as response:
        print(f"Scraping page: '{url}'")

        html_content = await response.text()
        soup = BeautifulSoup(html_content, "html.parser")
        # scraping logic...

Note the use of await to retrieve the page's HTML content.

To run this function concurrently, create an AIOHTTP session and use asyncio.gather to execute the scraping tasks at the same time:

# executing the scraping tasks concurrently
async with aiohttp.ClientSession() as session:
    tasks = [scrape_url(session, url) for url in urls]
    results = await asyncio.gather(*tasks)

# flatten the results list
quotes = [quote for sublist in results for quote in sublist]

Finally, launch your async main scraping function with asyncio.run():

if __name__ == "__main__":
    start_time = time.time()
    asyncio.run(scrape_quotes())
    end_time = time.time()

    execution_time = end_time - start_time
    print(f"Execution time: {execution_time:.2f} seconds")

The complete async Python scraping script is shown below:

import asyncio
import aiohttp
from bs4 import BeautifulSoup
import csv
import time

async def scrape_url(session, url):
    async with session.get(url) as response:
        print(f"Scraping page: '{url}'")

        html_content = await response.text()
        soup = BeautifulSoup(html_content, "html.parser")
        quote_html_elements = soup.select(".quote")

        quotes = []
        for quote_html_element in quote_html_elements:
            text = quote_html_element.select_one(".text").get_text()
            author = quote_html_element.select_one(".author").get_text()
            tags = [tag.get_text() for tag in quote_html_element.select(".tag")]
            quotes.append({
                "text": text,
                "author": author,
                "tags": ", ".join(tags)
            })

        print(f"Page '{url}' scraped successfully\n")

        return quotes

async def scrape_quotes():
    urls = [
        "http://quotes.toscrape.com/",
        "https://quotes.toscrape.com/page/2/",
        "https://quotes.toscrape.com/page/3/",
        "https://quotes.toscrape.com/page/4/",
        "https://quotes.toscrape.com/page/5/",
        "https://quotes.toscrape.com/page/6/",
        "https://quotes.toscrape.com/page/7/",
        "https://quotes.toscrape.com/page/8/",
        "https://quotes.toscrape.com/page/9/",
        "https://quotes.toscrape.com/page/10/"
    ]

    # executing the scraping tasks concurrently
    async with aiohttp.ClientSession() as session:
        tasks = [scrape_url(session, url) for url in urls]
        results = await asyncio.gather(*tasks)

    # flatten the results list
    quotes = [quote for sublist in results for quote in sublist]

    print("Exporting scraped data to CSV")

    with open("quotes_multiprocessing.csv", "w", newline="", encoding="utf-8") as csvfile:
        fieldnames = ["text", "author", "tags"]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(quotes)

    print("Quotes exported to CSV\n")

if __name__ == "__main__":
    start_time = time.time()
    asyncio.run(scrape_quotes())
    end_time = time.time()

    execution_time = end_time - start_time
    print(f"Execution time: {execution_time:.2f} seconds")

Run it, and the logs will look something like this:

Scraping page: 'http://quotes.toscrape.com/'
Page 'http://quotes.toscrape.com/' scraped successfully                                                                

Scraping page: 'https://quotes.toscrape.com/page/3/'
Scraping page: 'https://quotes.toscrape.com/page/7/'
Scraping page: 'https://quotes.toscrape.com/page/9/'
Scraping page: 'https://quotes.toscrape.com/page/6/'
Scraping page: 'https://quotes.toscrape.com/page/8/'
Scraping page: 'https://quotes.toscrape.com/page/10/'
Page 'https://quotes.toscrape.com/page/3/' scraped successfully

Scraping page: 'https://quotes.toscrape.com/page/5/'
Scraping page: 'https://quotes.toscrape.com/page/4/'
Page 'https://quotes.toscrape.com/page/7/' scraped successfully

Page 'https://quotes.toscrape.com/page/9/' scraped successfully

Page 'https://quotes.toscrape.com/page/6/' scraped successfully

Scraping page: 'https://quotes.toscrape.com/page/2/'
Page 'https://quotes.toscrape.com/page/10/' scraped successfully

Page 'https://quotes.toscrape.com/page/5/' scraped successfully

Page 'https://quotes.toscrape.com/page/4/' scraped successfully

Page 'https://quotes.toscrape.com/page/8/' scraped successfully

Page 'https://quotes.toscrape.com/page/2/' scraped successfully

Exporting scraped data to CSV
Quotes exported to CSV

Execution time: 0.51 seconds

As you can see, the execution speed is comparable to multithreading, with the advantage that you do not have to manage threads yourself.

👍 Pros

  • Delivers a major speed boost
  • Async logic is widely recommended in modern programming
  • No manual thread or process management

👎 Cons

  • Steeper learning curve
  • Pages are no longer processed in the original list order
  • Requires dedicated async libraries

5. Other Tips and Tricks to Speed Up Scraping

Other ways to speed up your scraper include:

  • Optimizing the request rate: Tune the pacing of your requests to find the sweet spot between speed and avoiding rate limits or bans (see the sketch after this list).
  • Using rotating proxies: Rotate through multiple IP addresses to lower the risk of bans and scrape faster. Learn more about the best rotating proxies.
  • Distributing the scraping in parallel: Spread scraping tasks across multiple servers or cloud machines.
  • Reducing JavaScript rendering: Avoid spinning up headless browsers where you can, and prefer an HTTP client plus an HTML parsing library. Browsers are resource-hungry and usually much slower than parsing HTML directly.
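
For the first point, one common pattern with the async approach shown earlier is to cap concurrency with a semaphore, which keeps most of the speed gains while avoiding hammering the server. This is only a sketch, and the limit of 5 is arbitrary:

import asyncio

import aiohttp

async def fetch(session, semaphore, url):
    # the semaphore caps how many requests are in flight at the same time
    async with semaphore:
        async with session.get(url) as response:
            return await response.text()

async def main():
    semaphore = asyncio.Semaphore(5)  # arbitrary limit of 5 concurrent requests
    urls = [f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 11)]
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, semaphore, url) for url in urls))
    print(f"Downloaded {len(pages)} pages")

asyncio.run(main())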

Summary

This guide showed several ways to speed up web scraping. We dug into the main reasons scraping scripts slow down and, using a sample Python script, walked through various optimizations. With only small changes to the scraping logic, execution time improved by almost 9x.

Of course, hand-optimizing your scraping logic matters, but choosing the right tools matters just as much. Things get more complicated when the target site relies heavily on dynamic rendering and you need browser automation tools, since browsers tend to be slow and resource-hungry.

To tackle these challenges, try Scraping Browser, a fully managed cloud scraping solution that integrates seamlessly with popular browser automation tools such as Puppeteer, Selenium, and Playwright. It comes with automatic CAPTCHA solving and is backed by a proxy network of 72+ million residential IPs, so it scales to handle any scraping workload!

Sign up now and start your free trial.