find() and find_all() in BeautifulSoup: 2025 Guide

Learn how to use BeautifulSoup's find() and find_all() methods in Python to efficiently scrape web data by class, ID, text, and attributes.
7 min read
How to Use BeautifulSoup's find and find_all

find() and find_all() are the core methods you'll reach for when web scraping with BeautifulSoup; they help you pull data out of HTML. The find() method retrieves the first element that matches your criteria; for example, find("div") returns the first div tag on the page, or None if there is no match. find_all(), on the other hand, returns every matching element as a list, which is what you want when extracting multiple elements (say, all div tags). Before you start scraping with BeautifulSoup, make sure Requests and BeautifulSoup are installed; a short sketch of the find()/find_all() difference follows the installation step below.

Install the dependencies

pip install requests

pip install beautifulsoup4
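
With the dependencies installed, here's a minimal sketch of the difference described above. It parses an inline HTML string (the snippet and variable names are just placeholders for illustration), so no network request is needed:

from bs4 import BeautifulSoup

html = "<div>first</div><div>second</div>"
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element
print(soup.find("div").text)  # first

# find() returns None when nothing matches, so check before using the result
print(soup.find("span"))  # None

# find_all() returns every match as a list
print([d.text for d in soup.find_all("div")])  # ['first', 'second']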

find()

Let's start with find(). In the examples below, we'll use Quotes to Scrape and the Fake Store API to locate elements on the page. Both sites are built for teaching and demonstrating scrapers, and their content rarely changes, which makes them ideal for practice.

Finding by Class

To find an element by its class, use the class_ keyword. You might wonder why it's class_ rather than class: class is a reserved keyword in Python used to define classes, so class_ avoids the conflict.

The example below finds the first div whose class is quote:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com")

soup = BeautifulSoup(response.text, "html.parser")

first_quote = soup.find("div", class_="quote")
print(first_quote.text)

Here's the output:

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
by Albert Einstein
(about)


            Tags:
            
change
deep-thoughts
thinking
world
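
As a side note, class_ is just a convenience shortcut: the same filter can be written through the attrs argument. Here's a minimal sketch against the same quotes.toscrape.com page; both calls resolve to the same tag:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com")
soup = BeautifulSoup(response.text, "html.parser")

# class_ is shorthand for filtering on the HTML class attribute
by_keyword = soup.find("div", class_="quote")
by_attrs = soup.find("div", attrs={"class": "quote"})

print(by_keyword == by_attrs)  # True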

Finding by ID

When scraping, you'll also often need to find elements by their id. In the example below, we use the id argument to find the menu on this page, a ul whose id is menu:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://fakestoreapi.com")

soup = BeautifulSoup(response.text, "html.parser")

ul = soup.find("ul", id="menu")

print(ul.text)

Here's the menu content we extracted and printed to the terminal:

Home
Docs
GitHub
Buy me a coffee

Finding by Text

We can also search by an element's text content. To do so, use the string argument. The example below finds the button on the page whose text is Login:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com")

soup = BeautifulSoup(response.text, "html.parser")

login_button = soup.find("a", string="Login")
print(login_button.text)

As you can see, Login is printed to the console:

Login

Finding by Attribute

We can also use other attributes for stricter filtering. This time we'll again look for the first quote on the page, but we'll target the span whose itemprop is text. That way we get only the quote itself, without extras such as the author and tags:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com")

soup = BeautifulSoup(response.text, "html.parser")

first_clean_quote = soup.find("span", attrs={"itemprop": "text"})

print(first_clean_quote.text)

Here's the clean quote we get back:

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”

Finding with Multiple Conditions

You may have noticed that the attrs argument takes a dict rather than a single value. That lets us pass several conditions at once for more precise filtering. Here, we find the first author on the page by matching on both the class and itemprop attributes:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com")

soup = BeautifulSoup(response.text, "html.parser")

first_author = soup.find("small", attrs={"class": "author", "itemprop": "author"})
print(first_author.text)

After running it, you should see Albert Einstein as the output:

Albert Einstein

find_all()

Now let's run through the same examples with find_all(). We'll keep using Quotes to Scrape and the Fake Store API. The key difference: find() returns a single element, while find_all() returns a list of matching page elements.

Finding by Class

To find elements by their class attribute, use the class_ argument. The code below uses find_all() to extract every element on the page whose class is quote:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com")

soup = BeautifulSoup(response.text, "html.parser")

quotes = soup.find_all("div", class_="quote")

for quote in quotes:
    print("-------------")
    print(quote.text)

When we extract and print every quote on the home page, the output looks like this:

-------------

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
by Albert Einstein
(about)


            Tags:
            
change
deep-thoughts
thinking
world


-------------

“It is our choices, Harry, that show what we truly are, far more than our abilities.”
by J.K. Rowling
(about)


            Tags:
            
abilities
choices


-------------

“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
by Albert Einstein
(about)


            Tags:
            
inspirational
life
live
miracle
miracles


-------------

“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
by Jane Austen
(about)


            Tags:
            
aliteracy
books
classic
humor


-------------

“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
by Marilyn Monroe
(about)


            Tags:
            
be-yourself
inspirational


-------------

“Try not to become a man of success. Rather become a man of value.”
by Albert Einstein
(about)


            Tags:
            
adulthood
success
value


-------------

“It is better to be hated for what you are than to be loved for what you are not.”
by André Gide
(about)


            Tags:
            
life
love


-------------

“I have not failed. I've just found 10,000 ways that won't work.”
by Thomas A. Edison
(about)


            Tags:
            
edison
failure
inspirational
paraphrased


-------------

“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
by Eleanor Roosevelt
(about)


            Tags:
            
misattributed-eleanor-roosevelt


-------------

“A day without sunshine is like, you know, night.”
by Steve Martin
(about)


            Tags:
            
humor
obvious
simile

Finding by ID

As we saw with find(), id is another common way to look elements up, and extracting by id works the same way here. In the code below, we look for all ul elements whose id is menu. In practice the page only has one, so only one will show up in the results:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://fakestoreapi.com")

soup = BeautifulSoup(response.text, "html.parser")

uls = soup.find_all("ul", id="menu")

for ul in uls:
    print("-------------")
    print(ul.text)

Since there's only one menu on the page, the output is identical to the find() result:

-------------

Home
Docs
GitHub
Buy me a coffee

Finding by Text

Next, let's find page elements by their text content, again using the string argument. In this example, we look for all a elements whose text is Login. We say "all", but there's really only one on the page:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com")

soup = BeautifulSoup(response.text, "html.parser")

login_buttons = soup.find_all("a", string="Login")

for button in login_buttons:
    print("-------------")
    print(button)

Here's the output:

-------------
<a href="/login">Login</a>

Finding by Attribute

In real-world scraping you'll often need to filter by other attributes. Remember how messy the output of our first example was? In the code below, we use itemprop to extract only the quote text:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com")

soup = BeautifulSoup(response.text, "html.parser")

clean_quotes = soup.find_all("span", attrs={"itemprop": "text"})

for quote in clean_quotes:
    print("-------------")
    print(quote.text)

As you can see, our output is much cleaner:

-------------
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
-------------
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
-------------
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
-------------
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
-------------
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
-------------
“Try not to become a man of success. Rather become a man of value.”
-------------
“It is better to be hated for what you are than to be loved for what you are not.”
-------------
“I have not failed. I've just found 10,000 ways that won't work.”
-------------
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
-------------
“A day without sunshine is like, you know, night.”

Finding with Multiple Conditions

This time, we'll pass a more complex filter through the attrs argument. In the example below, we find all small elements whose class is author and whose itemprop is author, by passing multiple key-value pairs to attrs:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com")

soup = BeautifulSoup(response.text, "html.parser")

authors = soup.find_all("small", attrs={"class": "author", "itemprop": "author"})

for author in authors:
    print("-------------")
    print(author.text)

Here's the list of authors printed to the console:

-------------
Albert Einstein
-------------
J.K. Rowling
-------------
Albert Einstein
-------------
Jane Austen
-------------
Marilyn Monroe
-------------
Albert Einstein
-------------
André Gide
-------------
Thomas A. Edison
-------------
Eleanor Roosevelt
-------------
Steve Martin

Advanced Techniques

Here are a few more advanced techniques. The examples use find_all(), but they work just as well with find(). The question is simply whether you want a single element or the whole list.

Regular Expressions (Regex)

Regular expressions are a powerful tool for string matching. In the example below, we combine one with the string argument to find every element containing einstein (case-insensitive):

import requests
import re
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com")

soup = BeautifulSoup(response.text, "html.parser")

pattern = re.compile(r"einstein", re.IGNORECASE)

tags = soup.find_all(string=pattern)

print(f"Total Einstein quotes: {len(tags)}")

A total of 3 Einstein-related matches were found on the page:

Total Einstein quotes: 3
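
As mentioned above, everything here works with find() as well; it simply stops at the first match. Here's a minimal, self-contained sketch of the same regex filter used with find():

import requests
import re
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com")
soup = BeautifulSoup(response.text, "html.parser")

pattern = re.compile(r"einstein", re.IGNORECASE)

# find() takes the same string/regex filter but returns only the first match
first_match = soup.find(string=pattern)
print(first_match)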

Custom Functions

Now let's write a custom function that returns every Einstein quote. The example below builds on the regex approach: we follow parent elements up until we reach the card that contains the quote, then find all span elements inside that card. The first span holds the quote text, which we print to the console:

import requests
import re
from bs4 import BeautifulSoup

def find_einstein_quotes(http_response):
    soup = BeautifulSoup(http_response.text, "html.parser")

    #find all einstein tags
    pattern = re.compile(r"einstein", re.IGNORECASE)
    tags = soup.find_all(string=pattern)

    for tag in tags:
        #follow the parents until we have the quote card
        full_card = tag.parent.parent.parent

        #find the spans
        spans = full_card.find_all("span")

        #print the first span, it contains the actual quote
        print(spans[0].text)


if __name__ == "__main__":
    response = requests.get("https://quotes.toscrape.com")
    find_einstein_quotes(response)

Here's our output:

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“Try not to become a man of success. Rather become a man of value.”
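
Chaining .parent three times works, but it silently depends on the exact nesting of the markup. BeautifulSoup also provides find_parent(), which walks up the tree until it reaches a matching ancestor. Here's a minimal sketch of that variation, assuming (as shown earlier) that each quote card is a div with class quote:

import requests
import re
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com")
soup = BeautifulSoup(response.text, "html.parser")

pattern = re.compile(r"einstein", re.IGNORECASE)

for tag in soup.find_all(string=pattern):
    # walk up to the enclosing quote card instead of chaining .parent
    card = tag.find_parent("div", class_="quote")
    if card is not None:
        print(card.find("span", attrs={"itemprop": "text"}).text)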

Bonus: Finding with CSS Selectors

BeautifulSoup's select() method works much like find_all(), but it's a bit more flexible: it takes a CSS selector, so anything you can express as a CSS selector you can use to find elements. In the code below, we find all of the authors using multiple attributes, this time written as a single CSS selector:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com")

soup = BeautifulSoup(response.text, "html.parser")

authors = soup.select("small[class='author'][itemprop='author']")

for author in authors:
    print("-------------")
    print(author.text)

Here's the output:

-------------
Albert Einstein
-------------
J.K. Rowling
-------------
Albert Einstein
-------------
Jane Austen
-------------
Marilyn Monroe
-------------
Albert Einstein
-------------
André Gide
-------------
Thomas A. Edison
-------------
Eleanor Roosevelt
-------------
Steve Martin

Conclusion

At this point you have a solid grasp of how find() and find_all() work in BeautifulSoup. You don't need to memorize every method; just remember that BeautifulSoup gives you plenty of ways to search, and you can mix and match them to pull data from almost any page. In production, if you need fast scraping with a high success rate, consider our Residential Proxies or Scraping Browser (with built-in proxy management and CAPTCHA solving).

Sign up today and start your free trial to find the product that best fits your needs.