find() and find_all() are the core methods for web scraping with BeautifulSoup, letting you extract data from HTML. find() returns the first element that matches your criteria: for example, find("div") returns the first div tag on the page, or None if there is no match. find_all() returns every matching element as a list, which makes it the right choice when you need to extract multiple elements, such as all div tags. Before you start scraping with BeautifulSoup, make sure Requests and BeautifulSoup are installed.
Installing dependencies
pip install requests
pip install beautifulsoup4
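One consequence of these return types is worth internalizing before scraping real pages: a missed find() hands you None, and calling .text on None raises an AttributeError, while a missed find_all() simply gives back an empty list. A minimal sketch (the HTML fragment here is made up purely for illustration):

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment, used only to illustrate the return types
html = "<div class='quote'>Hello</div>"
soup = BeautifulSoup(html, "html.parser")

print(soup.find("span"))      # no <span> exists -> prints None
print(soup.find_all("span"))  # prints [] (an empty list)

# Guard against None before touching .text, or the scraper will crash
tag = soup.find("span")
if tag is not None:
    print(tag.text)
```

Looping over a find_all() result is always safe, since iterating an empty list simply does nothing.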
find()
Let's start with find(). In the examples below, we'll use Quotes to Scrape and the Fake Store API to locate elements on a page. Both sites are built for teaching and scraper demos, and their content rarely changes, which makes them ideal for practice.
Finding by class
To find an element by its class, use the class_ keyword. You might wonder why it's class_ rather than class: class is a reserved keyword in Python, used to define classes, and class_ avoids the conflict.
The example below finds the first div whose class is quote:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://quotes.toscrape.com")
soup = BeautifulSoup(response.text, "html.parser")
first_quote = soup.find("div", class_="quote")
print(first_quote.text)
Here's the output:
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
by Albert Einstein
(about)
Tags:
change
deep-thoughts
thinking
world
Finding by ID
When scraping, you'll often need to locate an element by its id. In the example below, we use the id argument to find the page's menu, which has the id menu.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://fakestoreapi.com")
soup = BeautifulSoup(response.text, "html.parser")
ul = soup.find("ul", id="menu")
print(ul.text)
Here's the menu content we extracted and printed to the terminal:
Home
Docs
GitHub
Buy me a coffee
Finding by text
We can also search by an element's text content using the string argument. The example below finds the link on the page whose text is Login:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://quotes.toscrape.com")
soup = BeautifulSoup(response.text, "html.parser")
login_button = soup.find("a", string="Login")
print(login_button.text)
As you can see, Login is printed to the console:
Login
通过属性查找
我们也可以使用其他属性来进行更严格的筛选。这一次,我们依然查找页面上的第一个名言,但会寻找 span
,其 itemprop
为 text
。这样就能只获取名言本身,而不包含作者和标签等额外信息:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://quotes.toscrape.com")
soup = BeautifulSoup(response.text, "html.parser")
first_clean_quote = soup.find("span", attrs={"itemprop": "text"})
print(first_clean_quote.text)
Here's the clean quote we get back:
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Finding with multiple conditions
You may have noticed that the attrs argument takes a dict rather than a single value. That lets us pass several conditions for more precise filtering. Here, we find the first author on the page by matching both the class and itemprop attributes:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://quotes.toscrape.com")
soup = BeautifulSoup(response.text, "html.parser")
first_author = soup.find("small", attrs={"class": "author", "itemprop": "author"})
print(first_author.text)
After running it, you should see Albert Einstein as the output:
Albert Einstein
find_all()
Now let's walk through the same examples with find_all(). We'll keep using Quotes to Scrape and the Fake Store API. The key difference: find() returns a single element, while find_all() returns a list of matching elements from the page.
Finding by class
To find elements by their class attribute, use the class_ argument. The code below uses find_all() to extract every element on the page whose class is quote:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://quotes.toscrape.com")
soup = BeautifulSoup(response.text, "html.parser")
quotes = soup.find_all("div", class_="quote")
for quote in quotes:
    print("-------------")
    print(quote.text)
Here's the output when we extract and print every quote on the first page:
-------------
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
by Albert Einstein
(about)
Tags:
change
deep-thoughts
thinking
world
-------------
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
by J.K. Rowling
(about)
Tags:
abilities
choices
-------------
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
by Albert Einstein
(about)
Tags:
inspirational
life
live
miracle
miracles
-------------
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
by Jane Austen
(about)
Tags:
aliteracy
books
classic
humor
-------------
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
by Marilyn Monroe
(about)
Tags:
be-yourself
inspirational
-------------
“Try not to become a man of success. Rather become a man of value.”
by Albert Einstein
(about)
Tags:
adulthood
success
value
-------------
“It is better to be hated for what you are than to be loved for what you are not.”
by André Gide
(about)
Tags:
life
love
-------------
“I have not failed. I've just found 10,000 ways that won't work.”
by Thomas A. Edison
(about)
Tags:
edison
failure
inspirational
paraphrased
-------------
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
by Eleanor Roosevelt
(about)
Tags:
misattributed-eleanor-roosevelt
-------------
“A day without sunshine is like, you know, night.”
by Steve Martin
(about)
Tags:
humor
obvious
simile
通过 ID 查找
正如在 find()
中所示,id
也是一种常用的查找方式。通过 id
来提取元素的方法与之前相同。我们在下面的代码中会查找所有 ul
,它们的 id
为 menu
。实际上页面中只会有一个,所以结果中也只会找到一个:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://fakestoreapi.com")
soup = BeautifulSoup(response.text, "html.parser")
uls = soup.find_all("ul", id="menu")
for ul in uls:
    print("-------------")
    print(ul.text)
Since the page has only one menu, the output is identical to the find() result:
-------------
Home
Docs
GitHub
Buy me a coffee
Finding by text
Next we find page elements by their text content, again using the string argument. In this example, we find all a elements containing the text Login. We say "all," but the page actually has only one:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://quotes.toscrape.com")
soup = BeautifulSoup(response.text, "html.parser")
login_buttons = soup.find_all("a", string="Login")
for button in login_buttons:
    print("-------------")
    print(button)
Here's the output:
-------------
<a href="/login">Login</a>
Finding by attribute
In real-world scraping, you'll often need to filter data on other attributes. Remember how messy the output of our first example was? In the code below, we use itemprop to extract only the quote text:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://quotes.toscrape.com")
soup = BeautifulSoup(response.text, "html.parser")
clean_quotes = soup.find_all("span", attrs={"itemprop": "text"})
for quote in clean_quotes:
    print("-------------")
    print(quote.text)
As you can see, the output is much cleaner:
-------------
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
-------------
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
-------------
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
-------------
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
-------------
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
-------------
“Try not to become a man of success. Rather become a man of value.”
-------------
“It is better to be hated for what you are than to be loved for what you are not.”
-------------
“I have not failed. I've just found 10,000 ways that won't work.”
-------------
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
-------------
“A day without sunshine is like, you know, night.”
使用多重条件查找
这一次,我们会在 attrs
参数中传入更复杂的筛选条件。在下面的示例中,我们查找所有 small
元素,其 class
为 author
且 itemprop
为 author
。通过给 attrs
传入多个键值对实现:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://quotes.toscrape.com")
soup = BeautifulSoup(response.text, "html.parser")
authors = soup.find_all("small", attrs={"class": "author", "itemprop": "author"})
for author in authors:
    print("-------------")
    print(author.text)
Here's the list of authors printed to the console:
-------------
Albert Einstein
-------------
J.K. Rowling
-------------
Albert Einstein
-------------
Jane Austen
-------------
Marilyn Monroe
-------------
Albert Einstein
-------------
André Gide
-------------
Thomas A. Edison
-------------
Eleanor Roosevelt
-------------
Steve Martin
Advanced techniques
Here are a few more advanced techniques. The examples use find_all(), but everything works with find() as well. The key question is whether you want a single element or the whole list.
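If it helps to see that relationship directly: find() behaves like find_all() with its limit argument set to 1, except that it returns the element itself rather than a one-item list. A quick sketch on an invented three-paragraph fragment:

```python
from bs4 import BeautifulSoup

# Invented fragment with three <p> tags, just for demonstration
soup = BeautifulSoup("<p>a</p><p>b</p><p>c</p>", "html.parser")

first = soup.find("p")                 # the element itself
limited = soup.find_all("p", limit=1)  # a list containing at most one element
everything = soup.find_all("p")        # the full list

print(first.text)                # a
print([p.text for p in limited]) # ['a']
print(len(everything))           # 3
```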
Regular expressions (regex)
Regular expressions are powerful for string matching. In the example below, we combine one with the string argument to find every text string containing einstein, ignoring case:
import requests
import re
from bs4 import BeautifulSoup
response = requests.get("https://quotes.toscrape.com")
soup = BeautifulSoup(response.text, "html.parser")
pattern = re.compile(r"einstein", re.IGNORECASE)
tags = soup.find_all(string=pattern)
print(f"Total Einstein quotes: {len(tags)}")
The page yields three Einstein-related matches in total:
Total Einstein quotes: 3
自定义函数
现在我们来写一个自定义函数,专门返回所有与 Einstein 相关的名言。下面的例子中,我们在正则表达式的基础上做了进一步扩展。我们先通过 parent
一层层向上查找名言所在的卡片,接着再找出卡片中的所有 span
,其中第一个 span
存放的就是名言内容,最后打印到控制台:
import requests
import re
from bs4 import BeautifulSoup
def find_einstein_quotes(http_response):
    soup = BeautifulSoup(http_response.text, "html.parser")
    # find all strings matching "einstein"
    pattern = re.compile(r"einstein", re.IGNORECASE)
    tags = soup.find_all(string=pattern)
    for tag in tags:
        # follow the parents until we have the quote card
        full_card = tag.parent.parent.parent
        # find the spans inside the card
        spans = full_card.find_all("span")
        # print the first span, it contains the actual quote
        print(spans[0].text)

if __name__ == "__main__":
    response = requests.get("https://quotes.toscrape.com")
    find_einstein_quotes(response)
Here's our output:
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“Try not to become a man of success. Rather become a man of value.”
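The example above does its filtering with a regex and some parent-hopping, but find_all() can also take a callable directly: it calls your function once per tag and keeps the tags for which it returns True. A small sketch against an invented fragment that loosely mimics the quote page's markup:

```python
from bs4 import BeautifulSoup

# Invented fragment loosely mimicking the quote page's markup
html = """
<span itemprop="text">quote one</span>
<span class="tag">not a quote</span>
<small class="author">Albert Einstein</small>
"""
soup = BeautifulSoup(html, "html.parser")

def is_quote_span(tag):
    # keep only <span> tags whose itemprop attribute is "text"
    return tag.name == "span" and tag.get("itemprop") == "text"

quote_spans = soup.find_all(is_quote_span)
for span in quote_spans:
    print(span.text)
```

This keeps all the matching logic in one named, reusable predicate instead of scattering it across attrs dicts.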
Bonus: finding with CSS selectors
BeautifulSoup's select method works much like find_all(), but it's more flexible: it accepts a CSS selector, so anything you can express as a selector, you can use to find elements. In the code below, we find all the authors using multiple attributes, this time written as a single CSS selector:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://quotes.toscrape.com")
soup = BeautifulSoup(response.text, "html.parser")
authors = soup.select("small[class='author'][itemprop='author']")
for author in authors:
    print("-------------")
    print(author.text)
Here's the output:
-------------
Albert Einstein
-------------
J.K. Rowling
-------------
Albert Einstein
-------------
Jane Austen
-------------
Marilyn Monroe
-------------
Albert Einstein
-------------
André Gide
-------------
Thomas A. Edison
-------------
Eleanor Roosevelt
-------------
Steve Martin
Conclusion
By now you have a solid grasp of how find() and find_all() work in BeautifulSoup. You don't need to memorize every method, but remember that BeautifulSoup offers many ways to search, flexible enough to extract data from almost any page. In production, if you need fast scraping with a high success rate, consider our Residential Proxies or our Scraping Browser, which comes with built-in proxy management and CAPTCHA solving.
Sign up now and start your free trial to find the product that best fits your needs.