Web Scraping With Scala and jsoup in 2025

Learn how to scrape web pages with Scala and jsoup, from project setup to selecting elements and extracting their data.

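Python and JavaScript dominate the web scraping industry. If you need high performance or portability, Scala is a solid alternative: it gives you a compiled, portable, and strongly typed foundation to build on.

Today we'll walk through scraping with Scala and jsoup. Scala gets far less attention than Python when it comes to scraping, but it has a solid foundation and practical scraping tools of its own.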

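Why Scala?

Compared with Python or JavaScript, you might choose Scala for the following reasons:

  • Performance: Scala compiles to bytecode for the JVM (Java Virtual Machine), which the JVM can compile further to native code at runtime. For CPU-heavy work, this typically makes it faster than interpreted Python.
  • Static typing: type checking adds an extra layer of safety. Many common bugs are caught before the program ever runs.
  • Portability: because Scala compiles to JVM bytecode, the result runs anywhere Java is installed.
  • Full Java interoperability: you can use Java dependencies in Scala code, which greatly widens the available ecosystem. jsoup itself is a Java library.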

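Getting Started

Before you begin, make sure Scala is installed. Installation instructions for Ubuntu, macOS, and Windows are listed below.

You can view the full installation documentation here.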

Ubuntu

curl -fL https://github.com/coursier/coursier/releases/latest/download/cs-x86_64-pc-linux.gz | gzip -d > cs && chmod +x cs && ./cs setup

macOS

brew install coursier && coursier setup

Windows

Download the Scala installer for Windows.

Creating a Scraper

Create a new project folder and move into it:

mkdir quote-scraper
cd quote-scraper

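Initialize a new Scala project. This command generates a Scala 3 project from the official template (it prompts for a project name) and creates a build.sbt file for managing dependencies: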

sbt new scala/scala3.g8

Next, open build.sbt and add the jsoup dependency. The complete build.sbt file should look like this:

val scala3Version = "3.6.3"

lazy val root = project
  .in(file("."))
  .settings(
    name := "quote-scraper",
    version := "0.1.0-SNAPSHOT",

    scalaVersion := scala3Version,

    libraryDependencies += "org.scalameta" %% "munit" % "1.0.0" % Test,

    libraryDependencies += "org.jsoup" % "jsoup" % "1.18.3"
  )

Then copy and paste the code below into Main.scala (the template places it under src/main/scala):

import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

@main def QuotesScraper(): Unit =
  val url = "http://quotes.toscrape.com"

  try
    val document = Jsoup.connect(url).get()
    // find all elements on the page with the class "quote"
    val quotes = document.select(".quote")

    for quote <- quotes.asScala do
      // find the elements with the class "text" inside this quote and return their text
      val text = quote.select(".text").text()
      // find the elements with the class "author" inside this quote and return their text
      val author = quote.select(".author").text()
      println(s"Quote: $text")
      println(s"Author: $author")
      println("-" * 50)

  catch case e: Exception => println(s"Error: ${e.getMessage}")

Running the Scraper

Run the following command from the project root to start the scraper:

sbt run

You should see output similar to the following:

Quote: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Author: Albert Einstein
--------------------------------------------------
Quote: “It is our choices, Harry, that show what we truly are, far more than our abilities.”
Author: J.K. Rowling
--------------------------------------------------
Quote: “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Author: Albert Einstein
--------------------------------------------------
Quote: “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Author: Jane Austen
--------------------------------------------------
Quote: “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
Author: Marilyn Monroe
--------------------------------------------------
Quote: “Try not to become a man of success. Rather become a man of value.”
Author: Albert Einstein
--------------------------------------------------
Quote: “It is better to be hated for what you are than to be loved for what you are not.”
Author: André Gide
--------------------------------------------------
Quote: “I have not failed. I've just found 10,000 ways that won't work.”
Author: Thomas A. Edison
--------------------------------------------------
Quote: “A woman is like a tea bag; you never know how strong it is until it's in hot water.”
Author: Eleanor Roosevelt
--------------------------------------------------
Quote: “A day without sunshine is like, you know, night.”
Author: Steve Martin
--------------------------------------------------
[success] Total time: 6 s, completed Feb 18, 2025, 8:58:04 PM
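
The scraper above prints each quote as soon as it finds it. If you would rather hold the results as typed values, a minimal sketch might look like the one below. The Quote case class, the parseQuotes helper, and the sample markup are all hypothetical names and data made up for illustration, and the HTML is parsed from a string so the example runs without a network call. The first line is a Scala CLI directive; in an sbt project it is just a comment, because build.sbt already provides jsoup.

```scala
//> using dep org.jsoup:jsoup:1.18.3
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

// Hypothetical model type; the name and fields are illustrative
final case class Quote(text: String, author: String)

// Parse an HTML string and turn every .quote element into a Quote value
def parseQuotes(html: String): List[Quote] =
  Jsoup.parse(html).select(".quote").asScala.toList.map { element =>
    Quote(
      text = element.select(".text").text(),
      author = element.select(".author").text()
    )
  }

// Made-up markup that mirrors the structure of the example site
val sampleQuotesHtml =
  """<div class="quote">
    |  <span class="text">First sample quote</span>
    |  <small class="author">Author One</small>
    |</div>
    |<div class="quote">
    |  <span class="text">Second sample quote</span>
    |  <small class="author">Author Two</small>
    |</div>""".stripMargin

@main def typedQuotesDemo(): Unit =
  parseQuotes(sampleQuotesHtml).foreach(q => println(s"${q.author}: ${q.text}"))
```

Because parseQuotes returns a List[Quote], the compiler checks every later use of text and author, which is the static-typing benefit mentioned earlier.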

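Selecting With jsoup

To find page elements in jsoup, we use the select() method. select() returns a list of all elements that match the given selector. Let's look at how the QuotesScraper above uses it.

In this line, document.select(".quote") returns every element on the page with the class quote: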

val quotes = document.select(".quote")

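We can also write selectors in a more structured form, such as element[attribute='some value'], which lets us apply stricter filters when selecting page elements.

The following line still returns the same page elements, but it is more specific: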

val quotes = document.select("div[class='quote']")
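
The two selectors return the same elements on the example page, but they are not equivalent in general. The sketch below compares them on made-up markup; countMatches and selectorSampleHtml are hypothetical names used only for illustration. The first line is a Scala CLI directive that sbt treats as a plain comment.

```scala
//> using dep org.jsoup:jsoup:1.18.3
import org.jsoup.Jsoup

// Hypothetical helper: count the elements in an HTML string that match a selector
def countMatches(html: String, selector: String): Int =
  Jsoup.parse(html).select(selector).size()

// Made-up markup: the second div carries an extra class, the third element is a span
val selectorSampleHtml =
  """<div class="quote">first</div>
    |<div class="quote featured">second</div>
    |<span class="quote">third</span>""".stripMargin

@main def selectorDemo(): Unit =
  // .quote matches any element whose class list contains "quote"
  println(countMatches(selectorSampleHtml, ".quote")) // 3
  // div.quote also requires the element to be a <div>
  println(countMatches(selectorSampleHtml, "div.quote")) // 2
  // div[class='quote'] requires the whole class attribute to equal "quote"
  println(countMatches(selectorSampleHtml, "div[class='quote']")) // 1
```

If the elements you want may carry additional classes, div.quote is usually the safer choice over the exact attribute match.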

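Let's look at the other uses of select() in the code. Because each quote contains only one text element and one author element, each select() call matches a single element, so we get back one quote text and one author. If a quote element contained multiple text or author elements, text() would return the combined text of all of them.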

//find objects with the class "text" and return their text
val text = quote.select(".text").text()
//find objects with the class "author" and return their text
val author = quote.select(".author").text()

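Extracting Data With jsoup

Here are some of the jsoup methods we use most often:

  • text(): extracts the text from a set of page elements. When you scrape prices from a page, for example, they usually appear as text.
  • attr(): extracts a specific attribute from a single page element. Attributes live inside the tag, and this method is commonly used to get links from a page.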

text()

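We already used text() in our initial scraper. It returns the text of the selected elements. If the example below matched two authors at once, text() would merge their text into a single string.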

//find objects with the class "text" and return their text
val text = quote.select(".text").text()
//find objects with the class "author" and return their text
val author = quote.select(".author").text()
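
To see the merging behavior in isolation, here is a small sketch that runs against made-up markup with two author elements. The names twoAuthorsHtml, mergedAuthors, separateAuthors, and firstAuthor are hypothetical. It also shows two jsoup alternatives for when you do not want the merged string: eachText() and selectFirst(). The first line is a Scala CLI directive that sbt treats as a plain comment.

```scala
//> using dep org.jsoup:jsoup:1.18.3
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

// Made-up quote block that contains two author elements
val twoAuthorsHtml =
  """<div class="quote">
    |  <small class="author">Jane Austen</small>
    |  <small class="author">Steve Martin</small>
    |</div>""".stripMargin

val authorsDoc = Jsoup.parse(twoAuthorsHtml)

// text() on the match list joins the text of every element with a space
val mergedAuthors: String = authorsDoc.select(".author").text()
// eachText() keeps one string per matched element
val separateAuthors: List[String] = authorsDoc.select(".author").eachText().asScala.toList
// selectFirst() returns only the first match, or null when nothing matches
val firstAuthor: Option[String] = Option(authorsDoc.selectFirst(".author")).map(_.text())

@main def textDemo(): Unit =
  println(mergedAuthors)   // Jane Austen Steve Martin
  println(separateAuthors) // List(Jane Austen, Steve Martin)
  println(firstAuthor)     // Some(Jane Austen)
```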

attr()

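attr() behaves differently from text(). It extracts a single attribute from a single page element. When select() matches several elements, attr() returns the value from the first one that has the attribute.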

//find link elements with the class "tag" and extract the "href" from the first one
val firstTagLink = quote.select("a[class='tag']").attr("href")

If we add this line and print firstTagLink alongside the quote and author, our output looks like this:

Quote: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Author: Albert Einstein
First Tag Link: /tag/change/page/1/
--------------------------------------------------
Quote: “It is our choices, Harry, that show what we truly are, far more than our abilities.”
Author: J.K. Rowling
First Tag Link: /tag/abilities/page/1/
--------------------------------------------------
Quote: “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Author: Albert Einstein
First Tag Link: /tag/inspirational/page/1/
--------------------------------------------------
Quote: “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Author: Jane Austen
First Tag Link: /tag/aliteracy/page/1/
--------------------------------------------------
Quote: “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
Author: Marilyn Monroe
First Tag Link: /tag/be-yourself/page/1/
--------------------------------------------------
Quote: “Try not to become a man of success. Rather become a man of value.”
Author: Albert Einstein
First Tag Link: /tag/adulthood/page/1/
--------------------------------------------------
Quote: “It is better to be hated for what you are than to be loved for what you are not.”
Author: André Gide
First Tag Link: /tag/life/page/1/
--------------------------------------------------
Quote: “I have not failed. I've just found 10,000 ways that won't work.”
Author: Thomas A. Edison
First Tag Link: /tag/edison/page/1/
--------------------------------------------------
Quote: “A woman is like a tea bag; you never know how strong it is until it's in hot water.”
Author: Eleanor Roosevelt
First Tag Link: /tag/misattributed-eleanor-roosevelt/page/1/
--------------------------------------------------
Quote: “A day without sunshine is like, you know, night.”
Author: Steve Martin
First Tag Link: /tag/humor/page/1/
--------------------------------------------------
[success] Total time: 3 s, completed Feb 18, 2025, 10:29:30 PM
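
Most quotes on the page carry more than one tag, and the links in the output are relative. The sketch below, which assumes made-up markup modeled on the example site, shows how attr() picks only the first match, how eachAttr() collects every value, and how absUrl() resolves a relative link against a base URI. The names tagsHtml, firstHref, allHrefs, and absoluteHrefs are hypothetical. The first line is a Scala CLI directive that sbt treats as a plain comment.

```scala
//> using dep org.jsoup:jsoup:1.18.3
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

// Made-up markup with two tag links inside one quote
val tagsHtml =
  """<div class="quote">
    |  <a class="tag" href="/tag/change/page/1/">change</a>
    |  <a class="tag" href="/tag/world/page/1/">world</a>
    |</div>""".stripMargin

// The second argument is the base URI that jsoup uses to resolve relative links
val tagLinks = Jsoup.parse(tagsHtml, "http://quotes.toscrape.com").select("a.tag")

// attr() reads the attribute from the first matched element that has it
val firstHref: String = tagLinks.attr("href")
// eachAttr() returns the attribute value of every matched element
val allHrefs: List[String] = tagLinks.eachAttr("href").asScala.toList
// absUrl() resolves each relative href against the base URI
val absoluteHrefs: List[String] = tagLinks.asScala.toList.map(_.absUrl("href"))

@main def attrDemo(): Unit =
  println(firstHref)     // /tag/change/page/1/
  println(allHrefs)      // List(/tag/change/page/1/, /tag/world/page/1/)
  println(absoluteHrefs) // List(http://quotes.toscrape.com/tag/change/page/1/, http://quotes.toscrape.com/tag/world/page/1/)
```

When a page is fetched with Jsoup.connect(url).get(), the document's base URI is set from the request URL, so absUrl() works there without passing a base URI yourself.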

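Other Web Scraping Tools

  • Scraping Browser: a remote browser with built-in proxy integration that works with Playwright and Selenium.
  • Web Scraper API: automate your scraping by calling our API. When you call one of the Scraper APIs, we scrape the page and return the data to you.
  • No-Code Scraper: tell us which site and which data you want, and we handle the rest.
  • Datasets: possibly the simplest extraction option of all. We scrape hundreds of websites and keep the database updated, so Datasets give you clean, analysis-ready data.

Conclusion

Web scraping with Scala is fairly straightforward. You've learned how to select page elements with jsoup and extract their data. If building scrapers isn't in your plans, you can use our automated tools to simplify the process, or skip scraping altogether with our ready-made datasets.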
