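Every interaction on the internet requires an IP address. Websites use IP addresses to identify one or more users and to determine your location and other metadata, such as your internet service provider (ISP), time zone, or device type. Web servers use this information to tailor or restrict content and resources. This means that when you scrape the web, a website can block requests from your IP address if it considers the traffic pattern or behavior unusual, bot-like, or malicious. Thankfully, proxy servers can help.
A proxy server is an intermediary server that acts as a gateway between a user and the internet. It receives requests from the user, forwards them to the web resource, and returns the fetched data to the user. By hiding your real IP address, a proxy server adds security, privacy, and anonymity, which helps you browse and scrape discreetly.
Proxy servers can also help you get around IP bans by changing your IP address so that requests appear to come from different users. Proxy servers located in different regions let you bypass geo-blocking and access location-specific content, such as movies or news.
In this article, you'll learn how to set up a proxy server for web scraping in Go. You'll also learn about Bright Data's proxy servers and how they can simplify the process.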
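Set Up a Proxy Server
In this tutorial, you'll modify a web scraper application written in Go so that it interacts with the It's FOSS website through a local or self-hosted proxy server. The tutorial assumes that your Go development environment is already set up.
To begin, you need to set up a proxy server using Squid, an open source proxy server software. If you're familiar with another proxy server software, you can use that instead. This article uses Squid on a Fedora 39 Linux box. For most Linux distributions, Squid is included in the default repositories. You can also check the documentation to download the package for your operating system.
From your terminal, run the following command to install Squid: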
sudo dnf install squid -y
Once that completes, enable and start the service with the following command:
sudo systemctl enable --now squid
Check the status of the service with the following command:
sudo systemctl status squid
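The output should report the service as active and running. By default, Squid runs and listens for requests on port 3128. Use the following curl command to test communication through the proxy server: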
curl --proxy 127.0.0.1:3128 "http://lumtest.com/myip.json"
Your response should look like this:
curl --proxy 127.0.0.1:3128 "http://lumtest.com/myip.json"
{"ip":"196.43.196.126","country":"GH","asn":{"asnum":327695,"org_name":"AITI"},"geo":{"city":"","region":"","region_name":"","postal_code":"","latitude":8.1,"longitude":-1.2,"tz":"Africa/Accra"}}
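The metadata should include your public IP address along with the country and organization that own it. It also confirms that you have a working proxy server installed.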
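If you'd like to read this response from Go rather than by eye, it can be decoded into a small struct. The following is a minimal sketch that only assumes the JSON shape shown in the curl response above; the `ipInfo` type and the `parseIPInfo` and `summary` helpers are illustrative names, not part of the demo repository.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ipInfo mirrors a subset of the fields in the Lumtest response shown above.
type ipInfo struct {
	IP      string `json:"ip"`
	Country string `json:"country"`
	ASN     struct {
		Number  int    `json:"asnum"`
		OrgName string `json:"org_name"`
	} `json:"asn"`
	Geo struct {
		City      string  `json:"city"`
		Latitude  float64 `json:"latitude"`
		Longitude float64 `json:"longitude"`
		Timezone  string  `json:"tz"`
	} `json:"geo"`
}

// parseIPInfo decodes a Lumtest-style JSON document into an ipInfo value.
func parseIPInfo(raw string) (ipInfo, error) {
	var info ipInfo
	err := json.Unmarshal([]byte(raw), &info)
	return info, err
}

// summary formats the fields that are most useful for confirming the exit IP.
func summary(info ipInfo) string {
	return fmt.Sprintf("%s (%s, %s)", info.IP, info.Country, info.ASN.OrgName)
}

func main() {
	// Sample copied from the curl response above.
	sample := `{"ip":"196.43.196.126","country":"GH","asn":{"asnum":327695,"org_name":"AITI"},"geo":{"city":"","region":"","region_name":"","postal_code":"","latitude":8.1,"longitude":-1.2,"tz":"Africa/Accra"}}`

	info, err := parseIPInfo(sample)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(summary(info))
}
```

Comparing the `IP` and `Country` fields between runs is a simple way to confirm which connection a scraper actually used.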
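Set Up the Demo Scraper
To make it easier to follow along, a simple Go web scraper application is provided in this GitHub repository. The scraper fetches the title, excerpt, and categories of the latest blog posts on It's FOSS, a popular blog that discusses open source software products. It then visits Lumtest to get information about the IP address that the scraper's HTTP client uses to interact with the web. The same logic is implemented three times, each with a different Go package: Colly, goquery, and Selenium. In the next section, you'll learn how to modify each implementation to use a proxy server.
Start by cloning the repository with the following command in your favorite terminal or shell: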
$ git clone https://github.com/rexfordnyrk/go_scrap_proxy.git
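The repository consists of two branches: the main branch, which contains the completed code, and the basic branch, which contains the initial code that you'll modify. Use the following command to check out the basic branch: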
$ git checkout basic
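This branch contains three .go files, one for each library's implementation of the scraper without a proxy configured. It also contains the chromedriver executable, which the Selenium implementation of the scraper requires: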
.
├── chromedriver
├── colly.go
├── go.mod
├── goquery.go
├── go.sum
├── LICENSE
├── README.md
└── selenium.go
1 directory, 8 files
You can run any of them individually using the go run command with the specific file name. For example, the following command runs the scraper with Colly:
go run ./colly.go
The output should look like this:
$ go run ./colly.go
Article 0: {"category":"Newsletter ✉️","excerpt":"Unwind your new year celebration with new open-source projects, and keep an eye on interesting distro updates.","title":"FOSS Weekly #24.02: Mixing AI With Linux, Vanilla OS 2, and More"}
Article 1: {"category":"Tutorial","excerpt":"Wondering how to use tiling windows on GNOME? Try the tiling assistant. Here's how it works.","title":"How to Use Tiling Assistant on GNOME Desktop?"}
Article 2: {"category":"Linux Commands","excerpt":"The free command in Linux helps you gain insights on system memory usage (RAM), and more. Here's how to make good use of it.","title":"Free Command Examples"}
Article 3: {"category":"Gaming 🎮","excerpt":"Here are the best tips to make your Linux gaming experience enjoyable.","title":"7 Tips and Tools to Improve Your Gaming Experience on Linux"}
Article 4: {"category":"Newsletter ✉️","excerpt":"The first edition of FOSS Weekly in the year 2024 is here. See, what's new in the new year.","title":"FOSS Weekly #24.01: Linux in 2024, GDM Customization, Distros You Missed Last Year"}
Article 5: {"category":"Tutorial","excerpt":"Wondering which init service your Linux system uses? Here's how to find it out.","title":"How to Check if Your Linux System Uses systemd"}
Article 6: {"category":"Ubuntu","excerpt":"Learn the logic behind each step you have to follow for adding an external repository in Ubuntu and installing packages from it.","title":"Installing Packages From External Repositories in Ubuntu [Explained]"}
Article 7: {"category":"Troubleshoot 🔬","excerpt":"Getting a warning that the boot partition has no space left? Here are some ways you can free up space on the boot partition in Ubuntu Linux.","title":"How to Free Up Space in /boot Partition on Ubuntu Linux?"}
Article 8: {"category":"Ubuntu","excerpt":"Wondering which Ubuntu version you're using? Here's how to check your Ubuntu version, desktop environment and other relevant system information.","title":"How to Check Ubuntu Version Details and Other System Information"}
Check Proxy IP map[asn:map[asnum:29614 org_name:VODAFONE GHANA AS INTERNATIONAL TRANSIT] country:GH geo:map[city:Accra latitude:5.5486 longitude:-0.2012 lum_city:accra lum_region:aa postal_code: region:AA region_name:Greater Accra Region tz:Africa/Accra] ip:197.251.144.148]
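This output contains all the article information scraped from It's FOSS. At the bottom of the output, you'll find the IP information returned from Lumtest, which tells you about the connection the scraper is currently using. Running all three implementations should give you a similar response. Once you've tested all three, you're ready to start scraping through the local proxy.
Implement the Scraper with a Local Proxy
In this section, you'll go through all three implementations of the scraper and modify them to use your proxy server. Each .go file consists of a main() function, where the application starts, and a ScrapeWithLibrary() function that contains the scraping instructions (Library stands for the library's name, as in ScrapeWithGoquery()).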
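Using goquery with a Local Proxy
goquery is a Go library that provides a set of methods and functionality for parsing and manipulating HTML documents, similar to how jQuery works in JavaScript. It's particularly useful for web scraping because it lets you traverse, query, and manipulate the structure of HTML pages. However, the library doesn't handle network requests or operations of any kind, which means that you have to fetch the HTML page yourself and hand it to goquery.
In the goquery.go file, you'll find the goquery implementation of the web scraper. Open it in your favorite IDE or text editor.
In the ScrapeWithGoquery() function, you need to modify the HTTP client's transport to use the URL of your HTTP proxy server. The URL is a combination of the host name or IP address and the port, in the format http://HOST:PORT.
Make sure to import the net/url package in this file. Then replace the HTTP client definition with the following snippet: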
...
func ScrapeWithGoquery() {
	// Define the URL of the proxy server
	proxyStr := "http://127.0.0.1:3128"

	// Parse the proxy URL
	proxyURL, err := url.Parse(proxyStr)
	if err != nil {
		fmt.Println("Error parsing proxy URL:", err)
		return
	}

	// Create an http.Transport that uses the proxy
	transport := &http.Transport{
		Proxy: http.ProxyURL(proxyURL),
	}

	// Create an HTTP client with the transport
	client := &http.Client{
		Transport: transport,
	}
...
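This snippet modifies the HTTP client by configuring its transport to use the local proxy server. Make sure to replace the IP address with the IP address of your proxy server.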
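The same configuration can be wrapped in a small helper and checked without sending any traffic. The sketch below uses only the standard library; `newProxyClient` and `proxyFor` are hypothetical helper names, and the 30-second timeout is an assumed value rather than something the demo repository sets.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// newProxyClient returns an HTTP client whose requests go through proxyStr.
func newProxyClient(proxyStr string) (*http.Client, error) {
	proxyURL, err := url.Parse(proxyStr)
	if err != nil {
		return nil, fmt.Errorf("parsing proxy URL: %w", err)
	}
	// url.Parse accepts some inputs that are not usable proxy URLs, so check the parts.
	if proxyURL.Scheme == "" || proxyURL.Host == "" {
		return nil, fmt.Errorf("proxy URL %q needs the form http://HOST:PORT", proxyStr)
	}
	return &http.Client{
		Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)},
		Timeout:   30 * time.Second, // assumed value; tune it for your targets
	}, nil
}

// proxyFor reports which proxy the client would use for target, without sending a request.
func proxyFor(client *http.Client, target string) (string, error) {
	transport, ok := client.Transport.(*http.Transport)
	if !ok || transport.Proxy == nil {
		return "", nil // no proxy configured
	}
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return "", err
	}
	proxyURL, err := transport.Proxy(req)
	if err != nil || proxyURL == nil {
		return "", err
	}
	return proxyURL.String(), nil
}

func main() {
	client, err := newProxyClient("http://127.0.0.1:3128")
	if err != nil {
		fmt.Println("Error creating client:", err)
		return
	}
	via, err := proxyFor(client, "http://lumtest.com/myip.json")
	if err != nil {
		fmt.Println("Error resolving proxy:", err)
		return
	}
	fmt.Println("Requests would be sent via", via)
}
```

Validating the scheme and host up front catches a common mistake: passing 127.0.0.1:3128 without the http:// prefix.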
Now, run this implementation with the following command from the project directory:
go run ./goquery.go
The output should look like this:
$ go run ./goquery.go
Article 0: {"category":"Newsletter ✉️","excerpt":"Unwind your new year celebration with new open-source projects, and keep an eye on interesting distro updates.","title":"FOSS Weekly #24.02: Mixing AI With Linux, Vanilla OS 2, and More"}
Article 1: {"category":"Tutorial","excerpt":"Wondering how to use tiling windows on GNOME? Try the tiling assistant. Here's how it works.","title":"How to Use Tiling Assistant on GNOME Desktop?"}
Article 2: {"category":"Linux Commands","excerpt":"The free command in Linux helps you gain insights on system memory usage (RAM), and more. Here's how to make good use of it.","title":"Free Command Examples"}
Article 3: {"category":"Gaming 🎮","excerpt":"Here are the best tips to make your Linux gaming experience enjoyable.","title":"7 Tips and Tools to Improve Your Gaming Experience on Linux"}
Article 4: {"category":"Newsletter ✉️","excerpt":"The first edition of FOSS Weekly in the year 2024 is here. See, what's new in the new year.","title":"FOSS Weekly #24.01: Linux in 2024, GDM Customization, Distros You Missed Last Year"}
Article 5: {"category":"Tutorial","excerpt":"Wondering which init service your Linux system uses? Here's how to find it out.","title":"How to Check if Your Linux System Uses systemd"}
Article 6: {"category":"Ubuntu","excerpt":"Learn the logic behind each step you have to follow for adding an external repository in Ubuntu and installing packages from it.","title":"Installing Packages From External Repositories in Ubuntu [Explained]"}
Article 7: {"category":"Troubleshoot 🔬","excerpt":"Getting a warning that the boot partition has no space left? Here are some ways you can free up space on the boot partition in Ubuntu Linux.","title":"How to Free Up Space in /boot Partition on Ubuntu Linux?"}
Article 8: {"category":"Ubuntu","excerpt":"Wondering which Ubuntu version you're using? Here's how to check your Ubuntu version, desktop environment and other relevant system information.","title":"How to Check Ubuntu Version Details and Other System Information"}
Check Proxy IP map[asn:map[asnum:29614 org_name:VODAFONE GHANA AS INTERNATIONAL TRANSIT] country:GH geo:map[city:Accra latitude:5.5486 longitude:-0.2012 lum_city:accra lum_region:aa postal_code: region:AA region_name:Greater Accra Region tz:Africa/Accra] ip:197.251.144.148]
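Using Colly with a Local Proxy
Colly is an efficient, versatile web scraping framework for Go, known for its user-friendly API and its seamless integration with HTML parsers such as goquery. Unlike goquery, however, it supports and provides APIs for various network-related behaviors, including asynchronous requests for high-speed scraping, local caching, and rate limiting to ensure efficient and responsible use of web resources, as well as automatic handling of cookies and sessions, customizable user agents, and comprehensive error handling. It also supports using proxies with proxy switching or rotation, and it can be extended to tasks such as scraping JavaScript-generated content by integrating with headless browsers.
Open the colly.go file in your editor or IDE and paste the following lines of code into the ScrapeWithColly() function, right after the new collector is initialized: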
...
	// Define the URL of the proxy server
	proxyStr := "http://127.0.0.1:3128"

	// SetProxy sets a proxy for the collector
	if err := c.SetProxy(proxyStr); err != nil {
		log.Fatalf("Error setting proxy configuration: %v", err)
	}
...
This snippet uses Colly's SetProxy() method to define the proxy server that this collector instance uses for its network requests.
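The proxy rotation mentioned earlier boils down to choosing a different proxy URL for each request. Colly provides its own switcher for this in its proxy subpackage; the sketch below does not use it and instead illustrates the idea with the standard library only. The `roundRobinProxy` and `pick` helpers are hypothetical names, and the second proxy address (port 3129) is an assumed example that you would need to run yourself.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"sync/atomic"
)

// roundRobinProxy returns a function suitable for http.Transport.Proxy that
// cycles through the given proxy URLs, one per request.
func roundRobinProxy(proxies ...string) (func(*http.Request) (*url.URL, error), error) {
	if len(proxies) == 0 {
		return nil, fmt.Errorf("at least one proxy URL is required")
	}
	parsed := make([]*url.URL, 0, len(proxies))
	for _, p := range proxies {
		u, err := url.Parse(p)
		if err != nil {
			return nil, fmt.Errorf("parsing %q: %w", p, err)
		}
		parsed = append(parsed, u)
	}
	var next uint32
	return func(_ *http.Request) (*url.URL, error) {
		// The atomic counter keeps the rotation safe under concurrent requests.
		i := atomic.AddUint32(&next, 1) - 1
		return parsed[i%uint32(len(parsed))], nil
	}, nil
}

// pick returns the proxy chosen for the next request as a string.
func pick(rotate func(*http.Request) (*url.URL, error)) string {
	u, err := rotate(nil)
	if err != nil || u == nil {
		return ""
	}
	return u.String()
}

func main() {
	// The second address is a hypothetical extra proxy instance.
	rotate, err := roundRobinProxy("http://127.0.0.1:3128", "http://127.0.0.1:3129")
	if err != nil {
		fmt.Println("Error building rotation:", err)
		return
	}
	// A client using the rotation; each request asks rotate for its proxy.
	client := &http.Client{Transport: &http.Transport{Proxy: rotate}}
	_ = client

	for i := 0; i < 3; i++ {
		fmt.Println("request", i, "would use", pick(rotate))
	}
}
```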
Now, run this implementation with the following command from the project directory:
go run ./colly.go
The output should look like this:
$ go run ./colly.go
Article 0: {"category":"Newsletter ✉️","excerpt":"Unwind your new year celebration with new open-source projects, and keep an eye on interesting distro updates.","title":"FOSS Weekly #24.02: Mixing AI With Linux, Vanilla OS 2, and More"}
Article 1: {"category":"Tutorial","excerpt":"Wondering how to use tiling windows on GNOME? Try the tiling assistant. Here's how it works.","title":"How to Use Tiling Assistant on GNOME Desktop?"}
Article 2: {"category":"Linux Commands","excerpt":"The free command in Linux helps you gain insights on system memory usage (RAM), and more. Here's how to make good use of it.","title":"Free Command Examples"}
Article 3: {"category":"Gaming 🎮","excerpt":"Here are the best tips to make your Linux gaming experience enjoyable.","title":"7 Tips and Tools to Improve Your Gaming Experience on Linux"}
Article 4: {"category":"Newsletter ✉️","excerpt":"The first edition of FOSS Weekly in the year 2024 is here. See, what's new in the new year.","title":"FOSS Weekly #24.01: Linux in 2024, GDM Customization, Distros You Missed Last Year"}
Article 5: {"category":"Tutorial","excerpt":"Wondering which init service your Linux system uses? Here's how to find it out.","title":"How to Check if Your Linux System Uses systemd"}
Article 6: {"category":"Ubuntu","excerpt":"Learn the logic behind each step you have to follow for adding an external repository in Ubuntu and installing packages from it.","title":"Installing Packages From External Repositories in Ubuntu [Explained]"}
Article 7: {"category":"Troubleshoot 🔬","excerpt":"Getting a warning that the boot partition has no space left? Here are some ways you can free up space on the boot partition in Ubuntu Linux.","title":"How to Free Up Space in /boot Partition on Ubuntu Linux?"}
Article 8: {"category":"Ubuntu","excerpt":"Wondering which Ubuntu version you're using? Here's how to check your Ubuntu version, desktop environment and other relevant system information.","title":"How to Check Ubuntu Version Details and Other System Information"}
Check Proxy IP map[asn:map[asnum:29614 org_name:VODAFONE GHANA AS INTERNATIONAL TRANSIT] country:GH geo:map[city:Accra latitude:5.5486 longitude:-0.2012 lum_city:accra lum_region:aa postal_code: region:AA region_name:Greater Accra Region tz:Africa/Accra] ip:197.251.144.148]
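Using Selenium with a Local Proxy
Selenium is a tool used primarily to automate web browser interactions for testing web applications. It can perform tasks such as clicking buttons, entering text, and extracting data from web pages, which makes it well suited for scraping web content through automated interactions. Selenium controls the browser through WebDriver, which makes it possible to mimic real user interactions. Although this example uses Chrome, Selenium also supports other browsers, including Firefox, Safari, and Internet Explorer.
The Selenium WebDriver service lets you supply a proxy and other configuration that influences how the underlying browser behaves when interacting with the web, just as a real browser would. Programmatically, this is configured through the selenium.Capabilities{} definition.
To use Selenium with the local proxy, edit the selenium.go file and, inside the ScrapeWithSelenium() function, replace the selenium.Capabilities{} definition with the following snippet: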
...
	// Define proxy settings
	proxy := selenium.Proxy{
		Type: selenium.Manual,
		HTTP: "127.0.0.1:3128", // Replace with your proxy settings
		SSL:  "127.0.0.1:3128", // Replace with your proxy settings
	}

	// Configure the WebDriver instance with the proxy
	caps := selenium.Capabilities{
		"browserName": "chrome",
		"proxy":       proxy,
	}
...
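This snippet defines the proxy parameters that are used to configure Selenium's WebDriver capabilities. The next time you run the scraper, the connection goes through the proxy.
Now, run the implementation with the following command from the project directory: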
go run ./selenium.go
The output should look like this:
$ go run ./selenium.go
Article 0: {"category":"Newsletter ✉️","excerpt":"Unwind your new year celebration with new open-source projects, and keep an eye on interesting distro updates.","title":"FOSS Weekly #24.02: Mixing AI With Linux, Vanilla OS 2, and More"}
Article 1: {"category":"Tutorial","excerpt":"Wondering how to use tiling windows on GNOME? Try the tiling assistant. Here's how it works.","title":"How to Use Tiling Assistant on GNOME Desktop?"}
Article 2: {"category":"Linux Commands","excerpt":"The free command in Linux helps you gain insights on system memory usage (RAM), and more. Here's how to make good use of it.","title":"Free Command Examples"}
Article 3: {"category":"Gaming 🎮","excerpt":"Here are the best tips to make your Linux gaming experience enjoyable.","title":"7 Tips and Tools to Improve Your Gaming Experience on Linux"}
Article 4: {"category":"Newsletter ✉️","excerpt":"The first edition of FOSS Weekly in the year 2024 is here. See, what's new in the new year.","title":"FOSS Weekly #24.01: Linux in 2024, GDM Customization, Distros You Missed Last Year"}
Article 5: {"category":"Tutorial","excerpt":"Wondering which init service your Linux system uses? Here's how to find it out.","title":"How to Check if Your Linux System Uses systemd"}
Article 6: {"category":"Ubuntu","excerpt":"Learn the logic behind each step you have to follow for adding an external repository in Ubuntu and installing packages from it.","title":"Installing Packages From External Repositories in Ubuntu [Explained]"}
Article 7: {"category":"Troubleshoot 🔬","excerpt":"Getting a warning that the boot partition has no space left? Here are some ways you can free up space on the boot partition in Ubuntu Linux.","title":"How to Free Up Space in /boot Partition on Ubuntu Linux?"}
Article 8: {"category":"Ubuntu","excerpt":"Wondering which Ubuntu version you're using? Here's how to check your Ubuntu version, desktop environment and other relevant system information.","title":"How to Check Ubuntu Version Details and Other System Information"}
Check Proxy IP {"ip":"197.251.144.148","country":"GH","asn":{"asnum":29614,"org_name":"VODAFONE GHANA AS INTERNATIONAL TRANSIT"},"geo":{"city":"Accra","region":"AA","region_name":"Greater Accra Region","postal_code":"","latitude":5.5486,"longitude":-0.2012,"tz":"Africa/Accra","lum_city":"accra","lum_region":"aa"}}
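Although you can maintain a proxy server yourself, you're limited by various factors, including having to set up new servers for different regions and having to deal with other maintenance and security issues.
Bright Data Proxy Servers
Bright Data has an award-winning global proxy network infrastructure and offers a comprehensive set of proxy servers and services for a variety of web data collection purposes.
With Bright Data's extensive global network of proxy servers, you can easily access and collect data from different locations around the world. Bright Data also offers a range of proxy types, with more than 350 million residential, ISP, datacenter, and mobile proxies, each offering distinct advantages for specific web data collection tasks in areas such as legitimacy, speed, and reliability.
In addition, Bright Data's proxy rotation system ensures a high level of anonymity and minimizes detection, which makes it well suited for continuous, large-scale web data collection.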
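Set Up a Residential Proxy with Bright Data
Getting a residential proxy with Bright Data is straightforward. All you need to do is sign up for a free trial. Once you've signed up, click the Get started button for Residential Proxies. You're then prompted to fill in a form.
Go ahead and give the instance a name. Here, the name is my_go_demo_proxy. You also need to specify the type of IP to be provisioned: select Shared. Then provide the geolocation level that you'd like to imitate when accessing web content. By default, this is the Country level or zone. You also need to specify whether you want the web pages you request to be cached. For now, turn caching off.
After filling in this information, click Add to provision your residential proxy.
Next, you need to activate your residential proxy. As a new user, however, you're first asked to provide your billing information. Once you've completed that step, navigate to your dashboard and click the residential proxy you just created.
Make sure the Access parameters tab is selected.
Here, you'll find the parameters you need to use the residential proxy, such as the host, port, and authentication credentials. You'll need this information shortly.
Now you can integrate your Bright Data residential proxy with all three implementations of the scraper. This is similar to what you did for the local server, but here you'll also include authentication. In addition, because you're interacting with the web programmatically, you may not be able to review and accept the proxy server's SSL certificate as you would in a browser with a graphical user interface. You therefore need to disable SSL certificate verification programmatically on your web client so that your requests are not interrupted.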
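Start by creating a directory named brightdata in the project directory, and then copy the three .go files into the brightdata directory. Your directory structure should look like this: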
.
├── brightdata
│ ├── colly.go
│ ├── goquery.go
│ └── selenium.go
├── chromedriver
├── colly.go
├── go.mod
├── goquery.go
├── go.sum
├── LICENSE
├── README.md
└── selenium.go
2 directories, 11 files
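From here on, you'll modify the files in the brightdata directory.
Using goquery with a Bright Data Residential Proxy
In the ScrapeWithGoquery() function, you need to modify the proxyStr variable to include the authentication credentials in the proxy URL, in the format http://USERNAME:PASSWORD@HOST:PORT. Replace the current definition with the following snippet: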
...
func ScrapeWithGoquery() {
	// Define the proxy server with username and password
	proxyUsername := "username"      // Your residential proxy username
	proxyPassword := "your_password" // Your residential proxy password
	proxyHost := "server_host"       // Your residential proxy host
	proxyPort := "server_port"       // Your residential proxy port
	proxyStr := fmt.Sprintf("http://%s:%s@%s:%s", url.QueryEscape(proxyUsername), url.QueryEscape(proxyPassword), proxyHost, proxyPort)

	// Parse the proxy URL
...
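As an alternative to formatting the string by hand, the proxy URL can be assembled with url.UserPassword, which applies the escaping rules for the user-info part of a URL. This matters mainly for unusual credentials: url.QueryEscape encodes a space as +, and the URL parser does not turn that back into a space inside user info. The following is a minimal sketch; `buildProxyURL` and `passwordOf` are hypothetical helper names, and the credentials and host are placeholders.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// buildProxyURL assembles http://USERNAME:PASSWORD@HOST:PORT and lets net/url
// escape the credentials.
func buildProxyURL(username, password, host, port string) string {
	u := url.URL{
		Scheme: "http",
		User:   url.UserPassword(username, password),
		Host:   net.JoinHostPort(host, port),
	}
	return u.String()
}

// passwordOf parses a proxy URL and returns the decoded password, which is
// useful for checking that the credentials survive a round trip.
func passwordOf(proxyStr string) (string, error) {
	u, err := url.Parse(proxyStr)
	if err != nil {
		return "", err
	}
	password, _ := u.User.Password()
	return password, nil
}

func main() {
	// Placeholder values; use the ones from your proxy's access parameters.
	proxyStr := buildProxyURL("username", "p@ss word", "proxy.example.com", "8080")
	fmt.Println(proxyStr)

	password, err := passwordOf(proxyStr)
	if err != nil {
		fmt.Println("Error parsing proxy URL:", err)
		return
	}
	fmt.Println("password round-trips unchanged:", password == "p@ss word")
}
```

The resulting string can be passed to url.Parse exactly like the proxyStr in the snippet above.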
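Then you need to modify the HTTP client's transport so that it skips verification of the proxy server's SSL/TLS certificate. First, add the crypto/tls package to your imports. Then, after the proxy URL is parsed, replace the http.Transport definition with the following snippet: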
...
func ScrapeWithGoquery() {
	// Parse the proxy URL
	...
	// Create an http.Transport that uses the proxy
	transport := &http.Transport{
		Proxy: http.ProxyURL(proxyURL),
		TLSClientConfig: &tls.Config{
			InsecureSkipVerify: true, // Disable SSL certificate verification
		},
	}

	// Create an HTTP client with the transport
...
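This snippet modifies the HTTP client by configuring its transport to use the proxy and to skip SSL/TLS certificate verification. Make sure to replace the placeholder values with the values for your residential proxy.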
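Disabling verification is the simplest option, but it turns off certificate checks for every connection the client makes. If your proxy provider makes its CA certificate available as a PEM file, a narrower option is to trust that certificate explicitly. The sketch below shows the general standard library technique under that assumption; `tlsConfigTrusting` is a hypothetical helper name and proxy-ca.crt is a hypothetical file path.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"os"
)

// tlsConfigTrusting returns a TLS config that trusts the system roots plus the
// CA certificates in pemData, instead of skipping verification entirely.
func tlsConfigTrusting(pemData []byte) (*tls.Config, error) {
	pool, err := x509.SystemCertPool()
	if err != nil || pool == nil {
		pool = x509.NewCertPool() // fall back to an empty pool
	}
	if !pool.AppendCertsFromPEM(pemData) {
		return nil, errors.New("no valid PEM certificate found")
	}
	return &tls.Config{RootCAs: pool}, nil
}

func main() {
	// "proxy-ca.crt" is a hypothetical path to the CA certificate file.
	pemData, err := os.ReadFile("proxy-ca.crt")
	if err != nil {
		fmt.Println("Could not read CA certificate:", err)
		return
	}
	cfg, err := tlsConfigTrusting(pemData)
	if err != nil {
		fmt.Println("Could not build TLS config:", err)
		return
	}
	// cfg can be assigned to the TLSClientConfig field of an http.Transport.
	fmt.Println("custom root pool configured:", cfg.RootCAs != nil)
}
```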
Then run this implementation with the following command from the project directory:
go run brightdata/goquery.go
The output should look like this:
$ go run brightdata/goquery.go
Article 0: {"category":"Newsletter ✉️","excerpt":"Open source rival to Twitter, a hyped new terminal and a cool new Brave/Chrome feature among many other things.","title":"FOSS Weekly #24.07: Fedora Atomic Distro, Android FOSS Apps, Mozilla Monitor Plus and More"}
Article 1: {"category":"Explain","excerpt":"Intel makes things confusing, I guess. Let's try making the processor naming changes simpler.","title":"Intel Processor Naming Changes: All You Need to Know"}
Article 2: {"category":"Linux Commands","excerpt":"The Cut command lets you extract a part of the file to print without affecting the original file. Learn more here.","title":"Cut Command Examples"}
Article 3: {"category":"Raspberry Pi","excerpt":"A UART attached to your Raspberry Pi can help you troubleshoot issues with your Raspberry Pi. Here's what you need to know.","title":"Using a USB Serial Adapter (UART) to Help Debug Your Raspberry Pi"}
Article 4: {"category":"Newsletter ✉️","excerpt":"Damn Small Linux resumes development after 16 years.","title":"FOSS Weekly #24.06: Ollama AI, Zorin OS Upgrade, Damn Small Linux, Sudo on Windows and More"}
Article 5: {"category":"Tutorial","excerpt":"Zorin OS now provides a way to upgrade to a newer major version. Here's how to do that.","title":"How to upgrade to Zorin OS 17"}
Article 6: {"category":"Ubuntu","excerpt":"Learn the logic behind each step you have to follow for adding an external repository in Ubuntu and installing packages from it.","title":"Installing Packages From External Repositories in Ubuntu [Explained]"}
Article 7: {"category":"Troubleshoot 🔬","excerpt":"Getting a warning that the boot partition has no space left? Here are some ways you can free up space on the boot partition in Ubuntu Linux.","title":"How to Free Up Space in /boot Partition on Ubuntu Linux?"}
Article 8: {"category":"Ubuntu","excerpt":"Wondering which Ubuntu version you’re using? Here’s how to check your Ubuntu version, desktop environment and other relevant system information.","title":"How to Check Ubuntu Version Details and Other System Information"}
Check Proxy IP map[asn:map[asnum:7922 org_name:COMCAST-7922] country:US geo:map[city:Crown Point latitude:41.4253 longitude:-87.3565 lum_city:crownpoint lum_region:in postal_code:46307 region:IN region_name:Indiana tz:America/Chicago] ip:73.36.77.244]
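You'll notice that even though the same articles are scraped, the proxy IP check returns different information, which indicates that you're accessing the web from a different location or country.
Using Colly with a Bright Data Residential Proxy
Although Colly doesn't provide a method for disabling SSL/TLS verification programmatically, it does let you supply your own transport for its HTTP client to use.
Open the colly.go file in the brightdata directory in your editor or IDE and paste the following lines of code into the ScrapeWithColly() function, after the new collector is initialized (don't forget to add the crypto/tls, net/http, and net/url imports):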
...
func ScrapeWithColly() {
	...
	// Create an http.Transport that skips SSL/TLS certificate verification
	transport := &http.Transport{
		TLSClientConfig: &tls.Config{
			InsecureSkipVerify: true, // Disable SSL certificate verification
		},
	}

	// Set the collector instance to use the configured transport
	c.WithTransport(transport)
...
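This snippet defines an HTTP transport with SSL verification disabled and uses Colly's WithTransport() method to set it as the collector's transport for network requests.
Modify the proxyStr variable to include the residential proxy credentials, just as you did for goquery. Replace the proxyStr line with the following snippet: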
...
	// Define the proxy server with username and password
	proxyUsername := "username"      // Your residential proxy username
	proxyPassword := "your_password" // Your residential proxy password
	proxyHost := "server_host"       // Your residential proxy host
	proxyPort := "server_port"       // Your residential proxy port
	proxyStr := fmt.Sprintf("http://%s:%s@%s:%s", url.QueryEscape(proxyUsername), url.QueryEscape(proxyPassword), proxyHost, proxyPort)
...
Don't forget to replace the string values with the values from your residential proxy's Access parameters page.
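Rather than writing the real values into the source file, you may prefer to read them from environment variables so that the credentials stay out of version control. The following sketch shows one way to do that; the `loadCredentials` helper, the `proxyCredentials` type, and the variable names PROXY_USERNAME, PROXY_PASSWORD, PROXY_HOST, and PROXY_PORT are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// proxyCredentials holds the values shown on the access parameters page.
type proxyCredentials struct {
	Username, Password, Host, Port string
}

// loadCredentials reads the proxy settings through lookup (for example
// os.LookupEnv) and reports every variable that is missing or empty.
func loadCredentials(lookup func(string) (string, bool)) (proxyCredentials, error) {
	var missing []string
	get := func(key string) string {
		value, ok := lookup(key)
		if !ok || value == "" {
			missing = append(missing, key)
		}
		return value
	}
	creds := proxyCredentials{
		Username: get("PROXY_USERNAME"),
		Password: get("PROXY_PASSWORD"),
		Host:     get("PROXY_HOST"),
		Port:     get("PROXY_PORT"),
	}
	if len(missing) > 0 {
		return proxyCredentials{}, fmt.Errorf("missing environment variables: %s", strings.Join(missing, ", "))
	}
	return creds, nil
}

func main() {
	creds, err := loadCredentials(os.LookupEnv)
	if err != nil {
		fmt.Println("Proxy configuration incomplete:", err)
		return
	}
	// Avoid printing the password; the host and port are enough to confirm the setup.
	fmt.Printf("Using proxy %s:%s as %s\n", creds.Host, creds.Port, creds.Username)
}
```

Passing the lookup function as a parameter keeps the helper easy to exercise with a plain map in tests.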
Next, run this implementation with the following command from the project directory:
go run brightdata/colly.go
The output should look like this:
$ go run brightdata/colly.go
…
Check Proxy IP map[asn:map[asnum:2856 org_name:British Telecommunications PLC] country:GB geo:map[city:Turriff latitude:57.5324 longitude:-2.3883 lum_city:turriff lum_region:sct postal_code:AB53 region:SCT region_name:Scotland tz:Europe/London] ip:86.180.236.254]
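In the Check Proxy IP portion of the output, you'll notice that the country has changed even though the same credentials were used.
Using Selenium with a Bright Data Residential Proxy
With Selenium, you have to modify the selenium.Proxy{} definition so that the proxy URL string carries the credentials. Replace the current proxy definition with the following: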
...
	// Define the proxy server with username and password
	proxyUsername := "username"      // Your residential proxy username
	proxyPassword := "your_password" // Your residential proxy password
	proxyHost := "server_host"       // Your residential proxy host
	proxyPort := "server_port"       // Your residential proxy port
	proxyStr := fmt.Sprintf("http://%s:%s@%s:%s", url.QueryEscape(proxyUsername), url.QueryEscape(proxyPassword), proxyHost, proxyPort)

	// Define proxy settings
	proxy := selenium.Proxy{
		Type: selenium.Manual,
		HTTP: proxyStr,
		SSL:  proxyStr,
	}
...
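Don't forget to import the net/url package.
This snippet defines the individual proxy parameters and combines them into the proxy URL that is used in the proxy configuration.
Now you need to configure the Chrome WebDriver options to disable SSL verification when using the residential proxy, as in the previous implementations. To do so, modify the arguments in the chromeCaps definition to include the --ignore-certificate-errors option, like this: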
...
	chromeCaps := chrome.Capabilities{Args: []string{
		"--headless=new",              // Start the browser without a UI, as a background process
		"--ignore-certificate-errors", // Disable SSL certificate verification
	}}
	caps.AddChrome(chromeCaps)
...
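By default, Selenium doesn't support authenticated proxy configuration. However, you can work around this by using a small package that builds a Chrome extension for the authenticated proxy connection.
First, add the package to your project with this go get command:
go get github.com/rexfordnyrk/proxyauth
Then import the package into the brightdata/selenium.go file by adding the line "github.com/rexfordnyrk/proxyauth" to the import block at the top of the file.
Next, build the Chrome extension using the BuildExtension() method from the proxyauth package, passing it your Bright Data residential proxy credentials. To do so, paste the following snippet after the chromeCaps definition but before the caps.AddChrome(chromeCaps) line: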
…
	// Build the proxy auth extension using the Bright Data proxy credentials
	extension, err := proxyauth.BuildExtention(proxyHost, proxyPort, proxyUsername, proxyPassword)
	if err != nil {
		log.Fatal("BuildProxyExtension Error:", err)
	}

	// Include the extension to allow proxy authentication in Chrome
	if err := chromeCaps.AddExtension(extension); err != nil {
		log.Fatal("Error adding Extension:", err)
	}
…
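This snippet creates a Chrome extension and adds it to the Chrome WebDriver so that web requests are authenticated with the proxy credentials you provided.
You can run this implementation with the following command from the project directory: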
go run brightdata/selenium.go
The output should look like this:
$ go run brightdata/selenium.go
Article 0: {"categoryText":"Newsletter ✉️","excerpt":"Check out the promising new features in Ubuntu 24.04 LTS and a new immutable distro.","title":"FOSS Weekly #24.08: Ubuntu 24.04 Features, Arkane Linux, grep, Fedora COSMIC and More"}
…
Article 8: {"categoryText":"Ubuntu","excerpt":"Wondering which Ubuntu version you’re using? Here’s how to check your Ubuntu version, desktop environment and other relevant system information.","title":"How to Check Ubuntu Version Details and Other System Information"}
Check Proxy IP {"ip":"176.45.169.166","country":"SA","asn":{"asnum":25019,"org_name":"Saudi Telecom Company JSC"},"geo":{"city":"Riyadh","region":"01","region_name":"Riyadh Region","postal_code":"","latitude":24.6869,"longitude":46.7224,"tz":"Asia/Riyadh","lum_city":"riyadh","lum_region":"01"}}
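Once again, if you look at the IP information at the bottom of the output, you'll notice that a different country was used to send the request. This is Bright Data's proxy rotation system at work.
As you can see, using Bright Data in your Go application is straightforward. First, you create the residential proxy on the Bright Data platform and obtain your credentials. Second, you use that information to modify your code to use the proxy.
Conclusion
Web proxy servers are a key component for tailoring how users interact with the internet. In this article, you learned about proxy servers and how to set up a self-hosted proxy server using Squid. You also learned how to integrate a local proxy server into your Go applications, in this case a web scraper.
If you're interested in working with proxy servers, consider using Bright Data. Its state-of-the-art proxy network can help you collect data quickly without worrying about additional infrastructure or maintenance.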