How to Scrape Shopify Stores With Python

Simplify Shopify data extraction with products.json and efficient scraping methods.

On the surface, Shopify stores look like some of the most challenging targets for data extraction. The product below is a typical Shopify listing, and its markup is deeply nested.

<div class="site-box-content product-holder"><a href="/collections/ready-to-ship/products/the-eira-straight-leg" class="product-item style--one alt color--light   with-secondary-image " data-js-product-item="">

  <div class="box--product-image primary" style="padding-top: 120.00048000192001%"><img src="//hiutdenim.co.uk/cdn/shop/files/Hiut-EiraEcru-01_91c15dbe-7412-47b6-8f76-bdb434199203.jpg?v=1731517834&amp;width=640" alt="The Eira - Organic Ecru" srcset="//hiutdenim.co.uk/cdn/shop/files/Hiut-EiraEcru-01_91c15dbe-7412-47b6-8f76-bdb434199203.jpg?v=1731517834&amp;width=360 360w,//hiutdenim.co.uk/cdn/shop/files/Hiut-EiraEcru-01_91c15dbe-7412-47b6-8f76-bdb434199203.jpg?v=1731517834&amp;width=420 420w,//hiutdenim.co.uk/cdn/shop/files/Hiut-EiraEcru-01_91c15dbe-7412-47b6-8f76-bdb434199203.jpg?v=1731517834&amp;width=480 480w,//hiutdenim.co.uk/cdn/shop/files/Hiut-EiraEcru-01_91c15dbe-7412-47b6-8f76-bdb434199203.jpg?v=1731517834&amp;width=640 640w,//hiutdenim.co.uk/cdn/shop/files/Hiut-EiraEcru-01_91c15dbe-7412-47b6-8f76-bdb434199203.jpg?v=1731517834&amp;width=840 840w,//hiutdenim.co.uk/cdn/shop/files/Hiut-EiraEcru-01_91c15dbe-7412-47b6-8f76-bdb434199203.jpg?v=1731517834&amp;width=1080 1080w,//hiutdenim.co.uk/cdn/shop/files/Hiut-EiraEcru-01_91c15dbe-7412-47b6-8f76-bdb434199203.jpg?v=1731517834&amp;width=1280 1280w,//hiutdenim.co.uk/cdn/shop/files/Hiut-EiraEcru-01_91c15dbe-7412-47b6-8f76-bdb434199203.jpg?v=1731517834&amp;width=1540 1540w,//hiutdenim.co.uk/cdn/shop/files/Hiut-EiraEcru-01_91c15dbe-7412-47b6-8f76-bdb434199203.jpg?v=1731517834&amp;width=1860 1860w,//hiutdenim.co.uk/cdn/shop/files/Hiut-EiraEcru-01_91c15dbe-7412-47b6-8f76-bdb434199203.jpg?v=1731517834&amp;width=2100 2100w,//hiutdenim.co.uk/cdn/shop/files/Hiut-EiraEcru-01_91c15dbe-7412-47b6-8f76-bdb434199203.jpg?v=1731517834&amp;width=2460 2460w,//hiutdenim.co.uk/cdn/shop/files/Hiut-EiraEcru-01_91c15dbe-7412-47b6-8f76-bdb434199203.jpg?v=1731517834&amp;width=2820 2820w" sizes="(max-width: 768px) 50vw, (max-width: 1024px) and (orientation: portrait) 50vw, 25vw " loading="lazy" class="lazy lazyloaded" data-ratio="0.8" width="3200" height="4000" onload="this.classList.add('lazyloaded')"><span class="lazy-preloader " aria-hidden="true"><svg class="circular-loader" viewBox="25 25 50 50"><circle class="loader-path" cx="50" cy="50" r="20" fill="none" stroke-width="4"></circle></svg></span></div><div class="box--product-image secondary" style="padding-top: 120.00048000192001%"><img src="//hiutdenim.co.uk/cdn/shop/files/Hiut-EiraEcru-02.jpg?v=1731517834&amp;width=640" alt="The Eira - Organic Ecru" srcset="//hiutdenim.co.uk/cdn/shop/files/Hiut-EiraEcru-02.jpg?v=1731517834&amp;width=360 360w,//hiutdenim.co.uk/cdn/shop/files/Hiut-EiraEcru-02.jpg?v=1731517834&amp;width=420 420w,//hiutdenim.co.uk/cdn/shop/files/Hiut-EiraEcru-02.jpg?v=1731517834&amp;width=480 480w,//hiutdenim.co.uk/cdn/shop/files/Hiut-EiraEcru-02.jpg?v=1731517834&amp;width=640 640w,//hiutdenim.co.uk/cdn/shop/files/Hiut-EiraEcru-02.jpg?v=1731517834&amp;width=840 840w,//hiutdenim.co.uk/cdn/shop/files/Hiut-EiraEcru-02.jpg?v=1731517834&amp;width=1080 1080w,//hiutdenim.co.uk/cdn/shop/files/Hiut-EiraEcru-02.jpg?v=1731517834&amp;width=1280 1280w,//hiutdenim.co.uk/cdn/shop/files/Hiut-EiraEcru-02.jpg?v=1731517834&amp;width=1540 1540w,//hiutdenim.co.uk/cdn/shop/files/Hiut-EiraEcru-02.jpg?v=1731517834&amp;width=1860 1860w,//hiutdenim.co.uk/cdn/shop/files/Hiut-EiraEcru-02.jpg?v=1731517834&amp;width=2100 2100w,//hiutdenim.co.uk/cdn/shop/files/Hiut-EiraEcru-02.jpg?v=1731517834&amp;width=2460 2460w,//hiutdenim.co.uk/cdn/shop/files/Hiut-EiraEcru-02.jpg?v=1731517834&amp;width=2820 2820w" sizes="(max-width: 768px) 50vw, (max-width: 1024px) and (orientation: portrait) 50vw, 25vw " loading="lazy" class="lazy 
lazyloaded" data-ratio="0.8" width="3200" height="4000" onload="this.classList.add('lazyloaded')"></div><div class="caption">

    <div>
      <span class="title"><span class="underline-animation">The Eira - Organic Ecru</span></span>
      <span class="price text-size--smaller"><span style="display:flex;flex-direction:row">$285.00</span></span>

    </div><quick-view-product class="quick-add-to-cart">
          <div class="quick-add-to-cart-button">
            <button class="product__add-to-cart" data-href="/products/the-eira-straight-leg" tabindex="-1">
              <span class="visually-hidden">Add to cart</span>
              <span class="add-to-cart__text" style="height:26px" role="img"><svg width="22" height="26" viewBox="0 0 22 26" fill="none" xmlns="http://www.w3.org/2000/svg"><path d="M6.57058 6.64336H4.49919C3.0296 6.64336 1.81555 7.78963 1.7323 9.25573L1.00454 22.0739C0.914352 23.6625 2.17916 25 3.77143 25H18.2286C19.8208 25 21.0856 23.6625 20.9955 22.0739L20.2677 9.25573C20.1844 7.78962 18.9704 6.64336 17.5008 6.64336H15.4294M6.57058 6.64336H15.4294M6.57058 6.64336V4.69231C6.57058 2.6531 8.22494 1 10.2657 1H11.7343C13.775 1 15.4294 2.6531 15.4294 4.69231V6.64336" stroke="var(--main-text)" style="fill:none!important" stroke-width="1.75"></path><path d="M10.0801 12H12.0801V20H10.0801V12Z" fill="var(--main-text)" style="stroke:none!important"></path><path d="M15.0801 15V17L7.08008 17L7.08008 15L15.0801 15Z" fill="var(--main-text)" style="stroke:none!important"></path></svg></span><span class="lazy-preloader add-to-cart__preloader" aria-hidden="true"><svg class="circular-loader" viewBox="25 25 50 50"><circle class="loader-path" cx="50" cy="50" r="20" fill="none" stroke-width="4"></circle></svg></span></button>
          </div>
        </quick-view-product></div><div class="product-badges-holder"></div></a></div>

Extracting data from the HTML above isn't impossible; there's just a much easier way.
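For a sense of what the HTML route involves, here is a rough sketch using BeautifulSoup on a trimmed excerpt of the caption markup above (our own illustration, not part of this tutorial's code):

from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Trimmed excerpt of the product caption shown in the markup above
html = (
    '<div class="caption"><span class="title">'
    '<span class="underline-animation">The Eira - Organic Ecru</span></span>'
    '<span class="price text-size--smaller"><span>$285.00</span></span></div>'
)

soup = BeautifulSoup(html, "html.parser")
title = soup.select_one(".title .underline-animation").get_text(strip=True)
price = soup.select_one(".price").get_text(strip=True)
print(title, price)  # The Eira - Organic Ecru $285.00

And this only covers the caption: you would need selectors like these for every field, on every layout the store uses.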

Shopify's Homepage

Take https://hiutdenim.co.uk/ as an example. The homepage contains some product information, but it's limited; scroll all the way down and you'll find it near the bottom of the page.

Shopify store homepage

At first glance, it looks like you'd need to scrape every collection link, then fetch and parse each of those pages in turn; Shopify stores don't follow the page layouts that traditional e-commerce scrapers expect. Fortunately, there's another way.

Shopify's JSON Page

That's right: every product is available as a JSON object. We don't even need BeautifulSoup or Selenium.

Just append /products.json to the store URL. Every Shopify store exposes a products.json endpoint.

The Shopify JSON page

We can request this content directly (yes, it really works) and get most of the data we need. Once we have it, we just filter for the parts we want to keep. You can verify it yourself on our example store: https://hiutdenim.co.uk/products.json.
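You can confirm the endpoint in a couple of lines before writing any real code (the status code should be 200 on stores that haven't disabled it, and the product count will vary):

import requests

response = requests.get("https://hiutdenim.co.uk/products.json")
print(response.status_code)               # 200 when the endpoint is open
print(len(response.json()["products"]))   # products returned on the first page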

Scraping Shopify in Python

Now that we know where the target data lives, a seemingly tedious job becomes much simpler. Since we're only handling JSON, we need a single dependency: Python Requests.

pip install requests

The Individual Functions

Let's look at the code piece by piece. Our scraper is made up of three parts.

The function below is the important one; it performs the actual scraping logic:

def scrape_shopify(url, retries=2):
    """scrape a shopify store"""
    json_url = f"{url}products.json"
    items = []
    success = False
    while not success and retries > 0:
        try:
            # the GET sits inside the try so connection errors are caught too
            response = requests.get(json_url)
            response.raise_for_status()
            products = response.json()["products"]
            for product in products:
                # keep only the fields we care about
                product_data = {
                    "title": product["title"],
                    "tags": product["tags"],
                    "id": product["id"],
                    "variants": product["variants"],
                    "images": product["images"],
                    "options": product["options"]
                }
                items.append(product_data)
            success = True
        except requests.RequestException as e:
            print(f"Error during request: {e}, failed to get {json_url}")
        except KeyError as key_error:
            print(f"Failed to parse json: {key_error}")
        except json.JSONDecodeError as e:
            print(f"json error: {e}")
        except Exception as e:
            print(f"Unforeseen error: {e}")
        retries -= 1
        print(f"Retries left: {retries}")
    return items
  • First, we append products.json to the original URL: json_url = f"{url}products.json".
  • We initialize an empty list, items. Every product we scrape gets appended to it, and the function returns it at the end.
  • If we receive a good response, we read the "products" key to get all of the products.
  • From each product, we extract a handful of specific fields to build a dict, product_data.
  • Each product_data is appended to the items list in turn.
  • The loop continues until every product on the page has been parsed.
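If you want to sanity-check the function on its own (assuming requests and json are already imported), a quick test looks like this; the exact numbers depend on the store's catalog:

items = scrape_shopify("https://hiutdenim.co.uk/")
print(f"Scraped {len(items)} products")
if items:
    print(items[0]["title"])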

We now have a function that performs the scrape and returns a list of products. Next, we need a function that takes that list and writes it to a file. We could use CSV, but since the structure is heavily nested, JSON is the better fit here: it supports flexible data structures and keeps everything convenient for later analysis.
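To see why, here is a hypothetical items_to_csv() helper (our own sketch, not part of the tutorial): every product has to be exploded into one row per variant, and the image list still has to be squashed into a single string.

import csv

def items_to_csv(items, filename):
    """hypothetical CSV writer: one row per variant, images joined into one cell"""
    with open(filename, "w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["product_id", "title", "variant", "price", "images"])
        for item in items:
            # lists don't fit in a flat cell, so they must be joined
            image_urls = " | ".join(image["src"] for image in item["images"])
            for variant in item["variants"]:
                writer.writerow([item["id"], item["title"],
                                 variant["title"], variant["price"], image_urls])

JSON, by contrast, can be dumped as-is: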

def json2file(json_data, filename):
    """save json data to a file"""
    try:
        with open(filename, "w", encoding="utf-8") as file:
            json.dump(json_data, file, indent=4)
            print(f"Data successfully saved: {filename}")
    except Exception as e:
        print(f"failed to write json data to {filename}, ERROR: {e}")

That's all the core code we need. Next, we run the scraper from the main block.

if __name__ == "__main__":
    shop_url = "https://hiutdenim.co.uk/"
    items = scrape_shopify(shop_url)

    json2file(items, "output.json")

Putting It All Together

Put it all together and the scraper is complete. What initially looked like a complex parsing job comes out to roughly 50 lines of code.

import requests
import json

def json2file(json_data, filename):
    """save json data to a file"""
    try:
        with open(filename, "w", encoding="utf-8") as file:
            json.dump(json_data, file, indent=4)
            print(f"Data successfully saved: {filename}")
    except Exception as e:
        print(f"failed to write json data to {filename}, ERROR: {e}")

def scrape_shopify(url, retries=2):
    """scrape a shopify store"""
    json_url = f"{url}products.json"
    items = []
    success = False
    while not success and retries > 0:
        try:
            # the GET sits inside the try so connection errors are caught too
            response = requests.get(json_url)
            response.raise_for_status()
            products = response.json()["products"]
            for product in products:
                product_data = {
                    "title": product["title"],
                    "tags": product["tags"],
                    "id": product["id"],
                    "variants": product["variants"],
                    "images": product["images"],
                    "options": product["options"]
                }
                items.append(product_data)
            success = True
        except requests.RequestException as e:
            print(f"Error during request: {e}, failed to get {json_url}")
        except KeyError as key_error:
            print(f"Failed to parse json: {key_error}")
        except json.JSONDecodeError as e:
            print(f"json error: {e}")
        except Exception as e:
            print(f"Unforeseen error: {e}")
        retries -= 1
    return items


if __name__ == "__main__":
    shop_url = "https://hiutdenim.co.uk/"
    items = scrape_shopify(shop_url)

    json2file(items, "output.json")

The Returned Data

Our data comes back as an array of JSON objects. Each product contains variants and images lists, which would make CSV storage awkward. The snippet below shows one complete product:

{
        "title": "The Valerie - Organic Denim",
        "tags": [
            "The Valerie",
            "Women"
        ],
        "id": 14874183401848,
        "variants": [
            {
                "id": 54902462808440,
                "title": "UK10-29 / 30",
                "option1": "UK10-29",
                "option2": "30",
                "option3": null,
                "sku": null,
                "requires_shipping": true,
                "taxable": true,
                "featured_image": null,
                "available": true,
                "price": "220.00",
                "grams": 0,
                "compare_at_price": null,
                "position": 1,
                "product_id": 14874183401848,
                "created_at": "2025-01-21T14:04:58+00:00",
                "updated_at": "2025-02-12T17:17:54+00:00"
            },
            {
                "id": 54902462939512,
                "title": "UK12-30 / 32",
                "option1": "UK12-30",
                "option2": "32",
                "option3": null,
                "sku": null,
                "requires_shipping": true,
                "taxable": true,
                "featured_image": null,
                "available": true,
                "price": "220.00",
                "grams": 0,
                "compare_at_price": null,
                "position": 2,
                "product_id": 14874183401848,
                "created_at": "2025-01-21T14:04:58+00:00",
                "updated_at": "2025-02-12T17:17:54+00:00"
            },
            {
                "id": 54902463070584,
                "title": "UK14-32 / 28",
                "option1": "UK14-32",
                "option2": "28",
                "option3": null,
                "sku": null,
                "requires_shipping": true,
                "taxable": true,
                "featured_image": null,
                "available": true,
                "price": "220.00",
                "grams": 0,
                "compare_at_price": null,
                "position": 3,
                "product_id": 14874183401848,
                "created_at": "2025-01-21T14:04:58+00:00",
                "updated_at": "2025-02-12T17:17:54+00:00"
            },
            {
                "id": 54902463496568,
                "title": "UK18-36 / 30",
                "option1": "UK18-36",
                "option2": "30",
                "option3": null,
                "sku": null,
                "requires_shipping": true,
                "taxable": true,
                "featured_image": null,
                "available": true,
                "price": "220.00",
                "grams": 0,
                "compare_at_price": null,
                "position": 4,
                "product_id": 14874183401848,
                "created_at": "2025-01-21T14:04:58+00:00",
                "updated_at": "2025-02-12T17:17:54+00:00"
            }
        ],
        "images": [
            {
                "id": 31828166443078,
                "created_at": "2024-06-17T12:05:49+01:00",
                "position": 1,
                "updated_at": "2024-06-17T12:05:50+01:00",
                "product_id": 14874183401848,
                "variant_ids": [],
                "src": "https://cdn.shopify.com/s/files/1/0065/4242/files/HDC_0723_JapanInd_Valerie_45_3_c547ba8a-681b-4486-8cd7-884000e43302.jpg?v=1718622350",
                "width": 4000,
                "height": 4000
            },
            {
                "id": 31828166541382,
                "created_at": "2024-06-17T12:05:49+01:00",
                "position": 2,
                "updated_at": "2024-06-17T12:05:51+01:00",
                "product_id": 14874183401848,
                "variant_ids": [],
                "src": "https://cdn.shopify.com/s/files/1/0065/4242/files/HDC_0723_JapanInd_Valerie_Back_2_5909adb3-c2ab-4810-8b66-a486e8d827a8.jpg?v=1718622351",
                "width": 4000,
                "height": 4000
            },
            {
                "id": 31828166508614,
                "created_at": "2024-06-17T12:05:49+01:00",
                "position": 3,
                "updated_at": "2024-06-17T12:05:51+01:00",
                "product_id": 14874183401848,
                "variant_ids": [],
                "src": "https://cdn.shopify.com/s/files/1/0065/4242/files/HDC_0723_JapanInd_Valerie_Front_3_4316907a-9fd8-4649-894c-4028877370e1.jpg?v=1718622351",
                "width": 4000,
                "height": 4000
            },
            {
                "id": 31828166475846,
                "created_at": "2024-06-17T12:05:49+01:00",
                "position": 4,
                "updated_at": "2024-06-17T12:05:51+01:00",
                "product_id": 14874183401848,
                "variant_ids": [],
                "src": "https://cdn.shopify.com/s/files/1/0065/4242/files/HDC_0723_JapanInd_Valerie_Side_2_ea21477b-c1ba-4c8a-b75e-75c6427b4977.jpg?v=1718622351",
                "width": 4000,
                "height": 4000
            }
        ],
        "options": [
            {
                "name": "Waist",
                "position": 1,
                "values": [
                    "UK10-29",
                    "UK12-30",
                    "UK14-32",
                    "UK18-36"
                ]
            },
            {
                "name": "Leg Length",
                "position": 2,
                "values": [
                    "30",
                    "32",
                    "28"
                ]
            }
        ]
    },
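Once saved, the nested structure is easy to work with. As a small example of our own (assuming the output.json produced above), this snippet prints each product's price range across its variants:

import json

with open("output.json", "r", encoding="utf-8") as file:
    items = json.load(file)

for item in items:
    # variant prices are strings like "220.00", so cast before comparing
    prices = [float(variant["price"]) for variant in item["variants"]]
    print(f'{item["title"]}: {min(prices):.2f} - {max(prices):.2f}')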

Advanced Techniques

Things don't always go smoothly in the real world. When running the scraper above, you might hit issues such as multi-page catalogs or getting blocked by the site.

Pagination

When scraping large stores, you'll frequently run into pagination, and you'll want to squeeze as many results out of each page as possible. Adding the page=<PAGE_NUMBER> parameter to the URL selects a specific page.
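The endpoint also accepts a limit parameter (commonly capped at 250 products per page), so a paginated request URL looks like this:

https://www.allbirds.com/products.json?page=2&limit=250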

With a small change, the scraper function accepts a page argument:

def scrape_shopify(url, page=1, retries=2):
    """scrape a shopify store"""
    json_url = f"{url}products.json?page={page}"

Then we update the main block to match:

if __name__ == "__main__":
    shop_url = "https://www.allbirds.com/"
    PAGES = 3

    for page in range(PAGES):
        items = scrape_shopify(shop_url, page=page+1)

        json2file(items, f"page{page+1}output.json")
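If you don't know the page count up front, an alternative (a sketch of our own) is to keep fetching until a page comes back empty:

page = 1
while True:
    items = scrape_shopify(shop_url, page=page)
    # an empty page (or a request that exhausted its retries) ends the loop
    if not items:
        break
    json2file(items, f"page{page}output.json")
    page += 1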

Proxy Integration

Sometimes you'll need a proxy service to keep your scraper from being blocked. With our Shopify proxies, you simply add your credentials to the request.

PROXY_URL = "http://brd-customer-<YOUR-USERNAME>-zone-<YOUR-ZONE>:<YOUR-PASSWORD>@brd.superproxy.io:33335"
proxies = {
    "http": PROXY_URL,
    "https": PROXY_URL
}
response = requests.get(json_url, proxies=proxies, verify="brd.crt")
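To wire the proxy into the scraper itself, one option (again, our own sketch rather than the tutorial's code, simplified to return the raw product list) is to thread a proxies argument through scrape_shopify():

def scrape_shopify(url, page=1, retries=2, proxies=None):
    """scrape a shopify store, optionally through a proxy (a sketch)"""
    json_url = f"{url}products.json?page={page}"
    for _ in range(retries):
        try:
            response = requests.get(
                json_url,
                proxies=proxies,
                # brd.crt is Bright Data's CA certificate; verify normally otherwise
                verify="brd.crt" if proxies else True,
            )
            response.raise_for_status()
            return response.json()["products"]
        except requests.RequestException as e:
            print(f"Request failed: {e}")
    return []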

Other Bright Data Solutions

Bright Data offers mature, ready-made solutions so you don't have to build a complex scraper from scratch. Extract data effortlessly with our dedicated Shopify scraper, or browse our ready-made datasets, available in multiple formats so you can start your project right away.

Conclusion

Scraping Shopify stores is far from impossible. By simply appending products.json and treating it as an API, you can easily pull large amounts of detailed product data. You don't even need an HTML parser! To save development time, use our prebuilt scrapers, or jump straight into our datasets.

Every Bright Data product comes with a free trial, so sign up today!