Source content from anywhere with ScrapingBee

Pierre de Wulf, co-founder of ScrapingBee, joined us on yesterday's unauthorized and rum-fueled treasure hunt in the sharky waters around the Gatsby islands.

The What?

Source Crowdcast webinars into the Gatsby Data Layer using ScrapingBee — an API that simplifies web scraping.

The Why?

There is no official Crowdcast API, and keeping the data in sync using copy/paste is a pain (or at least boring 🤪). To scrape Crowdcast, we need to load the page in a headless browser, for instance driven by Puppeteer. Running one ourselves as part of the Gatsby build process isn't practical, so we outsource the job to ScrapingBee.
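
To get a feel for what ScrapingBee does for us, here is a minimal sketch of the simplest possible call: hand it a URL, let its headless browser render the page, and get the final HTML back. The fetchRenderedHtml name is ours, and render_js is on by default; we spell it out for clarity:

const axios = require("axios");

// A minimal sketch: ScrapingBee loads the page in a headless
// browser and returns the rendered HTML as a string
const fetchRenderedHtml = async (url) => {
  const { data } = await axios.get("https://app.scrapingbee.com/api/v1/", {
    params: {
      api_key: process.env.SCRAPING_BEE_API_KEY,
      url, // the page to scrape, e.g. a Crowdcast profile
      render_js: true, // run the page's JavaScript before returning
    },
  });

  return data;
};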

The How

We used the Data Extraction feature from ScrapingBee. It lets us select data on a page using CSS selectors. It felt kinda similar to cheerio, if you have ever used that.
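
For comparison, here is roughly what the same extraction would look like in cheerio, assuming you already have the rendered HTML in hand (cheerio only parses HTML, it won't run the page's JavaScript, which is exactly why we need ScrapingBee in the first place):

const cheerio = require("cheerio");

// A rough cheerio equivalent of the extraction rules further down
const extractWebinars = (html) => {
  const $ = cheerio.load(html);

  return $(".event-tile")
    .map((_, tile) => ({
      // the text of the .event-tile__title element
      title: $(tile).find(".event-tile__title").text(),
      // the href attribute of the first link element
      path: $(tile).find("a").first().attr("href"),
    }))
    .get();
};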

As always, I started by copy/pasting the example snippets. It worked almost out of the box, but we had to use the wait_for option because the Crowdcast page takes a while to load:

"It's sometimes necessary to wait for a particular element to appear in the DOM before ScrapingBee returns the HTML content." (ScrapingBee Docs)

The Code

const axios = require("axios");

const scrapeCrowdcast = async () => {
  const { data } = await axios.get("https://app.scrapingbee.com/api/v1/", {
    params: {
      api_key: process.env.SCRAPING_BEE_API_KEY,
      // The Crowdcast profile page to scrape
      url: "https://www.crowdcast.io/queenraae",
      // Wait for there to be at least one
      // non-empty .event-tile element
      wait_for: ".event-tile",
      extract_rules: {
        webinars: {
          // Let's create a list with data
          // extracted from the .event-tile elements
          selector: ".event-tile",
          type: "list",
          // Each object in the list should
          output: {
            // have a title lifted from
            // the .event-tile__title element
            title: ".event-tile__title",
            // and a path lifted from
            // the href attribute of the first link element
            path: {
              selector: "a",
              output: "@href",
            },
          },
        },
      },
    },
  });

  return data;
};

The resulting data object:

{
  webinars: [
    {
      title: "5 Gatsby Gotchas to look out for as a React developer",
      path: "/e/gatsby-gotchas-react?utm_source=profile&utm_medium=profile_web&utm_campaign=profile",
    },
    {
      title: "Testing your Gatsby Serverless Functions",
      path: "/e/testing-your-functions?utm_source=profile&utm_medium=profile_web&utm_campaign=profile",
    },
    // and more
  ],
}

We then loop through the webinars on the data object, creating content nodes:

// gatsby-node.js
exports.sourceNodes = async (gatsbyUtils) => {
  const { actions, createNodeId, createContentDigest } = gatsbyUtils;
  const { createNode } = actions;

  const data = await scrapeCrowdcast();

  for (const webinar of data.webinars) {
    createNode({
      id: createNodeId(webinar.path),
      title: webinar.title,
      url: "https://www.crowdcast.io" + webinar.path,
      rawScrape: webinar,
      internal: {
        type: `CrowdcastWebinar`,
        mediaType: `text/json`,
        content: JSON.stringify(webinar),
        contentDigest: createContentDigest(webinar),
      },
    });
  }
};
And voilà, we have webinar nodes in our data layer:

query MyQuery {
  allCrowdcastWebinar {
    nodes {
      title
      url
    }
  }
}
To see the entire demo, check out its GitHub repository.

Should we make this a full-featured plugin, extracting the cover art and descriptions from the individual webinar pages? Let me know!

All the best,
Queen Raae

PS: ScrapingBee is a paid service, but we are, as always, neither sponsored nor an affiliate.
PPS: If you want to learn more about web scraping without ScrapingBee, check out their article Web Scraping with Javascript and NodeJS.

Interested in more daily treasures like this one, sent directly to your inbox?