
Multi-Page Web Scraper


In this example we use Puppeteer, a Node library that drives headless Google Chrome, to extract certain data elements from the Google Partners web pages. The technique is quite simple, yet incredibly powerful, as you will see from the source code provided.

What is Puppeteer?

Puppeteer is a Node library which provides a high-level API to control Chromium or Chrome over the DevTools Protocol.
The Puppeteer API is hierarchical and mirrors the browser structure.

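To make that hierarchy concrete, here is a minimal sketch (using example.com as a stand-in URL): a Browser is launched, a Page is opened within it, and all navigation and data extraction happens through that Page.

const puppeteer = require('puppeteer');

(async () => {
	const browser = await puppeteer.launch();	// The top-level Browser
	const page = await browser.newPage();		// A Page belonging to that Browser
	await page.goto('https://example.com');		// Navigate the Page over the DevTools Protocol
	console.log(await page.title());			// Read data back out of the Page
	await browser.close();
})();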

Installing Puppeteer

To install Puppeteer we first need to install Node, then write some boilerplate JavaScript (source code provided below) that runs under Node and controls the Google Chrome web browser through the puppeteer library. Puppeteer requires Node v7.6.0 or above.
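If you want to verify your Node version and pull Puppeteer in directly (the package.json route is shown further down), the usual commands are:

node --version
npm install puppeteer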

ScrapeGooglePartners.js

#!/usr/bin/env node

/*
	A creation by Thomas EC. Smith of www.WeBots.London using Google Puppeteer!
	Scraping the Google partners webpage for their partner information.
	
	The goal is to extract the partner names and their company logo from all pages,
	then output this information in an easily readable format.
*/

const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
	
	// On your marks... Get set... Go!
	console.time('Total Scrape Time');
	
	// Extract partners on the page, recursively check the next page.
	const extractPartners = async (url) => {

		// How are we doing so far? Print our progress!
		console.log(`Scraping: ${url}`);

		// Define the content we want to scrape...
		const page = await browser.newPage();
		await page.goto(url);

		// Note: the evaluate callback runs inside the page, not in Node.
		const partnersOnPage = await page.evaluate(() =>
			Array.from(document.querySelectorAll("div.compact"))				// Each partner card
				.map(compact => ({
					title: compact.querySelector('h3.title').innerText.trim(),	// Partner name
					logo: compact.querySelector('.logo img').src				// Partner logo
				}))
		);
		await page.close();
		
		// Recursively scrape the next page, if there is one!
		if (partnersOnPage.length < 1) {

			// Print our final destination and the time we took...
			console.log(`Terminated recursion on: ${url}`);
			console.timeEnd('Total Scrape Time');

			// Quit recursion because we're done!
			return partnersOnPage;

		} else {

			// Don't stop now, go and fetch the next page...
			const nextPageNumber = parseInt(url.match(/page=(\d+)$/)[1], 10) + 1; // ?page=X+1
			const nextUrl = `https://marketingplatform.google.com/about/partners/find-a-partner?page=${nextPageNumber}`;

			return partnersOnPage.concat(await extractPartners(nextUrl));
		}
	};
	
	// Headless parameters launch the mission!
	const browser = await puppeteer.launch();
	const firstUrl = "https://marketingplatform.google.com/about/partners/find-a-partner?page=40"; // Not starting from page 1 because it's long!
	const partners = await extractPartners(firstUrl);
	
	// Store the data results into a JSON file.
	fs.writeFile('GooglePartners.json', JSON.stringify(partners), function(err) {
		if (err)
			console.log('Uh oh! We failed to export the data!');
		else
			console.log('All data successfully exported - Have fun!');
	});

	// console.log(partners); // Spit the results out to screen instead!

	await browser.close();

})(); // ScrapeGooglePartners.js
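For reference, the resulting GooglePartners.json is a flat array of { title, logo } objects. The entries below are hypothetical placeholders to show the shape, not real scraped data:

[
	{ "title": "Example Agency Ltd", "logo": "https://example.com/logo.png" },
	{ "title": "Another Partner Co", "logo": "https://example.com/another-logo.png" }
]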

Ensure that you have both ‘package.json’ (shown below) and ‘ScrapeGooglePartners.js’ within the same directory and that Node is correctly installed.

package.json

{
  "name": "Scrape_Google_Partners",
  "version": "0.0.1",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "Thomas EC. Smith",
  "license": "ISC",
  "dependencies": {
    "puppeteer": "^1.3.0"
  }
}
Then simply execute the following in Command Prompt:

npm install
node ScrapeGooglePartners
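Optionally, if you prefer human-readable output, JSON.stringify accepts an indent argument; swapping the writeFile line in the script for the variant below pretty-prints the file (a small tweak, not part of the original listing):

// Same call, but with two-space indentation for readability:
fs.writeFile('GooglePartners.json', JSON.stringify(partners, null, 2), function(err) {
	if (err)
		console.log('Uh oh! We failed to export the data!');
	else
		console.log('All data successfully exported - Have fun!');
});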