TECSmith

[Code] Multi-Page Web Scraper

Web Scraper Process

In this example, using Puppeteer (a Node library that drives the headless Google Chrome web browser) to extract certain data elements from a Google Partners web page proved quite simple, yet incredibly powerful, as you will see from the source code provided below:

ScrapeGooglePartners.js

#!/usr/bin/env node

/*
	A creation by Thomas EC. Smith of www.WeBots.London using Google Puppeteer!
	Scraping the Google partners webpage for their partner information.
	
	The goal is to extract the partner names and their company logo from all pages,
	then output this information in an easily readable format.
*/

const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
	
	// On your marks... Get set... Go!
	console.time('Total Scrape Time');
	
	// Extract partners on the page, recursively check the next page.
	const extractPartners = async (url) => {
		
		// How are we doing so far? Print our progress!
		console.log(`Scraping: ${url}`)
		
		// Define the content we want to scrape...
		const page = await browser.newPage();
		await page.goto(url);
		const partnersOnPage = await page.evaluate(() =>
			Array.from(document.querySelectorAll("div.compact"))			// Target elements (class "compact")
				.map(compact => ({
					title: compact.querySelector('h3.title').innerText.trim(), 	// Partner name
					logo: compact.querySelector('.logo img').src				// Partner logo
				}))
		);
		await page.close();
		
		// Recursively scrape the next page, if there is one!
		if (partnersOnPage.length < 1) {
			
			// Print our final destination and the time we took...
			console.log(`terminated recursion on: ${url}`);
			console.timeEnd('Total Scrape Time');
			
			// Quit recursion because we're done!
			return partnersOnPage;
			
		} else {
			
			// Don't stop now, go and fetch the next page...
			const nextPageNumber = parseInt(url.match(/page=(\d+)$/)[1], 10) + 1; // ?page=X+1
			const nextUrl = `https://marketingplatform.google.com/about/partners/find-a-partner?page=${nextPageNumber}`;
			
			return partnersOnPage.concat(await extractPartners(nextUrl));
		}
	};
	
	// Headless parameters launch the mission!
	const browser = await puppeteer.launch();
	const firstUrl = "https://marketingplatform.google.com/about/partners/find-a-partner?page=40" // Not starting from page 1 because it's long!
	const partners = await extractPartners(firstUrl)
	
	// Store the results in a JSON file.
	fs.writeFile('GooglePartners.json', JSON.stringify(partners), function(err) {
		if (err)
			console.log('We failed to export the data:', err.message);
		else
			console.log('All data successfully exported - Have fun!');
	});
	// console.log(partners); // Print the results to the screen!

	await browser.close();

})(); // ScrapeGooglePartners.js
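The page-stepping step in the script (read the page number from the URL, add one, build the next URL) can also be isolated into a small pure function, which makes it easy to try without launching a browser. The following is only a sketch: `nextPageUrl` is a hypothetical helper that is not part of the original script, it assumes the global `URL` class (Node 10 or later), and treating a missing `page` parameter as page 1 is an assumption rather than documented behaviour of the site.

```javascript
// Hypothetical helper (not in the original script): given the URL of the
// current results page, return the URL of the next one.
function nextPageUrl(currentUrl) {
	const url = new URL(currentUrl);
	const current = parseInt(url.searchParams.get('page'), 10);

	// Assumption: a missing or non-numeric "page" parameter means page 1.
	const page = Number.isNaN(current) ? 1 : current;

	url.searchParams.set('page', String(page + 1));
	return url.toString();
}

console.log(nextPageUrl('https://marketingplatform.google.com/about/partners/find-a-partner?page=40'));
```

Unlike the regular expression in the script, this version does not throw when the URL has no page number, at the cost of requiring a newer Node release.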

Puppeteer Topology

Installing Puppeteer

To install Puppeteer, first install Node, then write some boilerplate JavaScript (see ScrapeGooglePartners.js above) that runs in Node and controls the Google Chrome web browser through the puppeteer library. Puppeteer requires Node v7.6.0 or above.
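If you are unsure which Node release is installed, the version requirement can be checked from Node itself. This is a minimal sketch: `meetsMinimum` and `MIN_NODE` are hypothetical names introduced here for illustration, and the only fact taken from the text is the v7.6.0 minimum.

```javascript
// Minimum Node version stated above, as [major, minor, patch].
const MIN_NODE = [7, 6, 0];

// Hypothetical helper: true when a version string such as "v10.4.1"
// is greater than or equal to the given minimum.
function meetsMinimum(version, minimum) {
	const parts = version.replace(/^v/, '').split('.').map(Number);
	for (let i = 0; i < minimum.length; i++) {
		const part = parts[i] || 0;	// missing or non-numeric parts count as 0
		if (part > minimum[i]) return true;
		if (part < minimum[i]) return false;
	}
	return true;	// every part is equal
}

// process.version holds the running Node version, e.g. "v8.11.1".
if (!meetsMinimum(process.version, MIN_NODE)) {
	console.error(`Node ${process.version} is too old; Puppeteer needs v7.6.0 or above.`);
	process.exitCode = 1;
}
```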

package.json

{
  "name": "Scrape_Google_Partners",
  "version": "0.0.1",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "Thomas EC. Smith",
  "license": "ISC",
  "dependencies": {
    "puppeteer": "^1.3.0"
  }
}

Make sure that both ‘package.json’ and ‘ScrapeGooglePartners.js’ are in the same directory and that Node is correctly installed, then execute the following in Command Prompt:

npm install
node ScrapeGooglePartners
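Once the run finishes, the scraper leaves its results in ‘GooglePartners.json’ as a single line of JSON. As a rough sketch of how that file could be turned into something easier to read, the snippet below prints one numbered line per partner. `formatPartners` is a hypothetical helper, and the `title` and `logo` fields are assumed to be exactly those written by the script above.

```javascript
const fs = require('fs');

// Hypothetical helper: convert the scraper's JSON output into readable lines.
function formatPartners(json) {
	const partners = JSON.parse(json);
	return partners.map((partner, i) => `${i + 1}. ${partner.title} (${partner.logo})`);
}

const outputFile = 'GooglePartners.json';
if (fs.existsSync(outputFile)) {
	// Print every partner found by the scraper.
	formatPartners(fs.readFileSync(outputFile, 'utf8')).forEach(line => console.log(line));
} else {
	console.log(`${outputFile} not found; run the scraper first.`);
}
```

Saved under a name of your choice in the same directory, it can be run with node in the same way as the scraper.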