Web Scraping Using Node.js

One of the projects I’m working on requires us to download web pages for some security analysis. Not only do we need to download and store the main HTML source code, but also the other resources the page pulls (e.g. images, stylesheets).

While this could be done with PhantomJS for more accurate results (it also pulls resources fetched via Ajax, for example), we need to do it fast, so a bit of compromise is allowed.

We also need to allow the analysts to download an archive of the source code (HTML + resources), and to ensure there is only one single archive for the same source code. With zip archives, for example, two zip files containing the same files but with different timestamps are essentially different archives (you can inspect the differences with a hex viewer). Therefore, we need to take special care of this case.
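To illustrate the "one single archive" goal: if archives are byte-for-byte identical for the same content, a content hash can act as the archive's identity and duplicates can be detected. This is a minimal sketch (not from the original setup) using Node's built-in crypto and fs modules; the path is a placeholder.

var fs = require('fs');
var crypto = require('crypto');

// Compute a SHA-256 checksum of an archive. Identical archives
// (same files, same timestamps) yield the same digest, so the
// digest can be used to detect and skip duplicate archives.
function archiveChecksum(zipPath, callback) {
  var hash = crypto.createHash('sha256');
  fs.createReadStream(zipPath)
    .on('data', function (chunk) { hash.update(chunk); })
    .on('end', function () { callback(null, hash.digest('hex')); })
    .on('error', callback);
}

archiveChecksum('/your/screenshot/dir.zip', function (err, digest) {
  if (err) throw err;
  console.log('archive checksum:', digest);
});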

Scraping

This is fairly easy with the website-scraper Node module (npm install website-scraper).

To use:

var scraper = require('website-scraper');

scraper.scrape({
  // Pages to download
  urls: [ 'http://example.com' ],
  // Where the HTML and its resources are saved
  directory: '/your/screenshot/dir',
  // Options passed to the underlying HTTP request, e.g. a custom User-Agent
  request: {
    headers: {
      'User-Agent': 'Mozilla/5.0 (Linux; Androi...'
    }
  }
});
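The call above discards the return value. Depending on the module version, scrape() returns a promise (older versions also accept a callback), so a more robust version would wait for it; the handlers below are a sketch, not taken from the original setup.

scraper
  .scrape({
    urls: [ 'http://example.com' ],
    directory: '/your/screenshot/dir'
  })
  .then(function (result) {
    // The scrape finished; result describes the saved pages
    console.log('scraping finished', result);
  })
  .catch(function (err) {
    console.error('scraping failed', err);
  });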

Archiving the source code

We use the archiver module for this (npm install archiver).

In short:

var fs = require('fs');
var archiver = require('archiver');

var zipFile = fs.createWriteStream('/your/screenshot/dir.zip');
var archive = archiver.create('zip', {});

// Pipe the archive output into the zip file before adding entries
archive.pipe(zipFile);

archive
    // It's very important we set the same date for each file,
    // to ensure that zip files containing the same files will have
    // the same checksum
    .directory('/your/screenshot/dir', '/', { date: new Date(0) })
    .finalize();

Notice how the timestamp of each file entry is explicitly set, to ensure that archiving the same set of files always produces an identical archive.

Full source
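Putting the two steps together, an end-to-end version might look like the sketch below (paths and the URL are placeholders, and it assumes the promise-based scrape() API shown above). Listening for 'close' on the write stream is one way to know the zip has been fully written before handing it to an analyst.

var fs = require('fs');
var scraper = require('website-scraper');
var archiver = require('archiver');

var pageDir = '/your/screenshot/dir';
var zipPath = '/your/screenshot/dir.zip';

scraper
  .scrape({
    urls: [ 'http://example.com' ],
    directory: pageDir
  })
  .then(function () {
    var zipFile = fs.createWriteStream(zipPath);
    var archive = archiver.create('zip', {});

    zipFile.on('close', function () {
      console.log('archive written to', zipPath);
    });

    archive.pipe(zipFile);
    archive
      // Fixed per-entry date so identical content yields identical archives
      .directory(pageDir, '/', { date: new Date(0) })
      .finalize();
  })
  .catch(function (err) {
    console.error('failed:', err);
  });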