Challenge (part two): Web Scraper

The Task:

Write a PHP application that accepts a URL. Download the page the URL references. The page contents should then be broken into two parts. The first part determines all the different kinds of HTML tags on the page and the frequency counts for each. The second part determines all the different words that aren’t part of the HTML on the page and the frequency counts for each. The results from the two parts should be stored in a database.

There are a number of reasons why someone would want to do this. Part of this challenge is to create re-usable code, but the main aim is to use best practice code and style to achieve the task, in an efficient, understandable and coherent manner.

In part one of this challenge, we built our simple ORM classes for storing well-structured entities in a database, and here we are going to build on that to store the results of our page scraping in a set of tables.

Database Design

Firstly let’s do some database design. We need to store URLs, counts of tags on the page, and counts of words on the page that aren’t part of the HTML markup.

Let’s assume that we are writing this for a small application, and will only be scraping up to a few hundred sites. This allows us to make some assumptions about database capacity and performance, such as column sizes and the choice of database engine. We will also begin with the assumption that this scraper will only scrape basic HTML pages – any largely dynamic pages (built with JavaScript or Flash) will not be processed very well, as they tend to offer less fixed HTML up front, relying on the browser to enrich the page by making subsequent requests and modifying it after the initial load.

So we’ve created a database that will allow us to store URLs, Tags and Content values, and the number of times we have seen each. We’ve put in some referential integrity constraints to prevent us from doing silly things by accident, such as trying to enter two different counts for one Tag/Scrape instance. We could go one step further and create a separate table for the values we scrape, achieving third normal form in our data structure – but at this point that would be overkill and premature optimisation of our system.

Also, I have not added any extra column indexes to the tables, as we don’t yet have an idea of how the data will be used – we can add these once we have an established working prototype and want to optimise how we are using the results.
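
A minimal sketch of such a schema, assuming MySQL and table names that mirror the entity classes introduced later (page, type and count – the column names and sizes here are illustrative guesses, not the exact schema), might look like this:

    <?php
    // Sketch of a possible schema. Table and column names are assumptions
    // chosen to match the entity classes used later; sizes are kept small
    // because we only expect a few hundred scrapes.
    $pdo = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'pass');

    $pdo->exec("
        CREATE TABLE IF NOT EXISTS `page` (
            id         INT UNSIGNED NOT NULL AUTO_INCREMENT,
            url        VARCHAR(255) NOT NULL,
            scraped_at DATETIME     NOT NULL,
            PRIMARY KEY (id)
        ) ENGINE=InnoDB");

    $pdo->exec("
        CREATE TABLE IF NOT EXISTS `type` (
            id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
            name VARCHAR(32)  NOT NULL,          -- e.g. 'tag' or 'word'
            PRIMARY KEY (id)
        ) ENGINE=InnoDB");

    $pdo->exec("
        CREATE TABLE IF NOT EXISTS `count` (
            id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
            page_id INT UNSIGNED NOT NULL,
            type_id INT UNSIGNED NOT NULL,
            value   VARCHAR(128) NOT NULL,       -- the tag name or word itself
            tally   INT UNSIGNED NOT NULL DEFAULT 0,
            PRIMARY KEY (id),
            -- one count per value per scrape, per the integrity constraint above
            UNIQUE KEY uq_page_type_value (page_id, type_id, value),
            FOREIGN KEY (page_id) REFERENCES `page` (id),
            FOREIGN KEY (type_id) REFERENCES `type` (id)
        ) ENGINE=InnoDB");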

System Design

So let’s now design our system. Here is some pseudo-code to establish what we’re going to do; a rough PHP sketch of the same flow follows the list.

  1. enter URL to connect to
  2. verify that the given URL is one we are willing to connect to
  3. connect to the site and check its robots.txt – abort if robots are not allowed
  4. connect to the URL and download the content in full
  5. run an XML parser to analyse and pull apart our downloaded HTML
  6. save the scraped page details to the database
  7. analyse the tags, save the counts to the database
  8. analyse the tag content, save the counts to the database
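
Translated into rough PHP, the flow might look something like the sketch below; the helper functions it calls are illustrative names, sketched in the sections that follow, and the EntityHandler wrapping a PDO connection is an assumption carried over from part one.

    <?php
    // Rough procedural translation of the steps above. The helper functions
    // are illustrative and are sketched in the sections that follow.
    $handler = new EntityHandler(new PDO('mysql:host=localhost;dbname=scraper', 'user', 'pass'));

    $url = $argv[1] ?? '';                            // 1. URL supplied by the user

    if (!isAcceptableUrl($url)) {                     // 2. verify the URL
        exit("URL rejected\n");
    }
    if (!isAllowedByRobots($url)) {                   // 3. honour robots.txt
        exit("Scraping disallowed by robots.txt\n");
    }

    $html = file_get_contents($url);                  // 4. download the content in full

    $tagCounts  = countTags($html);                   // 5 & 7. parse and count the tags
    $wordCounts = countWords($html);                  // 5 & 8. count the non-markup words

    saveResults($handler, $url, $tagCounts, $wordCounts);   // 6-8. persist the results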

Verification

The URL that has been passed into our system may not be in a form that we wish to accept – we may wish to prevent users from using IP addresses, or from supplying a URL with an embedded username and password.
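
A minimal verification step, assuming we only want plain http/https URLs with a hostname (no IP addresses, no embedded credentials), might look something like this:

    <?php
    // Minimal URL verification: reject anything that isn't plain http/https,
    // uses an IP address as the host, or embeds a username/password.
    function isAcceptableUrl(string $url): bool
    {
        if (filter_var($url, FILTER_VALIDATE_URL) === false) {
            return false;
        }
        $parts = parse_url($url);

        if (!in_array($parts['scheme'] ?? '', ['http', 'https'], true)) {
            return false;                        // only plain web URLs
        }
        if (isset($parts['user']) || isset($parts['pass'])) {
            return false;                        // no embedded credentials
        }
        if (filter_var($parts['host'] ?? '', FILTER_VALIDATE_IP) !== false) {
            return false;                        // no bare IP addresses
        }
        return true;
    }

    var_dump(isAcceptableUrl('https://example.com/page'));    // bool(true)
    var_dump(isAcceptableUrl('http://user:pw@example.com'));  // bool(false)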

Robots.txt

You’ll have noticed that I’ve included a check for ‘robots’ – this is an internet standard that has been around for many years; you can read more about it on the robotstxt.org site. I’ve chosen not to scrape sites that use this mechanism to object to being automatically scraped. You will see the code below contains checks for this.
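
As a rough illustration of the idea, a simple check might fetch /robots.txt and look for a blanket Disallow rule under User-agent: * – real robots.txt parsing needs to handle per-path rules and specific user agents, so treat this as a sketch only:

    <?php
    // Very rough robots.txt check: fetch /robots.txt and see whether the
    // "User-agent: *" group disallows everything. Illustrative only.
    function isAllowedByRobots(string $url): bool
    {
        $parts     = parse_url($url);
        $robotsUrl = ($parts['scheme'] ?? 'http') . '://' . ($parts['host'] ?? '') . '/robots.txt';

        $robots = @file_get_contents($robotsUrl);
        if ($robots === false) {
            return true;                  // no robots.txt means no objection
        }

        $appliesToUs = false;
        foreach (preg_split('/\R/', $robots) as $line) {
            $line = strtolower(trim($line));
            if (strpos($line, 'user-agent:') === 0) {
                $appliesToUs = (trim(substr($line, 11)) === '*');
            } elseif ($appliesToUs && strpos($line, 'disallow:') === 0) {
                if (trim(substr($line, 9)) === '/') {
                    return false;         // everything is disallowed
                }
            }
        }
        return true;
    }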

XML Parser

We’re going to use the built-in DOMDocument parser to parse our HTML. It seems to be the most appropriate library for the job: there is a lot of bad HTML in the wild, and DOMDocument is fairly fault-tolerant and easy to use. HTML is not always XML compliant, and some XML parsers will fail to parse it because of trivial shortcuts that programmers take when writing HTML, like not closing tags properly, or using attributes that have no value, e.g. <script src="..." async defer>. This behaviour is not XML compliant.
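
As a sketch of the approach, counting tags and the non-markup words with DOMDocument might look like this (libxml errors are silenced because real-world pages are rarely well-formed):

    <?php
    // Sketch: parse downloaded HTML with DOMDocument and count tag and word
    // frequencies. libxml warnings are suppressed so broken markup doesn't
    // flood the output.
    function countTags(string $html): array
    {
        $doc = new DOMDocument();
        libxml_use_internal_errors(true);     // tolerate malformed markup quietly
        $doc->loadHTML($html);
        libxml_clear_errors();

        $counts = [];
        foreach ($doc->getElementsByTagName('*') as $element) {
            $tag = strtolower($element->tagName);
            $counts[$tag] = ($counts[$tag] ?? 0) + 1;
        }
        return $counts;
    }

    function countWords(string $html): array
    {
        $doc = new DOMDocument();
        libxml_use_internal_errors(true);
        $doc->loadHTML($html);
        libxml_clear_errors();

        // textContent gives the text nodes only, i.e. the words that are not
        // part of the markup (note: this naive sketch also counts the contents
        // of any script or style elements).
        $counts = [];
        foreach (str_word_count(strtolower($doc->textContent), 1) as $word) {
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
        return $counts;
    }

    print_r(countTags('<html><body><p>Hi</p><p>there there</p></body></html>'));
    // e.g. Array ( [html] => 1 [body] => 1 [p] => 2 )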

ORM Entities

Our system has three entity classes: TPage, TType and TCount. These classes will extend the abstract class AbstractEntity and will be read from and written to the database using a class named EntityHandler – this will take care of the heavy lifting and database interactions.
Let’s see what they look like…
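
In outline, and assuming the AbstractEntity base from part one exposes simple public properties plus a table-name hook (the property names and exact contract here are illustrative assumptions), they might look something like this:

    <?php
    // Sketch of the three entity classes. Property names and the exact
    // AbstractEntity/EntityHandler contract come from part one of the
    // challenge; everything shown here is an illustrative assumption.
    abstract class AbstractEntity
    {
        /** @var int|null Primary key, populated once the entity is saved. */
        public $id;

        /** Name of the table this entity maps to. */
        abstract public function tableName(): string;
    }

    class TPage extends AbstractEntity
    {
        public $url;
        public $scrapedAt;

        public function tableName(): string { return 'page'; }
    }

    class TType extends AbstractEntity
    {
        public $name;                              // e.g. 'tag' or 'word'

        public function tableName(): string { return 'type'; }
    }

    class TCount extends AbstractEntity
    {
        public $pageId;
        public $typeId;
        public $value;                             // the tag name or word counted
        public $tally;                             // how many times it was seen

        public function tableName(): string { return 'count'; }
    }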

The Code

Here’s the class definition for our scraper. It contains all the features we’ve described above, commented and ready to be used.
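
A condensed skeleton of how such a class might be organised, reusing the routines sketched in the earlier sections, looks something like this (the class layout and method names are illustrative, not the definitive implementation):

    <?php
    // Condensed, assumed skeleton of the scraper class: it simply ties
    // together the helper sketches from the earlier sections.
    class Scraper
    {
        private $handler;                          // EntityHandler from part one

        public function __construct(EntityHandler $handler)
        {
            $this->handler = $handler;
        }

        /** Run the whole pipeline for a single URL. */
        public function scrape(string $url): void
        {
            if (!isAcceptableUrl($url)) {          // see Verification
                throw new RuntimeException("URL rejected: $url");
            }
            if (!isAllowedByRobots($url)) {        // see Robots.txt
                throw new RuntimeException("robots.txt disallows scraping $url");
            }

            $html = file_get_contents($url);       // download the page in full
            if ($html === false) {
                throw new RuntimeException("Could not download $url");
            }

            // Counting sketches are under 'XML Parser'; persistence is
            // sketched under 'Saving the Results'.
            saveResults($this->handler, $url, countTags($html), countWords($html));
        }
    }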

Saving the Results

This is a basic working prototype for a web-page scraper. It accepts a URL, processes the given page, and counts the number of Tag types and Words used in the page. In order to store these results, we create instances of our entity classes and save them to the database using our EntityHandler class.
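
Assuming the EntityHandler from part one exposes a save() method that inserts an entity and populates its id (an assumption – use whatever the real API provides), the save step might look like this:

    <?php
    // Sketch of the save step: one TPage row per scrape, a TType row per kind
    // of thing counted, and a TCount row per distinct tag or word.
    // EntityHandler::save() is assumed to insert the entity and set its id.
    function saveResults(EntityHandler $handler, string $url,
                         array $tagCounts, array $wordCounts): void
    {
        $page = new TPage();
        $page->url       = $url;
        $page->scrapedAt = date('Y-m-d H:i:s');
        $handler->save($page);                         // step 6: the scraped page

        foreach (['tag' => $tagCounts, 'word' => $wordCounts] as $typeName => $counts) {
            $type = new TType();
            $type->name = $typeName;
            $handler->save($type);

            foreach ($counts as $value => $tally) {    // steps 7 and 8: the counts
                $count = new TCount();
                $count->pageId = $page->id;
                $count->typeId = $type->id;
                $count->value  = (string) $value;
                $count->tally  = $tally;
                $handler->save($count);
            }
        }
    }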

Alternatives

There’s a world of options out there for page scraping – it is, after all, how search engines operate. They connect to a website, pull useful information from it, including links to other pages or websites, and then connect to those as well. This tutorial was written in a couple of days by a single developer – and the simplicity of the classes reflects that.

Enjoy :)
