Screen Scraping (Harvesting)

Posted by Methylated on January 01, 2009

Scraping, also known as **screen scraping**, **web scraping**, **extracting**, or **harvesting**, in this context means harvesting data from one or more websites. The data can be anything the website(s) offer, be it snagging all the search engine results from Google, downloading all the images from a specific gallery, or pulling in dynamic data on an interval (stock quotes, user comments, etc.).

## Science of Scraping ##

Let’s begin with some theory (skip ahead if you’d rather not). Very basically, your browser is a program that downloads (or **reads**) data from an external source (also known as a **resource**). Since you’re reading this, your browser has downloaded a copy of this website’s HTML, images, and CSS, which are hosted somewhere in the world (New Jersey, in this case).

The basic idea behind scraping is telling the computer to download only specific data from a specific place. Simple enough in theory, but it can get complicated, because every website is structured differently: if your scraper is very specific about where certain data lives, it has to be updated whenever that structure changes.
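To make the download step concrete, here is a minimal sketch in Ruby using the standard open-uri library (the URL is just a placeholder):

```ruby
require 'open-uri'

# Read the raw HTML of a resource, exactly as the browser does,
# minus the rendering. (On Ruby 3.0+, call URI.open instead of open.)
html = open('http://example.com').read
puts html.length
```

Everything that follows is about what you do with that string once you have it.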

This seems like a nightmare, and sometimes it can be, but we’ve come far from the PHP/Perl/Regular Expression days of scraping. While using specific regular expressions to pull data from a body of content is alive and well, it’s tedious and hard to maintain. There have been numerous advancements in **the science of scraping**.

## Technik Le Scrape ##

Unfortunately, I’m not aware of any particularly useful tools or apps that can do most of the scraping for you. This means you will need at least some technical literacy, primarily with programming, though nothing beyond the basics is required.

### Scraping Bare: The Old-Fashioned Way ###

The traditional and probably most popular way of ~~stealing~~ collecting data is running Regular Expressions (from here on, **regex**) against the raw content you’re scanning. This means writing very specific patterns that can pick out certain strings in the content.

This is actually kinda fun, which is probably the only reason I still use it from time to time. Also, because it’s tedious and low-level, it gives you **cool points** (bragging rights), or makes you look like an amateur, depending on who you ask. Regular Expressions are very useful and I highly recommend learning them. Nearly every programming language today has a regex library, with some even sporting built-in support. Your editor should also support regex, else you need [a new editor](http://vim.org).
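As a quick sketch (not production code; real-world HTML will happily break naive patterns like this), here is a Ruby regex that pulls the href out of every anchor tag:

```ruby
html = <<-HTML
  <a href="http://example.com/one">One</a>
  <a class="nav" href="/two">Two</a>
HTML

# Capture the value of each href attribute. Deliberately brittle:
# it assumes double quotes and a well-behaved tag body.
links = html.scan(/<a\s[^>]*href="([^"]+)"/i).flatten
puts links  # prints each captured link on its own line
```

The pattern breaks the moment a site switches to single quotes or sneaks a > into an attribute value, which is exactly the maintenance headache described above.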

**Further Reading**
- [http://en.wikipedia.org/wiki/Regular_expression](http://en.wikipedia.org/wiki/Regular_expression)
- [http://www.regular-expressions.info/](http://www.regular-expressions.info/)

### X Marks The Spot ###
A step up from regular expressions is XPath (XML Path Language). This is a “language” that lets you select specific areas (nodes) in the document being scanned. It makes scanning websites, RSS feeds, and anything else with a node/hierarchy structure a breeze, though it won’t work for flat data (log files, for instance).

This is pretty simple stuff. You ignore the fact that the content can be hideously formatted and focus only on its XHTML structure. Think of the [DOM](http://en.wikipedia.org/wiki/Document_Object_Model) if you’re familiar with JavaScript; it’s a similar concept. You traverse nodes to find the data you’re looking for.
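Here’s a small sketch using REXML, which ships in Ruby’s standard library (the feed is inlined so the example stands alone):

```ruby
require 'rexml/document'

xml = <<-XML
<rss><channel>
  <item><title>First post</title><link>http://example.com/1</link></item>
  <item><title>Second post</title><link>http://example.com/2</link></item>
</channel></rss>
XML

doc = REXML::Document.new(xml)

# An XPath expression reads like a file path through the node hierarchy:
# //item/title means "every <title> inside an <item>, at any depth".
REXML::XPath.each(doc, '//item/title') do |node|
  puts node.text
end
```

No escaping, no backtracking, no caring about whitespace; you address the data by its position in the tree.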

**Further Reading**
- [http://en.wikipedia.org/wiki/XPath](http://en.wikipedia.org/wiki/XPath)
- [http://www.w3schools.com/XPath/default.asp](http://www.w3schools.com/XPath/default.asp)

### Hpricot ###
Hpricot is awesome. It’s a fast HTML parser for Ruby, written by why the lucky stiff, that happily swallows the messy, malformed HTML real websites serve and lets you query the resulting tree with CSS selectors or XPath expressions instead of hand-rolled regex.
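A minimal sketch of what that looks like (markup inlined so it runs as-is; install with `gem install hpricot`):

```ruby
require 'rubygems'
require 'hpricot'

html = <<-HTML
<div id="gallery">
  <img src="/img/one.jpg" /><img src="/img/two.jpg" />
</div>
HTML

doc = Hpricot(html)

# doc/"selector" (or doc.search) takes CSS selectors; XPath works too.
(doc / '#gallery img').each do |img|
  puts img.attributes['src']
end
```

Swap the inline string for a page fetched with open-uri and you have a working image scraper in a dozen lines.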

### The Big Guns ###
sCRUBYt takes all of the above and wraps it in a full scraping framework: it pairs Hpricot with WWW::Mechanize, so Mechanize handles the browsing (fetching pages, filling in forms, clicking through links) while a high-level Ruby DSL describes the data you want, often just by example.
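To give you the flavor of that example-driven DSL, here is a sketch adapted from memory of the project’s own demos; exact method names vary between sCRUBYt versions, so treat it as illustrative rather than authoritative:

```ruby
require 'rubygems'
require 'scrubyt'

# Navigate like a user would, then show sCRUBYt one example of the
# data you want; it generalizes that example into an extraction pattern.
data = Scrubyt::Extractor.define do
  fetch          'http://search.ebay.com/'
  fill_textfield 'satitle', 'ipod'
  submit

  record do
    item_name 'APPLE NEW IPOD MINI 6GB MP3 PLAYER SILVER'
    price     '$71.99'
  end
end

puts data.to_xml
```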
