June 03, 2011

Liberating data using ScraperWiki

Of all the wiki sites that sprang up after the original, one of the most useful and positively cool is ScraperWiki. ScraperWiki is an attempt to liberate data from websites and PDFs and move it into spreadsheets instead.

There is a lot of data available on the net, but its value is severely limited by the fact that you cannot do much more than browse it. When you move data from an HTML page or a PDF file into a spreadsheet, the value of the data suddenly goes up manyfold: now you can analyze it, sort it, look for trends and coax information out of it. ScraperWiki helps with that first step, scraping web pages and turning them into usable data sets.
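Even without ScraperWiki, the core idea fits in a few lines of Python. Here is a minimal sketch that pulls the cells out of an HTML table and writes them to a CSV file; the URL and the table layout are made up for illustration.

```python
import csv
import urllib.request
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, one row per <tr>."""
    def __init__(self):
        super().__init__()
        self.rows = []   # finished rows
        self.row = None  # row being built, or None outside a <tr>
        self.cell = None # cell text being built, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.cell = ""

    def handle_data(self, data):
        if self.cell is not None:
            self.cell += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.row is not None:
            self.row.append(self.cell.strip())
            self.cell = None
        elif tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None

# Hypothetical page holding the data as an HTML table.
html = urllib.request.urlopen("http://example.com/stations.html").read().decode("utf-8")
parser = TableParser()
parser.feed(html)

# The liberated data: every table row, now a CSV you can actually work with.
with open("stations.csv", "w", newline="") as f:
    csv.writer(f).writerows(parser.rows)
```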

ScraperWiki is two things. First, it is a web-based editor and runtime, with reusable libraries (in Python, Ruby or PHP), that lets you write and run a scraper right in the browser. Second, it is a wiki-style store of scrapers written by others that you can update, reuse or simply run to get the data.
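For flavor, here is roughly what a ScraperWiki scraper looked like in Python, going from my recollection of its library: scraperwiki.scrape() fetched a page and scraperwiki.sqlite.save() stored rows in the wiki's datastore. Treat this as a sketch; the URL, the table structure and the column names are hypothetical.

```python
import scraperwiki
import lxml.html

# Fetch the page (the URL is a hypothetical placeholder).
html = scraperwiki.scrape("http://example.com/league-table")
root = lxml.html.fromstring(html)

# Walk the rows of a hypothetical results table.
for tr in root.xpath("//table[@class='results']//tr"):
    cells = [td.text_content().strip() for td in tr.xpath(".//td")]
    if len(cells) >= 2:
        # unique_keys tells the datastore which column identifies a row,
        # so re-running the scraper updates rows instead of duplicating them.
        scraperwiki.sqlite.save(unique_keys=["team"],
                                data={"team": cells[0], "points": cells[1]})
```

The nice part of the wiki model is that once a scraper like this is saved, anyone can re-run it on a schedule or query the resulting data set without writing any code of their own.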

There are quite a few interesting scrapers already. One collects data from weather stations across all of Germany, while another collects the location IDs from Weather.com URLs. Weather is not all scrapers do: one collects basic info about all MLB players, while another is a massive database of all soccer World Cup matches.

Of all the untold millions spent by governments and corporations on digitizing their data and building web pages, a decent portion went towards turning data sets into HTML tables. ScraperWiki is an attempt to reverse that. Cheers to liberating data from the shackles of the web.
