Wednesday, August 26, 2015

Intro To Data Scraping For Civic Hackers

Civic hackers often need to use public information that is provided on the web, but the data might not be in a form that is convenient to gather or use. When that’s the case, they need to do a bit of data scraping. But what is data scraping, you ask? Rather than trying to explain something I don’t totally understand yet, I’ll let you read the following three explanations of data scraping.

This Simple Data-Scraping Tool Could Change How Apps Are Made
“The number of web pages on the internet is somewhere north of two billion, perhaps as many as double that. It’s a huge amount of raw information. By comparison, there are only roughly 10,000 web APIs–the virtual pipelines that let developers access, process, and repackage that data. In other words, to do anything new with the vast majority of the stuff on the web, you need to scrape it yourself. Even for the people who know how to do that, it’s tedious...”
What is Web Scraping?
"Web Scraping (also termed Screen Scraping, Web Data Extraction, Web Harvesting etc) is a technique employed to extract large amounts of data from websites. Data from third party websites in the Internet can normally be viewed only using a web browser. Examples are data listings at yellow pages directories, real estate sites, social networks, industrial inventory, online shopping sites, contact databases etc. Most websites do not offer the functionality to save a copy of the data which they display to your local storage. The only option then is to manually copy and paste the data displayed by the website in your browser to a local file in your computer - a very tedious job which can take many hours or sometimes days to complete. Web Scraping is the technique of automating this process, so that instead of manually copying the data from websites, the Web Scraping software will perform the same task within a fraction of the time."
Civic 101 - Getting your city ready for civic hacking (The Basics)
“Sometimes, civic innovators will want to use data that isn’t available in an [open] data portal. This may be a table or text on a website. When this happens, a civic hacker can “scrape” the data off the site and enter it into a program. Doing this manually would be time consuming, and so civic innovators have developed programs that do this type of work for them. Scraping is by no means the ideal way to get information as the data usually comes out messy and has to be cleaned up. By releasing data on a data portal, it makes the process much easier.”
This post is just an introduction to the concept of data scraping for civic hackers (by someone who's not an experienced scraper), not a guide that explains exactly how to scrape data from a website. I’ll highlight four scraping tools and provide links to guides that do explain exactly how to scrape a website for the data that interests you.
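Before we look at the tools, it may help to see what scraping looks like in code, since all of these tools are doing some version of the same thing. Below is a minimal sketch in Python that fetches a page and pulls every table row into a list. The URL is a placeholder I made up for illustration, and it assumes the requests and BeautifulSoup libraries are installed.

```python
# A minimal scraping sketch: fetch a page and pull table rows out of the HTML.
# The URL below is a placeholder -- point it at a real page with a table.
# Requires: pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

url = "http://example.gov/some-public-table"  # hypothetical page
response = requests.get(url)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    # Collect the text of every cell in this row
    cells = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
    if cells:
        rows.append(cells)

for row in rows:
    print(row)
```

That's the whole idea: instead of copying and pasting each row by hand, a few lines of code (or a point-and-click tool like the ones below) do it for you.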

Scraper

One tool to look at is the Scraper extension for the Google Chrome browser. In “Get started with screenscraping using Google Chrome’s Scraper extension,” Jens Finnäs says:
“How do you get information from a website to an Excel spreadsheet? The answer is screenscraping. There are a number of softwares and platforms (such as OutWit Hub, Google Docs and Scraper Wiki) that help you do this, but none of them are – in my opinion – as easy to use as the Google Chrome extension Scraper, which has become one of my absolutely favourite data tools. I like to think of a screenscraper as a small robot that reads websites and extracts pieces of information. When you are able to unleash a scraper on hundreds, thousands or even more pages it can be an incredibly powerful tool. In its most simple form, the one that we will look at in this blog post, it gathers information from one webpage only...”
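When you highlight elements on a page, the Scraper extension builds an XPath selector behind the scenes. If you're curious what that amounts to in code, here's a rough equivalent using Python's lxml library; the URL and the XPath expression are placeholders for illustration.

```python
# Roughly what the Scraper extension does: apply an XPath selector to a page.
# The URL and the XPath expression below are placeholders for illustration.
# Requires: pip install requests lxml
import requests
from lxml import html

page = requests.get("http://example.gov/meeting-minutes")  # hypothetical page
tree = html.fromstring(page.content)

# e.g., grab the text of every link in the second column of a table
for text in tree.xpath("//table//tr/td[2]/a/text()"):
    print(text)
```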
OutWit Hub

OutWit Hub is similar to the Scraper extension but was built for the Firefox browser. A decent overview article for OutWit Hub is “How to Scrape Websites for Data without Programming Skills.” The article author explains that this extension:
“...allows you to point and click your way through different options to extract information from Web pages. When you fire it up, there will be a few simple options along the left sidebar...You’ll see the source for the Web page. The tagged attributes in the source provide markers for certain types of elements that you may want to pull out. Look through this code for the pattern common to the information you want to get out of the website...Once you find the pattern, put the appropriate info in the “Marker before” and “Marker after” columns. Then hit “Execute” and go to town...”
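That “marker before / marker after” approach is simple enough to mimic in a few lines of code, which may make it clearer what OutWit Hub is doing. Here's a sketch that pulls out every substring sitting between two fixed markers in a page's source; the URL and markers shown are made up for illustration.

```python
# Mimicking OutWit Hub's "Marker before" / "Marker after" extraction:
# find every substring that sits between two fixed markers in the page source.
import re
import requests

source = requests.get("http://example.gov/directory").text  # hypothetical page

marker_before = '<span class="name">'   # placeholder "Marker before"
marker_after = "</span>"                # placeholder "Marker after"

pattern = re.escape(marker_before) + "(.*?)" + re.escape(marker_after)
for match in re.findall(pattern, source, flags=re.DOTALL):
    print(match.strip())
```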
Import.io

The next tool for you to consider is Import.io. “How to scrape data without coding? A step by step tutorial on import.io” explains that Import.io:
“...lets you scrape data from any website into a searchable database. It is perfect for gathering, aggregating and analysing data from websites without the need for coding skills...the idea is to “democratise” data. “We want journalists to get the best information possible to encourage and enhance unique, powerful pieces of work and generally make their research much easier.”...After downloading and opening import.io browser, copy the URL of the page you want to scrape into the import.io browser...”
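Once a service like import.io has scraped a site into a database, you typically get the data back in a structured form such as JSON. As a sketch (the endpoint URL here is hypothetical, not import.io's actual API), consuming that data from Python might look like this:

```python
# Consuming structured data that a scraping service hands back as JSON.
# The endpoint URL is hypothetical -- substitute the one your tool gives you.
import requests

endpoint = "https://api.example.com/extract?url=http://example.gov/listings"
data = requests.get(endpoint).json()

# Assume the service returns a list of records under a "results" key
for record in data.get("results", []):
    print(record)
```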
Kimono

The Wired article linked at the start of this post says that Kimono is:
“...a web app that lets you slurp data from any website and turn it instantly into an API. Using a bookmarklet, you highlight the parts of a site you want to scrape and Kimono does the rest. Those with programming chops can take the code Kimono spits out and bake it into their own apps; for the code illiterate, Kimono will automatically rework scraped data into a dynamic chart, list, or a simple web app. In essence, it’s a point-and-click toolkit for taking apart the web, with the aim of letting people build new things with it.”
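Because a Kimono-style API returns plain JSON, reworking the scraped data for other uses takes very little code. As one example (with a made-up endpoint and field names), here's a sketch that saves the records to a CSV file you could open in a spreadsheet or feed to a charting tool:

```python
# Saving JSON records from a Kimono-style API to CSV for a spreadsheet or chart.
# The endpoint and the "results" key are made up for illustration.
import csv
import requests

records = requests.get("https://api.example.com/kimono/my-scrape").json()["results"]

if records:
    with open("scraped_data.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=sorted(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)
    print("Wrote", len(records), "rows to scraped_data.csv")
```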
Below are the step-by-step scraping guides mentioned in this post, collected in one place for easy reference. If I learn about better scraping guides or good alternatives, I’ll update this list with links to those guides.

“Get started with screenscraping using Google Chrome’s Scraper extension”
“How to Scrape Websites for Data without Programming Skills”
“How to scrape data without coding? A step by step tutorial on import.io”
“This Simple Data-Scraping Tool Could Change How Apps Are Made”

By reading through this blog post, you should have a pretty good idea of what data scraping is for civic hackers, even if you don't feel like an expert at this point. You at least know of a few tools to try out for scraping and some of the keywords to search for on Google when you need more info.

I’d like to have a future post on this blog with several examples of data scraping from NE Wisconsin city or county websites. Mike Putnam mentioned briefly in a blog post that his AppletonAPI scrapes data from the City of Appleton website. Maybe Mike or another civic hacker will write a post that goes through the scraping process step by step as a guide for others who’d like to do the same type of data scraping.

*****
