Wednesday, September 2, 2015

Scraping Data From A Reluctant PDF

PDFs and civic hackers are not good friends.

The post “My First Hackathon: Learning to Extract” explains the civic hackers' animosity toward PDFs (portable document format computer files) this way:
“Over the weekend of January 16-18 [2014] the Sunlight Foundation sponsored a PDF Liberation Hackathon which I participated in as a non-developer. The event focused on the problem that many organizations have reams of data in PDF. This is an issue because PDF does not allow the user to interact with the document in word searches and other ways; the data they contain is essentially locked up and difficult to manipulate. This means that a lot of data, including congressional financial disclosures and non-profit expenditures, is not easily viewable, and cannot be used to create tables, graphs, diagrams, and apps. The goal of the event was to make progress in unlocking these documents by testing and tweaking different software to convert documents into more usable formats...”
This excerpt from “Beyond Transparency: Open Data and the Future of Civic Innovation” describes a specific case where civic hacking needs to be able to work with public data that is only available in PDF format:
“...to realize the full potential requires more than simply declaring a dataset open and putting a PDF version on a website. First of all, the data must be not only open and available, but also in a useful (and preferably machine-readable) format... a list of crime reports in an Excel format is not that helpful for a parent trying to understand whether the route her child takes to school every day is safe. But when that list of crime incidents is mapped, the information becomes much more consumable. The data become even more useful when the parent can input his child’s route to school and a system displays only the crimes reported within a five block radius of that route...”
Getting information or data from PDFs for civic hacking is a subset of the ‘scraping’ issue I talked about in “Intro To Data Scraping For Civic Hackers.” But because so much government information is tied up in PDFs, getting data from PDFs is, for many civic hackers, synonymous with ‘scraping data.’ Scraping PDFs is so important for open data and civic hacking that the Sunlight Foundation and others organized the PDF Liberation Hackathon in 2014.

Two blog posts from the Sunlight Foundation, “PDF Liberation: Why it matters and how you can help” and “PDF Liberation Hackathon and the need for more civic innovation,” cover the 2014 hackathon activities and planned future work. That hackathon generated a few worthwhile projects, but even more importantly, it connected many people around the US and around the world who are interested in scraping PDFs. If you want to know what’s happening with PDF liberation tools or want to work on improving those tools, check out the PDF Liberation GitHub page.

If you’re interested in the specifics of converting PDF information into a usable format, do a Google search with the keywords specific to your situation or check out the following resources:

  1. "Five tools to extract "locked" data in PDFs" -- this is from 2013, but I couldn’t find a more recent list of data-from-PDF tools. 
  2. "How to Extract Data from Tables in PDFs with Tabula and OpenRefine" -- one of the five tools mentioned in the previous item was Tabula. In addition to showing how to use Tabula, this post highlights data quality issues caused by scraping PDFs and shows how to use OpenRefine to improve the data quality.
  3. PDF Liberation Hackathon website -- scroll down on this website and you’ll find scads of tools listed.
  4. ScraperWiki blog -- search for PDF on the ScraperWiki blog and you’ll find lots of info related to digging data out of PDFs, such as “The four kinds of data PDF.”
  5. “Get Started With Scraping – Extracting Simple Tables from PDF Documents” -- this post is for coders; it has lots of details about using ScraperWiki / Python for scraping a PDF.
  6. “Last chance saloon: Manually converting a PDF to .csv format” -- this post walks you through the painful process of manually building a csv data file (comma separated values) from a PDF that is resistant to all the scraping tools you try.
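
For the coders, here is a rough sketch of the kind of cleanup step that often sits between a PDF and a usable csv file. It is not taken from any of the posts listed above. It assumes you have already turned the PDF into plain text with a layout-preserving converter (for example, pdftotext with its -layout option) and that the table columns are separated by two or more spaces. The sample text, the function names, and the column count are all made up for illustration.

```python
import csv
import io
import re

# Hypothetical sample: text as it might look after a layout-preserving
# PDF-to-text conversion. These rows are invented for illustration only.
SAMPLE_TEXT = """\
Department          Year    Amount
Parks and Rec       2014    12,500
Public Works        2014    98,000

Page 1
"""


def text_table_to_rows(text, expected_columns):
    """Split each line on runs of two or more spaces.

    Lines that do not produce exactly `expected_columns` cells
    (blank lines, page footers, stray headings) are skipped.
    """
    rows = []
    for line in text.splitlines():
        cells = re.split(r"\s{2,}", line.strip())
        if len(cells) == expected_columns:
            rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Return the rows as csv text; the csv module quotes cells containing commas."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = text_table_to_rows(SAMPLE_TEXT, expected_columns=3)
print(rows_to_csv(rows))
```

Splitting on two or more spaces keeps multi-word cells like "Parks and Rec" together, but it will break on tables where columns are separated by a single space or where cells wrap onto a second line. Real PDFs usually need more cleanup than this, which is where tools like Tabula and OpenRefine come in.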

In my mind, future civic hacking work on the issue of PDFs has two aspects. The first aspect is working with government agencies to move toward making more data available in a standard open data non-PDF format. If the government has specific reasons for using the PDF format, they should also provide information that’s agreed to be important for civic hacking in an open format on an ‘open data’ webpage.

The second aspect of working on PDFs is for civic hackers to become skilled with PDF-scraping tools and to continue improving tools for extracting data from PDFs. We’re likely to see some interesting or useful government data available only in PDF format for many years, so both skill with these tools and better tools are important.

I’m going to propose to the Appleton Makerspace Coder Cooperative and the Fox Valley Python User Group that the two groups do a collaborative PDF-scraping session to walk people through the process of extracting data from a PDF. If that topic is of interest to a significant number of people, maybe we can even do a session where we use several different tools to extract the same data from a PDF and document pros and cons for each tool. If those PDF-scraping sessions are held, I’ll work with the coders at the sessions to write up one or several posts documenting our work.

For NE Wisconsin civic hackers and other people interested in extracting data from PDFs, let’s reserve March 5, 2016 on our schedules for an Open Data Day Hackathon (this seems to be the successor to the 2014 PDF Liberation Hackathon). Before March 2016, we should also schedule a couple of data-scraping workshops and meetups to develop an 'Intro To PDF Data Liberation' session. We can have a March 5 hackathon track with that Intro session for participants who want to learn PDF data-scraping, and we can offer the session at other times if there’s interest.

Many journalists are interested in extracting information from PDFs, as mentioned in the DHMN Civic Hacks post "More Civic Hackers In NE Wisconsin." So one of my goals for the March 5 hackathon will be to make sure NE Wisconsin journalists are all aware of the opportunity to learn more about scraping PDFs!

People who read all the way through this post may look at PDFs differently now and wonder how hard it would be to scrape their data!  :)
