New Reporting Tool DocHive Extracts Data from Documents
Posted: 4/4/2013  |  By: Nu Yang
Collecting data for stories can be time-consuming, with reporters going through numbers and plugging them into spreadsheets by hand, but two brothers (one an editor and the other a full-time programmer) say they have a solution.

Charles C. Duncan Pardo is the founding editor of the Raleigh (N.C.) Public Record, an online nonprofit news organization dedicated to local public service and watchdog journalism. He is also a part-time graduate student at Duke University, where he is creating his own journalism program.  

While covering campaign finance returns, Duncan said the small news operation encountered a big problem. The county’s board of elections filed its returns on paper, scanned them, and posted them online as PDF images.  

“It made it impossible to do any analysis and reporting,” he said, adding that each piece of data had to be reviewed and entered into a spreadsheet by hand. The news organization didn’t have the staff or budget to hire anyone to do the data entry.  

In 2010, Duncan said he and his brother, Edward, started to throw around ideas on how to solve this problem. Last summer, Duncan said his brother had an “epiphany.” Why not break the images into components?  

The two brothers created an open-source program called DocHive, which pulls data from documents and enters it into a spreadsheet. Duncan said DocHive converts each PDF into an image file, then uses a template to break the page into smaller sections. The software applies optical character recognition to those sections to read the numbers and words, then inserts them into a spreadsheet. Users define the template in XML to capture whatever data they need.
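
The workflow Duncan describes (template, page sections, character recognition, spreadsheet) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not DocHive's actual code: the XML layout, the field names, and the functions parse_template, pixel_box, extract_row and rows_to_csv are all hypothetical, and the recognition step is a stand-in that a real tool would replace with an OCR engine.

```python
import csv
import io
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, NamedTuple, Tuple

# Hypothetical template format (NOT DocHive's real schema): each <field>
# marks a region of the page as fractions of its width and height.
TEMPLATE_XML = """
<template>
  <field name="donor"  x="0.10" y="0.20" w="0.50" h="0.05"/>
  <field name="amount" x="0.70" y="0.20" w="0.20" h="0.05"/>
</template>
"""

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


class Field(NamedTuple):
    name: str
    x: float
    y: float
    w: float
    h: float


def parse_template(xml_text: str) -> List[Field]:
    """Read every <field> element from the template into a Field."""
    root = ET.fromstring(xml_text)
    return [
        Field(el.get("name"), *(float(el.get(k)) for k in ("x", "y", "w", "h")))
        for el in root.iter("field")
    ]


def pixel_box(field: Field, page_w: int, page_h: int) -> Box:
    """Scale a field's fractional region to pixel coordinates."""
    return (
        round(field.x * page_w),
        round(field.y * page_h),
        round((field.x + field.w) * page_w),
        round((field.y + field.h) * page_h),
    )


def extract_row(page, fields: List[Field], size: Tuple[int, int],
                ocr: Callable[[object, Box], str]) -> Dict[str, str]:
    """Run the supplied OCR callable on each section; return one row."""
    return {f.name: ocr(page, pixel_box(f, *size)).strip() for f in fields}


def rows_to_csv(rows: List[Dict[str, str]], columns: List[str]) -> str:
    """Write the extracted rows as CSV text, one column per field."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    # Stand-in for a scanned page: maps each pixel box to the text that a
    # real OCR engine would be expected to read from that section.
    fake_page = {(100, 400, 600, 500): " Jane Doe ",
                 (700, 400, 900, 500): "250.00\n"}
    fake_ocr = lambda page, box: page[box]

    fields = parse_template(TEMPLATE_XML)
    row = extract_row(fake_page, fields, (1000, 2000), fake_ocr)
    print(rows_to_csv([row], [f.name for f in fields]))
```

Passing the recognition step in as a function keeps the sketch runnable without any OCR software installed. In practice that function would crop the page image to the given box and hand the crop to an OCR engine.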

A beta version of DocHive was released at the 2013 Computer Assisted Reporting Conference, presented by Investigative Reporters and Editors and the National Institute for Computer Assisted Reporting.

“NICAR is the exact target of our users,” Duncan said. “We wanted to get contributors who have some great ideas … it was well-timed and gave us a good deadline.”  

Duncan said DocHive can help newspapers access data from flat image files and save money, because they no longer have to hire outside help to go through the data. “It will help us for our upcoming local election year,” he said.