Recently, I had to import data from a Adobe Acrobat file (PDF). Of course, with all those wonderful demos I had seen, my first reflex was to go with Power Query.
I don’t know if the option will come in a future update but for the moment, the PDF file format is not supported.
In case you haven’t heard yet of Power Query, it’s a free (for the moment, at least) add-in for Excel which eases the task of discovering and accessing data from various sources, the most interesting aspect surely being the possibility to catch data from tables on a web page. Power Query can be downloaded from Microsoft: http://www.microsoft.com/en-us/download/details.aspx?id=39379.
But so be it, I found my solution in another Office product: Word 2013.
- Open the PDF document from Word 2013. You can choose to show only the PDF files to speed-up your search. For this demonstration, I use the Canadian Consumer Price Index report for October 2013: http://www.statcan.gc.ca/pub/62-001-x/62-001-x2013010-eng.pdf.
- You can get a message warning you that Word will convert the PDF document in editable Word document. No problem, that’s exactly what we want (my version of Office is in French, so, sorry about that).
- If you get a warning telling you the document is protected, activate the editing.
- Now, save the document and take the time to check that the Web Page (HTML) format is specified. That’s what will help us use the document with Power Query.
- This ends the Word part. Now, start Excel and create a new spreadsheet.
- Select the Power Query tab and choose the « From Web » option.
- A dialog window will ask for an URL address. This URL can be a local file.
- The Navigator will appear on the right, displaying the multiple tables found on that page. For this demo, i chose the third one and clicked on the Load button (located at the bottom of the Navigator)
As of the writing of this article, I was using the November 2013 update of Power Query. In the French version of the document, Power Query detected unwanted characters. Those do not appear in the original HTML document, so we can hope that problem willl be solved in a future version. Here, I simply used the Search & Replace function to remove those. I also corrrected manually some slight imperfections, but still, that was faster than typing everything from scratch. It’s not ideal, I admit it, but it works.
If the table is an image, you’re on your own and the only solution left is to use an optical character recognition software.