Read and Import Data from PDF Files using Power BI
Data arguably exists as much in stand-alone files as much in database systems. The need to source, link and use this data in different systems is immense. Apart from Microsoft Office file formats, PDF is one of the de-facto standards for publishing files in a standard portable format. The challenge with PDF files is that it can have a variety of content ranging from text, tables, images, hyperlinks etc. Editable PDF files can also have forms which can be used as data submission system for applications that process these files. There are programming frameworks which have libraries that can be used for extracting data from these files. But this again poses the challenge for a non-technical user to simple source data in Power BI reports from his/her PDF file that one would have published typically on a collaboration medium or personal data repository. In this tip, we will look at one of the easiest methods of importing data from PDF files in Power BI Reports.
The new PDF Connector in Power BI enables users to source data from PDF Files.
In this tip, we will go through a series of steps to enable the PDF Connector in Power BI, which is not enabled by default, and then we will read a PDF file and import the data in the report.
The PDF Connector in Power BI is a preview feature as of the publishing of this tip and preview features are not enabled by default, so you may not see the PDF file connector in your data sources list. To verify this, Open the Power BI Desktop, navigate to the Home Menu, and open the Get Data option as shown below.
Click on More and a new dialog with appear. Click on the File option and you would be able to see all the file connectors as shown below. If you have a default installation of Power BI and have not enabled preview features, you will not have any PDF connectors in your list, until this connector becomes a standard feature of Power BI.
So, to enable this connector, we must enable the preview settings for Power BI Desktop. Click on the File menu, click on Options and Settings menu and select Options menu item as shown below.
You will find diverse options in the options dialog box. Click on the preview features menu item, and you will find all the preview features listed here. This list keeps changing and items that are either dropped or promoted as a standard feature may get moved out of this list.
Click on the checkbox item Get Data from PDF Files. You can also click on the hyperlink to read the documentation about this preview feature. Now click the OK button, and you would be presented with an informational dialog that says that you need to restart Power BI to enable this feature. Click the OK button to close this dialog, and then close Power BI and re-open it.
Once you re-open Power BI, repeat the same steps that we did earlier. Check the list of supported File Connectors from the Get Data menu, and this time you should be able to see the PDF File connector (which is in beta as of the publishing of this tip) in the list. This means that we can now use this connector to start reading PDF files and sourcing data from it.
Before we can start using this connector, we need to have a sample file with some text, some tables and may be some other content of interest. In this tip, our focus is to read a PDF file and import a table at a minimum so that we can use this data for analytics and reporting purposes. Once such sample file is available online from here. Let's look at this file. Open the URL in the browser, and you would find that it has 2 pages and has 3 tables in total as shown below.
Let's say that we are interested in the table on Page 2 of this file, which has a list of cars with related attributes. If we can import this table as-is in the Power BI Report, we can use it for different purposes like creating charts / graphs or aggregating / filtering data, etc. So, let's go ahead and start using our PDF connector to source data from this file.
In Power BI, click on Get Data Menu item, and select the PDF File Connector. You will be provided with an informational warning that it's a preview feature. For now, click Continue button.
Now it would prompt you to select the PDF file path. Provide this sample file URL mentioned above as shown below and click on the Open button.
Once this connector processes the file, you would find a navigator screen as shown below. On the left side you will find 5 options. The PDF connector detected all the 3 tables in our sample PDF file, so it provides an option to import specific tables. This is a great option for those who are interested in a specific piece of data from the PDF file. If you analyze carefully, it also tells you which table is on which page. The rest of the two options are at the bottom of the list, provides an option to read the entire page. There is 1 table (item) created per page. Click on Table002 as shown below and you will be able to see the same table that we saw earlier in the PDF file. Click on Load button to add this table to the report.
Once the model is loaded, you will be able to see the fields added to the report as shown below.
Click on all the fields and all these fields will get added to a table as shown below. Also, if you carefully analyze, the connector was able to detect the data types correctly, and it identified the Date field to be of date data type and Power BI automatically dissected various parts of the date in the table. Once the data and fields are in this format, it can be easily used for analytics and reporting.
Let's say we want to read the entire file contents of any given page, follow the steps mentioned above until you reach the navigator screen. Click on Page001 and you should be able to preview the page. All the contents of the page are divided into different fields based on how the connector was able to parse the data. This can be used in the file if the intention is to source the entire content of the page typically to present textual narratives in a report.
In this way, one can easily source data from PDF files without any coding. Another option to source data from PDF files is by using R Scripts, but that would require installation of some external packages on an R server as well as few lines of code. The method presented here is much easier and can be easily used by a non-technical user or someone who does not know R.
- Consider using the Power BI PDF File connector on different types of PDF files having different types of content and analyze the resulting interpretation of the file by the PDF file connector.
About the author
View all my tips