Today we will be extracting a financial table from a PDF and loading it into Snowflake using tabula-py for extracting tables, pandas to clean and shape our tables, and Prefect to run this flow for us.
While we all wish we could collect our data from a well-designed API, it isn’t always the case.
PDFs tables are one of those tricky data sources because they are not constructed as row-and-column data representations, so copying or interpreting them as such can be very frustrating. PDF tables can come in all types of shapes and sizes, with or without line separation, multi headers, and other quirks. While frustrating, you are bound to have to do it at some point; many industries like medical, insurance, and finance use them extensively.
Today we will be extracting a financial table and loading it into Snowflake using tabula-py for extracting tables, pandas to clean and shape our tables, and Prefect to run this flow for us.
When extracting data from PDF tables, they can be separated in two ways, by lines and by white spaces. Tabula-py as a result has three modes:
stream mode — made to read data separated by white spaces
lattice mode — for tables that use lines for separation
default mode — infers the mode of extraction
For most cases, the default guess option will be sufficient, but you can run into problems if you don’t specify the mode. Let’s take a look at the following example.
In this sample we have a table that is separated by white space so, as a result, we set the option of stream to true, we set the output format to “dataframe.” By selecting dataframe as the output, we can options from pandas’ read_csv function
. Here we are setting the first row as our header. The resulting dataframe looks like so:
Now let’s clean it and format it in a way we are able to store in Snowflake
As you can see, this data’s tags are unique, so it makes sense to use them as our column names in Snowflake. Let’s strip away all the non-alphanumeric characters, as well as stripping whitespace on either side. We uppercase the strings and replace the spaces between words with an underscore. We then transpose the table so that our formatted tags are now our headers. After that, we clean our new tags/index, and our resulting table looks like this:
Great, now we have clean and formatted tables! Now to our actual data.
Snowflake has requirements for the types of data
it can store. In our case we will be using only two types, VARCHAR and NUMBER.
Let's regex away any non-numeric types and then use pandas
to_numeric functions to get the typing to match our requirements. We also want to convert our formatted tables into CSV using pandas, making sure to specify that we don’t want an index, as Snowflake doesn’t support indices.
Now it’s time to load our table into Snowflake by using its Python connector. Be sure to specify the pandas option when installing the connector.
First, let’s create the engine for our connector. Prefect provides an easy way to manage secrets in the cloud. We store the variable using the cloud UI and access it in our code via the client.
Note that the example uses Python string interpolation for simplicity, but you should use formal SQL templating for production use cases.
Now let’s create our tables in Snowflake, if they aren’t already created. Note that the column names in our data frames and the column name on our Snowflake tables must be the same.
The flow for storing our data does the following:
And there you have it. Now you should have a basic understanding of how to not only extract a table from a PDF using tabula-py, but also how to repeatably do this task while transforming the table to conform to Snowflake’s schema, and loading the table using Snowflake’s Python connector.