How to Automate dbt Exposures Creation via Crawlers and Github Actions

How to Automate dbt Exposures Creation via Crawlers and Github Actions

This article demonstrates dbt exposures and how you can build them automatically in your CI pipeline using Github Actions.

James Sopkin | Solutions Engineer

April 19, 2022

When you work with multiple dbt projects, it’s often hard to understand the dependencies between models across various projects. To solve this problem, dbt introduced a feature called dbt exposures. There is a PyPI project that automates the creation of dbt exposures. This article demonstrates what are dbt exposures and how you can build them automatically in your CI pipeline using Github Actions.
So what is a dbt exposure and why is it useful?
dbt is a tool typically used to transform data in ELT/ETL processes.
A dbt exposure is a type of dbt object that can be included in the documentation. It describes downstream use cases for dbt projects. It allows you to see how data is used in your dashboards and makes it significantly easier for analysts to determine the impact that changes to models will have on downstream reports.
The above image is a DAG (directed acyclic graph) or “lineage graph” that shows how data flows from original data sources to models to exposure. As you can see, on the right side of the image (highlighted in orange) is the dbt exposure. Everything to the left is an upstream dependency.
Why should you use the dbt-exposure-crawler?
The creation of dbt exposures may be tedious. This package automates this process. Here is what the dbt-exposure-crawler does:
  • looks at what projects exist in dbt,
  • queries workbooks from Tableau,
  • matches dbt projects to those workbooks,
  • it then creates dbt exposures.
In the following sections, we’ll look at how the dbt exposure works.
Before the crawler can determine which Tableau workbooks have upstream dbt projects, we need a manifest file. The manifest contains details about dbt resources such as nodes, metadata, and exposures. A manifest is generated after running dbt docs generate and its contents are what will be used to populate the dbt docs site including the lineage graph above.
Tableau Crawler
The Tableau crawler queries workbooks from the Tableau server through the metadata GraphQL API. The two queries below will retrieve both native and custom SQL workbooks with embedded data sources. Additionally, the native workbooks query has a filter for Snowflake tables only, but this can be changed to other data sources as well. The custom SQL workbooks will be filtered for Snowflake tables later on.
Query Parsing
Once the dbt models and the Tableau workbooks are all gathered nicely, it’s time to match them up. Tableau workbooks that have upstream dbt projects will be included in the list of exposures to be added to the manifest.
Github Action
Now that we’ve discussed different parts of the project, let’s discuss how this would all be integrated with a GitHub action.
  1. Generate a manifest file, typically with dbt docs generate
  2. Install dependencies
  3. Run the crawler and point it to that manifest file
The workflow would look as follows:
Finally, the dbt exposures will be generated using the manifest located at target/manifest.json. Those exposures will be added back to the manifest, and then that manifest will be used for the dbt documentation reflecting the newly added exposures.This right here is a twofold convenience. The dbt exposures crawler already generates the exposures and adds them back to the manifest, saving the burden of manually defining exposures in a YAML file. GitHub actions are a powerful CI/CD tool that automates the entire process. Whenever code is pushed to the specified branch, the dbt exposures crawler is triggered automatically. Additionally, a cron schedule may also be configured.
This article discussed how you may automate the generation of dbt exposures. Following the approach demonstrated in this post, data analysts can leverage the dbt exposures without anyone ever having to write out exposures manually. New exposures are found every time the workflow is executed, and the entire process is automated and seamless.
Now that's what you call working smarter, not harder!
Happy Engineering!