Setting up a complete dbt CI/CD pipeline with Azure DevOps and AWS

Anton Karling
10 min read · Feb 24, 2021


Have you been playing around with dbt (Data Build Tool) and are wondering how it could become part of your data platform and be integrated into your development workflow? If so, this post will hopefully be of help. Below I will explain how you can set up a CI/CD pipeline for dbt with Azure DevOps and run scheduled and monitored jobs on AWS.

This post is not about how to use dbt (data build tool) or why it makes my life as a Data Warehouse developer much more pleasurable. If you are interested in that, I recommend going through the tutorial and finding out for yourself. No, this post is about how you can set up a CI/CD pipeline in Azure DevOps to get the most out of what dbt does so well - testing, releasing, documentation - things that are most often a pain with other data modeling tools. You will also learn how you can run monitored dbt jobs on a schedule.

dbt offers paid plans that take care of everything this post is about to explain, but the tool itself is open source. You mostly interact with dbt using the command line, which allows for a lot of flexibility in the setup and makes it easy to automate most things.

I will describe how we have set up a small-scale, but fully functional, complete workflow for dbt using Azure DevOps and AWS. We run Redshift as the target, but most of this applies to whichever data warehouse platform you are running.

Below is a high-level picture of the CI/CD pipeline that this post will try to explain.

CI/CD pipeline for dbt

The monitoring and scheduling setup, which is explained further down in the post, looks like this:

Schedule and monitoring setup for dbt

Requirements

Before we set up our pipeline we gathered some requirements:

  • It should be easy for developers to work in their own “local” environment
  • It should be possible to run and test dbt on pull requests
  • Master branch merges should only be done from reviewed and tested builds
  • Documentation should always be up to date with the master branch
  • Jobs should run automatically on schedule
  • In case a job fails, we should be notified

What is special about the master branch then? Well, we have a number of database schemas in our Redshift cluster that consumers such as analysts and BI tools use. These consumers do not want to query objects that suddenly change and break because developers are building new things. Models in the most recently released master branch are what the consumers should use, which is why I call them master schemas in this post.

Development

I do not think it is possible to spin up a local Redshift cluster, and even if it were, it would likely be a cumbersome process that required a lot of time, compute and storage - three things developers rarely claim to have enough of. So local in this context means being able to build, run and test models without interfering with other developers or consumers.

What we would like to achieve is that when a developer runs dbt locally, the objects being created are named in a way that does not impact any objects outside of the developer's local project. This is easy to solve by editing ~/.dbt/profiles.yml. In our case each developer has a profile that looks like this:

# profiles.yml
redshift_development:
  outputs:
    developer:
      [password etc.]
      host: sandbox.redshift.com
      dbname: dev
      schema: <developers name>
  target: developer

And the project configuration file, dbt_project.yml, looks like this:

# dbt_project.yml
...
models:
  our_datamart:
    staging:
      materialized: view
      schema: staging

In my case I have set anton as the schema. So when running the model models/staging/dim_dog.sql from my machine, dev.anton_staging.dim_dog is created in Redshift.
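The anton_staging part comes from how dbt builds schema names: by default, a custom schema configured in the project is appended to the target schema from the profile. dbt's built-in generate_schema_name macro does roughly this, and it can be overridden in the project if you want a different convention:

{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ default_schema }}_{{ custom_schema_name }}
    {%- endif -%}
{%- endmacro %}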

Build

Every time someone wants to edit the master branch they need to do it through a pull request. This is a common practice that brings better quality through peer review and the ability to run tests on suggested changes.

It is impossible to verify that things work as intended just by looking at the code. So to validate that a pull request contains sane changes, we want to run dbt run and dbt test on its content. And for the same reason that we have separate development schemas in Redshift, we want separate build schemas. If a build succeeds and other developers approve the changes, we can be fairly confident that no major bugs will be introduced into the master schemas.

The goal here is to name the objects in a way that they do not interfere with any objects developers are working on or with any master schema objects. In order to solve this, let's look at how we handle credentials for our build and release pipelines. Azure DevOps has a feature for securely storing files that can then be used by these pipelines. We store a file there called redshift-profiles.yml that looks like this:

Profile stored as Secure file
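The file itself is only shown as a screenshot, but a sketch of what such a profile could look like is below - the profile name, hosts and the non-ci outputs are placeholders. The important part is the ci output, whose schema is prefixed with the pull request number:

# redshift-profiles.yml (sketch)
our_datamart:
  target: ci
  outputs:
    ci:
      type: redshift
      host: sandbox.redshift.com
      dbname: dev
      schema: "dbt_build_{{ env_var('PR_NUMBER') }}"
      [credentials etc.]
    sandbox:
      type: redshift
      host: sandbox.redshift.com
      dbname: dev
      schema: <base schema>
      [credentials etc.]
    # staging and prod look the same but point at their own clusters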

What should happen when running a build is defined in a file called azure-pipelines.yml in the root of the repository. On build the pipeline agent executes the commands defined in each of the steps.

This is what our azure-pipelines.yml build definition looks like:

Build definition
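Since the definition above is a screenshot, here is a sketch of what it could look like in YAML - task versions, variable names and paths are assumptions based on the steps described below:

trigger:
  - master

steps:
  - task: DownloadSecureFile@1
    name: redshiftProfile
    displayName: Downloading Profile for Redshift
    inputs:
      secureFile: redshift-profiles.yml

  - script: |
      mkdir -p ~/.dbt
      cp "$(redshiftProfile.secureFilePath)" ~/.dbt/profiles.yml
    displayName: Installing Profile for Redshift

  - script: |
      echo "##vso[task.setvariable variable=PR_NUMBER]$(System.PullRequest.PullRequestId)"
      echo "##vso[task.setvariable variable=BRANCH_NAME]$(Build.SourceBranchName)"
    displayName: Setting build environment variables

  - script: bash scripts/dbt-build.sh
    displayName: Run and test dbt

  - task: PublishBuildArtifacts@1
    displayName: Publish artifact - dbt
    inputs:
      pathToPublish: $(Build.SourcesDirectory)
      artifactName: dbt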

The first two steps (Downloading Profile for Redshift and Installing Profile for Redshift) fetch redshift-profiles.yml from the secure file library and copy it into ~/.dbt/profiles.yml.

The third step (Setting build environment variables) picks up the pull request number and the branch name and sets them as environment variables. I will explain why further down.

The fourth step (Run and test dbt) will execute scripts/dbt-build.sh that looks like this:
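The script is shown as a screenshot in the original; here is a sketch of what it could look like, given the behaviour described below (the variable names and the project directory are assumptions):

#!/bin/bash
set -e

# Master branch builds (e.g. merge builds) succeed without invoking dbt
if [[ "$BRANCH_NAME" == "master" ]]; then
  echo "Master branch build - skipping dbt"
  exit 0
fi

# Builds that are neither master nor pull request builds are flagged as partially succeeded
if [[ -z "$PR_NUMBER" ]]; then
  echo "##vso[task.logissue type=warning]Not a pull request build - skipping dbt"
  echo "##vso[task.complete result=SucceededWithIssues;]"
  exit 0
fi

# Pull request builds: run and test against the ci target
cd our-datamart
dbt run --target ci
dbt test --target ci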

Builds based on the master branch, such as merge builds, will not invoke dbt, but the build will succeed. If a build is neither from the master branch nor triggered from a pull request, dbt is not run either, but the build result will be partially succeeded and carry a warning flag.

However, in case the build is based on a pull request, it will trigger dbt run and dbt test with --target ci. As you can see in the profiles.yml, the ci profile has schema: dbt_build_{{ env_var('PR_NUMBER') }}. To give an example, when building on pull request #1066, the environment variable PR_NUMBER=1066 is available for dbt to pick up. And as that pull request contained the model models/staging/fact_dogs.sql, dbt creates the object dbt_build_1066_staging.fact_dogs in the development environment.

Master branch build, pull request build, and non-master/no pull request build

One thing to note here - with this setup no schemas are automatically dropped when pull requests are completed, so after completing one it is a good idea to drop the temporary build schemas manually. This is not really an issue though: it is easy to see which schemas are build schemas, and if you accidentally drop a schema for an active pull request it can be recreated by just triggering a new build.
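A sketch of how the leftover build schemas can be found and removed in Redshift (the schema in the drop statement is just the example pull request from above):

-- List leftover build schemas
select nspname
from pg_namespace
where nspname like 'dbt_build_%';

-- Drop the ones whose pull requests are completed
drop schema dbt_build_1066_staging cascade;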

The fifth step (Publish artifact - dbt) copies the project to the build artifact, which is then published.

Scheduling artifact

There is no magic going on to build and publish the Docker image used for scheduling. We simply run docker build and then publish it the same way the dbt project is published.

Release

After a pull request has been approved, completed and merged, we have an artifact that contains the project directory and a Docker image. Objects in this artifact are then used when deploying to our different environments. The central idea when it comes to releasing is that it should require minimal manual effort and also ensure no bugs are introduced, which is why we run tests as part of the release.

In our case we have three Redshift environments: sandbox, staging, and prod. For each of these stages we have a variable STAGE that is set to either sandbox, staging or prod depending on the stage. On release we simply trigger dbt run and dbt test with --target $(STAGE). And as we are using the same secure file as during the build phase, dbt creates the targets in the respective environments without any prefixes added to the schema names.

The whole thing is not much more complicated than this:

if [[ "$BUILD_SOURCEBRANCHNAME" != "master" ]]; then
echo "Releases are only allowed using master branch artifacts"
exit 1
fi
dbt run --target $(STAGE) --full-refresh
dbt test --target $(STAGE

As I mentioned earlier, we do not want any non-master branch objects to be released to the master schemas. The if statement simply checks whether the artifact comes from the master branch. If not, it will fail and the release will be cancelled.

We also do a full refresh on release; the project isn't big enough (yet) for this to take an unacceptable amount of time. One thing we have considered but not yet implemented is to trigger dbt run and dbt test in temporary integration test schemas on release. This could easily be done by just editing the secure file.

Another step of the release is to push the Docker image to AWS ECR. This happens after the tests have passed; in case they fail, the release is cancelled and the previous, and hopefully functioning, image will be used on the next scheduled run. If the objects need to be fully refreshed to revert any bad changes, the previously released master branch can simply be released again.
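The push itself is just standard Docker and AWS CLI commands - something along these lines, with placeholder account id, region and repository name:

aws ecr get-login-password --region eu-west-1 \
  | docker login --username AWS --password-stdin 123456789012.dkr.ecr.eu-west-1.amazonaws.com

docker tag dbt-scheduler:latest 123456789012.dkr.ecr.eu-west-1.amazonaws.com/dbt-scheduler:latest
docker push 123456789012.dkr.ecr.eu-west-1.amazonaws.com/dbt-scheduler:latest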

Documentation

dbt can generate documentation that can be hosted on an S3 bucket configured for static website hosting. So in the release we simply step into the project directory of the artifact and run dbt docs generate --target $(STAGE). The STAGE variable is needed so dbt can pick up metadata statistics and create correct source documentation. This step looks like this:

cd our-datamart
dbt docs generate --target $(STAGE)
aws s3 cp ./target s3://$(STAGE)-website-bucket.com --recursive

All release tasks are bundled into a task group that is reused in each environment stage. And when it is time to release we just click deploy on the stage and the rest happens automatically.

Ready for staging? Just click the staging button

Scheduling

We have chosen to run dbt on a schedule using Docker and ECS. This is the image that is created during the build phase and pushed to ECR on release:
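The Dockerfile is shown as a screenshot in the original; a sketch of what such an image could look like - the base image, the dbt install and the directory layout are assumptions:

FROM python:3.8-slim

# Install dbt with the Redshift adapter (pin the version you actually use)
RUN pip install --no-cache-dir dbt-redshift

WORKDIR /dbt

# The dbt project itself plus the two files described below
COPY our-datamart/ .
COPY profiles.yml daily-run.sh ./

RUN chmod +x daily-run.sh

ENTRYPOINT ["./daily-run.sh"]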

The two files being copied on docker build are daily-run.sh, which looks like this:

#!/bin/sh
dbt --log-format json run --target scheduler --profiles-dir .
dbt --log-format json test --target scheduler --profiles-dir .

And profiles.yml, which looks like this:

enlight_dwh_developer:
  outputs:
    scheduler:
      type: redshift
      threads: 1
      host: "{{ env_var('HOST') }}"
      port: "{{ env_var('PORT') | int }}"
      user: "{{ env_var('USERNAME') }}"
      pass: "{{ env_var('PASSWORD') }}"
      dbname: "{{ env_var('DATABASE') }}"
      schema: dbt
  target: scheduler

We run the container using the serverless Fargate launch type, which requires minimal configuration. In the task definition you just need to configure memory, CPU, log destination and a few other network-related parameters.

In the task definition we have also defined that Redshift credentials should be read from AWS Secrets Manager and injected as environment variables on launch. So when the task is triggered, the environment variables are populated and then daily-run.sh is executed. The --profiles-dir flag makes dbt look for the profile in the current working directory, and the profile uses the available environment variables to connect to Redshift.
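The relevant part of the container definition could look something like this - the ARNs, names and log settings are placeholders. ECS resolves each secret at launch and exposes it as an environment variable under the given name, which is what the env_var() calls in profiles.yml pick up:

{
  "name": "dbt-scheduler",
  "image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/dbt-scheduler:latest",
  "secrets": [
    { "name": "HOST",     "valueFrom": "arn:aws:secretsmanager:eu-west-1:123456789012:secret:redshift-host" },
    { "name": "PORT",     "valueFrom": "arn:aws:secretsmanager:eu-west-1:123456789012:secret:redshift-port" },
    { "name": "USERNAME", "valueFrom": "arn:aws:secretsmanager:eu-west-1:123456789012:secret:redshift-username" },
    { "name": "PASSWORD", "valueFrom": "arn:aws:secretsmanager:eu-west-1:123456789012:secret:redshift-password" },
    { "name": "DATABASE", "valueFrom": "arn:aws:secretsmanager:eu-west-1:123456789012:secret:redshift-database" }
  ],
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": {
      "awslogs-group": "/ecs/dbt-scheduler",
      "awslogs-region": "eu-west-1",
      "awslogs-stream-prefix": "dbt"
    }
  }
}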

To run the job on a schedule we have simply configured a cron-based rule in AWS EventBridge that triggers the task.
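Creating the rule can be as simple as this (the name and schedule are placeholders); the ECS task is then attached as a target of the rule, which is where the cluster, task definition and network configuration go:

aws events put-rule \
  --name dbt-daily-run \
  --schedule-expression "cron(0 5 * * ? *)"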

Monitoring

We are using Datadog, which is not an AWS product, for monitoring and Microsoft Teams integration. But you could just as well use some kind of CloudWatch and SNS setup. Instead of going through how to work with Datadog, I will briefly explain how to know that dbt failed without having to manually go through the logs of every run.

A dbt log row contains a few different fields, one of which is levelname.

{
  […]
  "levelname": "INFO",
  […]
}

levelname can be either INFO, WARNING, or ERROR. In Datadog we have a Log Query Monitor that looks something like COUNT(levelname:ERROR). If the count > 0, Datadog will post a message to our Teams monitoring channel. We are also notified in case there are no logs at all, as this indicates the job has not even started.

Future work

What I described above is a basic but fully functioning setup. When scaling up there are a few things that could be worth considering.

First of all, the more models and data you have, the slower builds will get. It is a good idea to only run and test added and updated models. dbt has an experimental feature that will do this for you, but as it is currently experimental we instead use some basic git commands to compare branches and build the model selection statement from that.
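The idea is roughly the following - this is a sketch rather than our exact script, and the paths and ci target are assumptions: derive the names of the models changed relative to master and pass them to dbt.

#!/bin/bash
# Models changed compared to the merge base with master
CHANGED_MODELS=$(git diff --name-only origin/master...HEAD -- models \
  | grep '\.sql$' \
  | xargs -r -n1 basename \
  | sed 's/\.sql$//' \
  | tr '\n' ' ')

if [[ -n "$CHANGED_MODELS" ]]; then
  dbt run --models $CHANGED_MODELS --target ci
  dbt test --models $CHANGED_MODELS --target ci
else
  echo "No models changed - skipping dbt"
fi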

Also, this way of scheduling is tightly bound to a single project. The Docker container is big and can only run one project. If you have multiple projects, a saner setup would be to push the projects to S3 and have ECS read the files from there on each run.

Lastly, if you want more flexibility and control over which models are run, it would be a good idea to use a more sophisticated tool for orchestrating the dbt calls. We have peeked a bit at Airflow and AWS Step Functions, but the added complexity would not bring any value to us right now. If we go that way one day I might write a post on that.

Questions or input?

If you want help with things like this - or you think this was stupid and I need help with things like this - drop me a message on LinkedIn.
