MLOps: Data Pipeline Orchestration
Part 1 of Dataform 101: fundamentals of a single-repo, multi-environment Dataform setup with least-privilege access control and infrastructure as code
Dataform is a relatively new service integrated into the GCP suite that enables teams to develop and operationalise complex, SQL-based data pipelines. Dataform brings software engineering best practices such as testing, environments, version control, dependency management, orchestration and automated documentation to data pipelines. It is a serverless SQL workflow orchestration workhorse within GCP. Typically, as shown in the image above, Dataform takes raw data, transforms it with all of those engineering best practices applied, and outputs properly structured data ready for consumption.
The inspiration for this post came while I was migrating one of our projects' legacy Dataform from the web UI to GCP BigQuery. During the migration, I found terms such as release configuration, workflow configuration, and development workspace confusing and hard to wrap my head around. That served as the motivation to write a post explaining some of the new terminology used in GCP Dataform. In addition, I touch on the basic flow underlying single-repo, multi-environment Dataform operations in GCP. There are multiple ways to set up Dataform, so be sure to check out the best practices from Google.
This is Part 1 of a two-part series covering Dataform fundamentals and setup. In Part 2, I will provide a walkthrough of the Terraform setup, showing how to implement least-privilege access control when provisioning Dataform. If you want a sneak peek at that, check out the repo.
Working in Dataform is akin to the GitHub workflow. I will draw out the similarities between the two and use analogies to make it easier to understand. It is easiest to think of Dataform as a local Git repository. When Dataform is set up, it asks for a remote repository to be configured, similar to how a local Git repository is paired with a remote origin. With this scenario in mind, let's quickly go through some Dataform terminology.
Development Workspaces
This is analogous to a local Git branch. Similar to how a branch is created from the main branch, a new Dataform development workspace checks out an editable copy of the main Dataform repo code. Development workspaces are independent of each other, much like Git branches. Code development and experimentation take place within the development workspace, and when the code is committed and pushed, a remote branch with the same name as the development workspace is created. It is worth mentioning that the branch from which the editable code is checked out into a development workspace is configurable: it can be the main branch or any other branch in the remote repo.
Release Configuration
Dataform uses a mixture of .sqlx scripts and JavaScript .js files for data transformations and logic. Because of this, it first compiles the codebase into a standard, reproducible pipeline representation, ensuring that the scripts can be materialised into data. A release configuration is the automated process by which this compilation takes place. At the configured time, Dataform checks out the code from the remote main branch (this is configurable and can be changed to target any remote branch) and compiles it into a JSON config file. The process of checking out the code and generating the compilation is what the release configuration covers.
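As a rough sketch, a release configuration can be declared with Terraform's google-beta provider. The project, region, repository and schedule below are hypothetical placeholders, not values from this post:

```hcl
# Sketch: compile the main branch every day at 02:00 UTC.
# Assumes a Dataform repository named "analytics-pipelines" already exists
# in the (hypothetical) staging project.
resource "google_dataform_repository_release_config" "prod" {
  provider   = google-beta
  project    = "my-staging-project"
  region     = "europe-west2"
  repository = "analytics-pipelines"

  name          = "prod-release"
  git_commitish = "main"      # branch (or tag/commit) to check out and compile
  cron_schedule = "0 2 * * *"
  time_zone     = "UTC"

  code_compilation_config {
    # BigQuery project and dataset the compiled pipeline writes to.
    default_database = "my-production-project"
    default_schema   = "dataform"
  }
}
```

The git_commitish field is what makes the compiled branch configurable, as described above.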
Workflow Configuration
The output of the release configuration is a .json config file. The workflow configuration determines when to run that compiled file, what identity should run it, and which environment the output should be manifested, or written, to.
Since the workflow configuration needs the output of the release configuration, it should be scheduled to run after the release configuration. The reason is that the release configuration must first authenticate to the remote repo (which occasionally fails), check out the code, and compile it. These steps happen in seconds but can take longer if the network connection fails. If both are scheduled at the same time, the workflow configuration may pick up the previous compilation, meaning the latest changes are not reflected in the BigQuery tables until the next workflow configuration run.
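The scheduling offset can be expressed directly in Terraform. This is a hedged sketch with hypothetical names and a hypothetical release-config path; the only point it illustrates is the later cron schedule:

```hcl
# Sketch: run the compiled pipeline 30 minutes after the release
# configuration (assumed to compile at 02:00 UTC), so the latest
# compilation is always the one executed.
resource "google_dataform_repository_workflow_config" "prod" {
  provider   = google-beta
  project    = "my-staging-project"
  region     = "europe-west2"
  repository = "analytics-pipelines"

  name           = "prod-workflow"
  release_config = "projects/my-staging-project/locations/europe-west2/repositories/analytics-pipelines/releaseConfigs/prod-release"
  cron_schedule  = "30 2 * * *"   # 30 minutes after the 02:00 compilation
  time_zone      = "UTC"
}
```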
Environments
One of the features of Dataform is the ability to manifest code into different environments such as development, staging and production. Working with multiple environments raises the question of how Dataform should be set up: should repositories be created in multiple environments, or in just one? Google discusses some of these tradeoffs in the Dataform best practices section. This post demonstrates setting up Dataform for staging and production environments, with data materialised into both environments from a single repo.
The environments are set up as GCP projects, each with its own custom service account. Dataform itself is created only in the staging environment/project, because we will be making lots of changes and it is better to experiment in the staging (non-production) environment. Staging is also chosen as the environment in which development code is manifested: datasets and tables generated from a development workspace are created in the staging environment.
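A minimal sketch of the per-environment service accounts, assuming hypothetical project IDs and account names:

```hcl
# Sketch: one custom service account per environment project.
resource "google_service_account" "staging" {
  project      = "my-staging-project"
  account_id   = "dataform-staging"
  display_name = "Dataform staging runner"
}

resource "google_service_account" "production" {
  project      = "my-production-project"
  account_id   = "dataform-production"
  display_name = "Dataform production runner"
}

# Each account gets only the BigQuery access it needs in its own project
# (least privilege); shown here for staging only.
resource "google_project_iam_member" "staging_bq_writer" {
  project = "my-staging-project"
  role    = "roles/bigquery.dataEditor"
  member  = "serviceAccount:${google_service_account.staging.email}"
}
```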
Once development is done, the code is committed and pushed to the remote repository. From there, a PR can be raised and merged into the main branch after review. During the scheduled workflow, both the release and workflow configurations are executed. Dataform is configured to compile code from the main branch and execute it in the production environment. As such, only reviewed code goes to production, and any development code stays in the staging environment.
In summary, from the Dataform architecture flow above, code developed in the development workspaces is either manifested in the staging environment or pushed to the remote GitHub repo, where it is peer reviewed and merged into the main branch. The release configuration compiles code from the main branch, while the workflow configuration takes the compiled code and manifests its data in the production environment. As such, only reviewed code on the GitHub main branch is manifested in the production environment.
Authentication for Dataform can be complex and challenging, especially when setting up for multiple environments. I will use the example of staging and production environments to explain how it is done. Let's break down where authentication is required and how it is handled.
The figure above shows a simple Dataform workflow that we can use to track where authentication is required and for which resources. The flow chronicles what happens when Dataform runs in the development workspace and on schedule (release and workflow configurations).
Machine User
Let's talk about machine users. Dataform requires credentials to access GitHub when checking out code stored in a remote repository. It is possible to use individual credentials, but the best practice in an organisation is to use a machine user. This ensures that the Dataform pipeline orchestration is independent of individual identities and will not be impacted by their departure. Setting up a machine user means creating a GitHub account with an identity not tied to a person, as detailed here. For Dataform, a personal access token (PAT) is generated for the machine user account and stored as a secret in GCP Secret Manager. The machine user should also be added as an outside collaborator on the Dataform remote repository with read and write access. We will see how Dataform is configured to access the secret later in the Terraform code. If you decide to use your own identity instead of a machine user, a token should be generated as detailed here.
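The PAT storage and the Git remote wiring might look like the following sketch. The project, region, repository name and GitHub URL are hypothetical, and the versions/latest suffix is one plausible way to reference the secret version:

```hcl
# Sketch: store the machine user's PAT in Secret Manager.
resource "google_secret_manager_secret" "github_pat" {
  project   = "my-staging-project"
  secret_id = "dataform-github-pat"
  replication {
    auto {}
  }
}

# Sketch: point the Dataform repository at the remote GitHub repo,
# authenticating with the stored PAT.
resource "google_dataform_repository" "repo" {
  provider = google-beta
  project  = "my-staging-project"
  region   = "europe-west2"
  name     = "analytics-pipelines"

  git_remote_settings {
    url            = "https://github.com/my-org/analytics-pipelines.git"
    default_branch = "main"
    authentication_token_secret_version = "projects/my-staging-project/secrets/dataform-github-pat/versions/latest"
  }
}
```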
GitHub Authentication Flow
Dataform uses its default service account for its operations, so whenever a Dataform action is performed, it starts with the default service account. I assume you have set up a machine user, added the user as a collaborator on the remote repository, and added the user's PAT as a secret in GCP Secret Manager. To authenticate to GitHub, the default service account needs to read the secret from Secret Manager, which requires the secretAccessor role. Once the secret is accessed, the default service account can impersonate the machine user, and since the machine user is a collaborator on the remote Git repo, the default service account gains access to the remote GitHub repository as a collaborator. The flow is shown in the GitHub authentication workflow figure.
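That secretAccessor grant can be sketched as a single IAM binding. The secret name and project number below are placeholders; the service-PROJECT_NUMBER@gcp-sa-dataform address is the general form of the Dataform default service account:

```hcl
# Sketch: let the Dataform default service account read the PAT secret.
# Replace 123456789012 with the project number of the project hosting Dataform.
resource "google_secret_manager_secret_iam_member" "dataform_reads_pat" {
  project   = "my-staging-project"
  secret_id = "dataform-github-pat"
  role      = "roles/secretmanager.secretAccessor"
  member    = "serviceAccount:service-123456789012@gcp-sa-dataform.iam.gserviceaccount.com"
}
```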
Development Workspace Authentication
When execution is triggered from a development workspace, the default service account assumes the staging environment's custom service account to manifest the output in the staging environment. To impersonate the staging custom service account, the default service account requires the iam.serviceAccountTokenCreator role on that account. This allows the default service account to create a short-lived token for the staging custom service account, similar to the PAT used to impersonate the machine user, and thereby impersonate it. The staging custom service account is granted all the permissions required to write to the BigQuery tables, and the default service account inherits these permissions when impersonating it.
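The impersonation grant is again a single IAM binding, sketched here with hypothetical project and account names:

```hcl
# Sketch: allow the Dataform default service account to mint short-lived
# tokens for (i.e. impersonate) the staging custom service account.
# Replace 123456789012 with the project number of the project hosting Dataform.
resource "google_service_account_iam_member" "impersonate_staging" {
  service_account_id = "projects/my-staging-project/serviceAccounts/dataform-staging@my-staging-project.iam.gserviceaccount.com"
  role               = "roles/iam.serviceAccountTokenCreator"
  member             = "serviceAccount:service-123456789012@gcp-sa-dataform.iam.gserviceaccount.com"
}
```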
Workflow Configuration Authentication
After checking out the repo, the release configuration generates a compiled .json config file, from which the workflow configuration generates data. To write the data to production BigQuery tables, the default service account requires the iam.serviceAccountTokenCreator role on the production custom service account. Similar to the staging setup, the production service account is granted all the permissions required to write to the production environment's BigQuery tables, and the default service account inherits those permissions when impersonating it.
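The production side mirrors the staging grant; the sketch below uses the same hypothetical naming, and the trailing comment shows one way a workflow configuration can be told to execute as the production account:

```hcl
# Sketch: the same token-creator grant, on the production custom account.
# Replace 123456789012 with the project number of the project hosting Dataform.
resource "google_service_account_iam_member" "impersonate_production" {
  service_account_id = "projects/my-production-project/serviceAccounts/dataform-production@my-production-project.iam.gserviceaccount.com"
  role               = "roles/iam.serviceAccountTokenCreator"
  member             = "serviceAccount:service-123456789012@gcp-sa-dataform.iam.gserviceaccount.com"
}

# The workflow configuration can then run as the production account via
# its invocation_config block, e.g.:
#   invocation_config {
#     service_account = "dataform-production@my-production-project.iam.gserviceaccount.com"
#   }
```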
Summary
In summary, the default service account is the main protagonist. It impersonates the machine user to authenticate to GitHub as a collaborator, using the machine user's PAT. It also authenticates to the staging and production environments by impersonating their respective custom service accounts, using short-lived tokens generated via the serviceAccountTokenCreator role. With this understanding, it is time to provision Dataform in GCP using Terraform. Look out for Part 2 of this post for that, or check out the repo for the code.
Image credit: all images in this post were created by the author.