ETL stands for Extract, Transform, and Load. An ETL pipeline is essentially just a data transformation process: extracting data from one place, doing something with it, and then loading it back to the same or a different place.
If you are working with natural language processing via APIs, which I'm guessing most of you will start doing, you can easily hit the timeout threshold of AWS Lambda when processing your data, especially if any single function needs to run for more than 15 minutes. So, while Lambda is great because it's quick and really cheap, the timeout can be a hassle.
The alternative here is to deploy your code as a container that has the option of running as long as it needs to, and run it on a schedule. So, instead of spinning up a function as you do with Lambda, we can spin up a container to run in an ECS cluster using Fargate.
For clarification, Lambda, ECS and EventBridge are all AWS services.
Just as with Lambda, the cost of running a container for an hour or two is minimal. However, it is a bit more complicated than running a serverless function. But if you're reading this, then you've probably run into the same issues and are wondering what the easiest way to transition is.
I've created a very simple ETL template that uses Google BigQuery to extract and load data. This template gets you up and running within a few minutes if you follow along.
Using BigQuery is completely optional, but I usually store my long-term data there.
Instead of building something complex here, I'll show you how to build something minimal and keep it really lean.
If you don't need to process data in parallel, you shouldn't need to include something like Airflow. I've seen a few articles out there that unnecessarily set up complex workflows, which aren't strictly necessary for simple data transformation.
Besides, if you feel like you want to add on to this later, that option is yours.
Workflow
We'll build our script in Python, since we're doing data transformation, then package it up with Docker and push it to an ECR repository.
From there, we can create a task definition using AWS Fargate and run it on a schedule in an ECS cluster.
Don't worry if this feels foreign; you'll understand all these services and what they do as we go along.
Technology
If you are new to working with containers, think of ECS (Elastic Container Service) as something that helps us set up an environment where we can run one or more containers simultaneously.
Fargate, on the other hand, helps us simplify the management and setup of the containers themselves using Docker images, which are referred to as tasks in AWS.
There is the option of using EC2 to set up your containers, but you would have to do a lot more manual work. Fargate manages the underlying instances for us, whereas with EC2 you are required to manage and deploy your own compute instances. Hence, Fargate is often called the 'serverless' option.
I found a thread on Reddit discussing this, if you're keen to read a bit about how users compare EC2 and Fargate.
Not that I'm saying Reddit is the source of truth, but it's useful for getting a sense of user perspectives.
Costs
The primary concern I usually have is keeping the code running efficiently while also managing the total cost.
As we're only running the container when we need to, we only pay for the resources we use. The price is determined by several factors, such as the number of tasks running, the execution duration of each task, the number of virtual CPUs (vCPUs) used for the task, and memory usage.
But to give you a rough idea, at a high level, the total cost of running one task is around $0.01384 per hour for the EU region, depending on the resources you've provisioned.
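To make that arithmetic concrete, here is a rough sketch of how the Fargate bill is structured: you pay per vCPU-hour and per GB-hour, for each task you run. The two rates below are placeholders rather than real prices; look them up for your region on the Fargate pricing page.

# Rough Fargate cost sketch. The two rates are placeholders, not real prices.
VCPU_HOUR_RATE = 0.0  # $ per vCPU per hour in your region (placeholder)
GB_HOUR_RATE = 0.0    # $ per GB of memory per hour in your region (placeholder)

def fargate_cost(tasks, hours_per_task, vcpus, memory_gb):
    # Cost scales with how many tasks run, for how long, and how big they are.
    return tasks * hours_per_task * (vcpus * VCPU_HOUR_RATE + memory_gb * GB_HOUR_RATE)

# Example: one task running for one hour with 0.25 vCPU and 0.5 GB of memory.
print(fargate_cost(tasks=1, hours_per_task=1, vcpus=0.25, memory_gb=0.5))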
If we compare this price with AWS Glue, we can get some perspective on whether it is good or not.
If an ETL job requires 4 DPUs (the default number for an AWS Glue job) and runs for an hour, it would cost 4 DPUs * $0.44 = $1.76. This cost is for just one hour and is significantly higher than running a simple container.
This is, of course, a simplified calculation, and the actual number of DPUs can vary depending on the job. You can check out AWS Glue pricing in more detail on their pricing page.
For long-running scripts, setting up your own container and deploying it on ECS with Fargate makes sense, both in terms of efficiency and cost.
To follow along with this article, I've created a simple ETL template to help you get up and running quickly.
This template uses BigQuery to extract and load data. It will extract a few rows, do something simple with them and then load them back to BigQuery.
When I run my pipelines, I have other things that transform data (I use APIs for natural language processing that run for a few hours in the morning), but that's up to you to add on later. This is just to give you a template that will be easy to work with.
To follow along with this tutorial, the main steps will be as follows:
- Setting up your local code.
- Setting up an IAM user & the AWS CLI.
- Building & pushing a Docker image to AWS.
- Creating an ECS task definition.
- Creating an ECS cluster.
- Scheduling your tasks.
In total, it shouldn't take you longer than 20 minutes to get through this, using the code I'll provide. This assumes you have an AWS account ready; if not, add on 5 to 10 minutes.
The Code
First, create a new folder locally and navigate into it.
mkdir etl-pipelines
cd etl-pipelines
Make sure you have Python installed.
python --version
If not, install it locally.
Once you're ready, go ahead and clone the template I've already set up.
git clone https://github.com/ilsilfverskiold/etl-pipeline-fargate.git
When it has finished fetching the code, open it up in your code editor.
First check the main.py file to see how I've structured the code and understand what it does.
Essentially, it will fetch all names containing "Doe" from a table in BigQuery that you specify, transform those names and then insert them back into the same data table as new rows.
You can go into each helper function to see how we set up the SQL query job, transform the data and then insert it back into the BigQuery table.
The idea is of course that you set up something more complex; this is a simple test run to make it easy to tweak the code.
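If you just want the gist without opening the repo, a condensed sketch of that flow might look roughly like the code below. The table ID and the transformation are placeholders I've assumed for illustration; main.py in the template is the source of truth.

# Condensed sketch of the extract-transform-load flow described above.
from google.cloud import bigquery

TABLE_ID = "your-project.your_dataset.your_table"  # placeholder, set your own

client = bigquery.Client.from_service_account_json("google_credentials.json")

# Extract: fetch rows whose name contains "Doe".
query = f'SELECT name FROM `{TABLE_ID}` WHERE name LIKE "%Doe%"'
rows = [dict(row) for row in client.query(query).result()]

# Transform: do something simple with each name.
transformed = [{"name": row["name"].upper()} for row in rows]

# Load: insert the transformed names back into the same table as new rows.
errors = client.insert_rows_json(TABLE_ID, transformed)
print("New rows have been added" if not errors else errors)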
Setting Up BigQuery
If you want to continue with the code I've prepared, you will need to set up a few things in BigQuery. Otherwise you can skip this part.
Here are the things you will need (a small setup sketch follows the list):
- A BigQuery table with a field called 'name' as a string.
- A couple of rows in the data table with the name "Doe" in them.
- A service account that has access to this dataset.
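If you'd rather do this setup in code than click around the BigQuery console, a minimal sketch using the same google-cloud-bigquery client could look like this. The project, dataset and table names are placeholders, and it assumes you already have a credentials file locally (the service account key covered in the next steps).

# Minimal setup sketch: a table with a "name" STRING field plus a couple of "Doe" rows.
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json("google_credentials.json")

table_id = "your-project.your_dataset.names"  # placeholder, set your own
schema = [bigquery.SchemaField("name", "STRING")]
client.create_table(bigquery.Table(table_id, schema=schema), exists_ok=True)

errors = client.insert_rows_json(table_id, [{"name": "John Doe"}, {"name": "Jane Doe"}])
print(errors or "Test rows inserted")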
To get a service account, you will need to navigate to IAM in the Google Cloud Console and then to Service Accounts.
Once there, create a new service account.
Once it has been created, you will need to give your service account BigQuery User access globally via IAM.
You will also need to give this service account access to the dataset itself, which you do directly in BigQuery via the dataset's Share button and then by pressing Add Principal.
After you've given the user the appropriate permissions, make sure you go back to Service Accounts and download a key. This will give you a JSON file that you need to put in your root folder.
Now, the most important part is making sure the code has access to the Google credentials and is using the correct data table.
You'll need the JSON file you've downloaded with the Google credentials in your root folder as google_credentials.json, and then you want to specify the correct table ID.
Now you might argue that you don't want to store your credentials locally, which is only right.
You can add the option of storing the JSON file in AWS Secrets Manager later. However, to start, this will be easier.
Run the ETL Pipeline Locally
We'll run this code locally first, just so we can see that it works.
So, set up a Python virtual environment and activate it.
python -m venv etl-env
source etl-env/bin/activate # On Windows use `etl-env\Scripts\activate`
Then install the dependencies. We only have google-cloud-bigquery in there, but ideally you'll have more dependencies in your own pipeline.
pip install -r requirements.txt
Run the main script.
python main.py
This should log 'New rows have been added' in your terminal. This confirms that the code works as intended.
The Docker Image
Now, to push this code to ECS, we need to package it up into a Docker image, which means you'll need Docker installed locally.
If you don't have Docker installed, you can download it here.
Docker helps us package an application and its dependencies into an image, which can be easily recognized and run on any system. To use ECS, we are required to package our code into Docker images, which are then referenced by a task definition to run as containers.
I've already set up a Dockerfile in the folder. You should be able to look into it there.
FROM --platform=linux/amd64 python:3.11-slim

WORKDIR /app
COPY . /app
RUN pip install --no-cache-dir -r requirements.txt
CMD ["python", "main.py"]
As you can see, I've kept this really lean, as we're not connecting web traffic to any ports here.
We're specifying AMD64, which you may not need if you're not on a Mac with an M1 chip, but it shouldn't hurt. This tells AWS the architecture of the Docker image so we don't run into compatibility issues.
Create an IAM User
When working with AWS, access needs to be specified. Most of the issues you'll run into are permission issues. We'll be working with the CLI locally, and for this to work we'll have to create an IAM user that will need quite broad permissions.
Go to the AWS console and navigate to IAM. Create a new user, add permissions and then create a new policy to attach to it.
I've specified the permissions needed for this code in the aws_iam_user.json file. You'll see a short snippet below of what this JSON file looks like.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "iam:CreateRole",
                "iam:AttachRolePolicy",
                "iam:PutRolePolicy",
                "ecs:DescribeTaskDefinition",
                ...more
            ],
            "Resource": "*"
        }
    ]
}
You'll need to go into this file to get all the permissions you need to set; this is just a short snippet. I've set it to quite a few, which you may want to tweak to your own preferences later.
Once you've created the IAM user and added the correct permissions to it, you will need to generate an access key. Choose 'Command Line Interface (CLI)' when asked about your use case.
Download the credentials. We'll use these to authenticate in a bit.
Set Up the AWS CLI
Next, we'll connect our terminal to our AWS account.
If you don't have the CLI set up yet, you can follow the instructions here. It's very easy to set up.
Once you've installed the AWS CLI, you'll need to authenticate with the IAM user we just created.
aws configure
Use the credentials we downloaded from the IAM user in the previous step.
Create an ECR Repository
Now we can get started with the DevOps of it all.
We first need to create a repository in Elastic Container Registry. ECR is where we can store and manage our Docker images. We'll be able to reference these images from ECR when we set up our task definitions.
To create a new ECR repository, run this command in your terminal. It will create a repository called bigquery-etl-pipeline.
aws ecr create-repository --repository-name bigquery-etl-pipeline
Note the repository URI you get back.
From here, we have to build the Docker image and then push it to this repository.
To do this, you can technically go into the AWS console and find the ECR repository we just created. There, AWS will show you the full push commands you need to run to authenticate, build and push your Docker image to this ECR repository.
However, if you're on a Mac, I would advise you to specify the architecture when building the Docker image, or you may run into issues.
If you're following along with me, start by authenticating your Docker client like so.
aws ecr get-login-password --region YOUR_REGION | docker login --username AWS --password-stdin YOUR_ACCOUNT_ID.dkr.ecr.YOUR_REGION.amazonaws.com
Remember to change the values for region and account ID where applicable.
Build the Docker image.
docker buildx build --platform=linux/amd64 -t bigquery-etl-pipeline .
This is where I've tweaked the command to specify the linux/amd64 architecture.
Tag the Docker image.
docker tag bigquery-etl-pipeline:latest YOUR_ACCOUNT_ID.dkr.ecr.YOUR_REGION.amazonaws.com/bigquery-etl-pipeline:latest
Push the Docker image.
docker push YOUR_ACCOUNT_ID.dkr.ecr.YOUR_REGION.amazonaws.com/bigquery-etl-pipeline:latest
If everything worked as planned, you'll see something like this in your terminal.
9f691c4f0216: Pushed
ca0189907a60: Pushed
687f796c98d5: Pushed
6beef49679a3: Pushed
b0dce122021b: Pushed
4de04bd13c4a: Pushed
cf9b23ff5651: Pushed
644fed2a3898: Pushed
Now that we've pushed the Docker image to an ECR repository, we can use it to set up our task definition using Fargate.
If you run into EOF issues here, it's most likely related to IAM permissions. Make sure you give the user everything it needs, in this case full access to ECR to tag and push the image.
Roles & Log Groups
Remember what I told you before: the biggest issues you'll run into in AWS relate to roles between different services.
For this to flow neatly, we'll need to set up a few things before we start creating a task definition and an ECS cluster.
To do this, we first have to create a task role (this is the role that will need access to services in the AWS ecosystem from our container) and then the execution role (so the container will be able to pull the Docker image from ECR).
aws iam create-role --role-name etl-pipeline-task-role --assume-role-policy-document file://ecs-tasks-trust-policy.json
aws iam create-role --role-name etl-pipeline-execution-role --assume-role-policy-document file://ecs-tasks-trust-policy.json
I've specified a JSON file called ecs-tasks-trust-policy.json in your local folder that these commands use to create the roles.
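For reference, a trust policy that lets ECS tasks assume a role generally looks like the snippet below; the file in the repo should be close to this, but check ecs-tasks-trust-policy.json itself to be sure.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": { "Service": "ecs-tasks.amazonaws.com" },
            "Action": "sts:AssumeRole"
        }
    ]
}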
The script we're pushing won't need permission to access other AWS services, so for now there is no need to attach policies to the task role. Still, you may want to do this later.
For the execution role, however, we will need to give it ECR access so it can pull the Docker image.
To attach the AmazonECSTaskExecutionRolePolicy policy to the execution role, run this command.
aws iam attach-role-policy --role-name etl-pipeline-execution-role --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
We also create one last role while we're at it, a service-linked role.
aws iam create-service-linked-role --aws-service-name ecs.amazonaws.com
If you don't create the service-linked role, you may end up with an error such as 'Unable to assume the service linked role. Please verify that the ECS service linked role exists' when you try to run a task.
The last thing we create is a log group. Creating a log group is essential for capturing and accessing the logs generated by your container.
To create a log group, run this command.
aws logs create-log-group --log-group-name /ecs/etl-pipeline-logs
Once you've created the execution role, the task role, the service-linked role and the log group, we can move on to setting up the ECS task definition.
Create an ECS Task Definition
A task definition is a blueprint for your tasks, specifying what container image to use, how much CPU and memory is required, and other configurations. We use this blueprint to run tasks in our ECS cluster.
I've already set up the task definition in the code at task-definition.json. However, you need to set your account ID as well as your region in there to make sure it runs as it should.
{
    "family": "my-etl-task",
    "taskRoleArn": "arn:aws:iam::ACCOUNT_ID:role/etl-pipeline-task-role",
    "executionRoleArn": "arn:aws:iam::ACCOUNT_ID:role/etl-pipeline-execution-role",
    "networkMode": "awsvpc",
    "containerDefinitions": [
        {
            "name": "my-etl-container",
            "image": "ACCOUNT_ID.dkr.ecr.REGION.amazonaws.com/bigquery-etl-pipeline:latest",
            "cpu": 256,
            "memory": 512,
            "essential": true,
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/ecs/etl-pipeline-logs",
                    "awslogs-region": "REGION",
                    "awslogs-stream-prefix": "ecs"
                }
            }
        }
    ],
    "requiresCompatibilities": ["FARGATE"],
    "cpu": "256",
    "memory": "512"
}
Remember the URI we got back when we created the ECR repository? This is where we use it. Remember the execution role, the task role and the log group? We use them here as well.
If you've named the ECR repository, the roles and the log group exactly what I named mine, then you can simply change the account ID and region in this JSON; otherwise make sure the URI is the correct one.
You can also set the CPU and memory here for what you need to run your task, i.e. your code. I've set 0.25 vCPU and 512 MB of memory.
Once you're happy, you can register the task definition from your terminal.
aws ecs register-task-definition --cli-input-json file://task-definition.json
Now you should be able to go into Amazon Elastic Container Service and find the task we've created under Task Definitions.
This task, i.e. the blueprint, won't run on its own; we need to invoke it later.
Create an ECS Cluster
An ECS cluster serves as a logical grouping of tasks or services. You specify this cluster when running tasks or creating services.
To create a cluster via the CLI, run this command.
aws ecs create-cluster --cluster-name etl-pipeline-cluster
Once you run this command, you'll be able to see this cluster in ECS in your AWS console.
We'll use the task definition we just created in this cluster when we run it in the next part.
Run the Task
Before we can run the task, we need to get hold of the subnets that are available to us along with a security group ID.
We can do this directly in the terminal via the CLI.
Run this command in the terminal to get the available subnets.
aws ec2 describe-subnets
You'll get back an array of objects here, and you're looking for the SubnetId of each object.
If you run into issues here, make sure your IAM user has the appropriate permissions. See the aws_iam_user.json file in your root folder for the permissions the IAM user connected to the CLI will need. I'll stress this, because permissions are the main issue I always run into.
To get the security group ID, you can run this command.
aws ec2 describe-security-groups
You're looking for the GroupId here in the terminal.
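If you don't feel like scanning the raw JSON output, the CLI's --query flag (which takes a JMESPath expression) can pull out just the IDs for you:

aws ec2 describe-subnets --query "Subnets[].SubnetId" --output text
aws ec2 describe-security-groups --query "SecurityGroups[].GroupId" --output text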
Once you have at least one SubnetId and a GroupId for a security group, we're ready to run the task to test that the blueprint, i.e. the task definition, works.
aws ecs run-task \
  --cluster etl-pipeline-cluster \
  --launch-type FARGATE \
  --task-definition my-etl-task \
  --count 1 \
  --network-configuration "awsvpcConfiguration={subnets=[SUBNET_ID],securityGroups=[SECURITY_GROUP_ID],assignPublicIp=ENABLED}"
Do remember to change the names if you've named your cluster and task definition differently. Remember to also set your subnet ID and security group ID.
Now you can navigate to the AWS console to see the task running.
If you're having issues, you can look into the logs.
If successful, you should see a few transformed rows added to BigQuery.
EventBridge Schedule
Now we've managed to set up the task to run in an ECS cluster. But what we're interested in is making it run on a schedule. This is where EventBridge comes in.
EventBridge will set up our scheduled events, and we can set this up using the CLI as well. However, before we set up the schedule, we first need to create a new role.
This is life when working with AWS: everything needs permission to interact with everything else.
In this case, EventBridge needs permission to call the ECS cluster on our behalf.
In the repository you have a file called trust-policy-for-eventbridge.json that I've already put there; we'll use this file to create the EventBridge role.
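Again for reference, the trust policy for EventBridge is presumably along these lines, with events.amazonaws.com as the principal allowed to assume the role; check the actual file in the repo.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": { "Service": "events.amazonaws.com" },
            "Action": "sts:AssumeRole"
        }
    ]
}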
Paste this into the terminal and run it.
aws iam create-role \
  --role-name ecsEventsRole \
  --assume-role-policy-document file://trust-policy-for-eventbridge.json
We then need to attach a policy to this role.
aws iam attach-role-policy \
  --role-name ecsEventsRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonECS_FullAccess
It needs at least ecs:RunTask, but we've given it full access here. If you want to limit the permissions, you can create a custom policy with just the required permissions instead.
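A tighter custom policy might look something like the sketch below: ecs:RunTask on the task definition, plus iam:PassRole so EventBridge can hand the task and execution roles over to ECS. The ARNs are placeholders you'd fill in yourself.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "ecs:RunTask",
            "Resource": "arn:aws:ecs:REGION:ACCOUNT_NUMBER:task-definition/my-etl-task:*"
        },
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": [
                "arn:aws:iam::ACCOUNT_NUMBER:role/etl-pipeline-task-role",
                "arn:aws:iam::ACCOUNT_NUMBER:role/etl-pipeline-execution-role"
            ]
        }
    ]
}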
Now let's set up the rule to schedule the task to run with this task definition every day at 5 am UTC. That's usually the time I want it to process data for me, so if it fails I can look into it after breakfast.
aws events put-rule \
  --name "ETLPipelineDailyRun" \
  --schedule-expression "cron(0 5 * * ? *)" \
  --state ENABLED
You should receive back an object with a field called RuleArn. This just confirms that it worked.
The next step is to associate the rule with the ECS task definition.
aws events put-targets --rule "ETLPipelineDailyRun" \
  --targets '[{"Id":"1","Arn":"arn:aws:ecs:REGION:ACCOUNT_NUMBER:cluster/etl-pipeline-cluster","RoleArn":"arn:aws:iam::ACCOUNT_NUMBER:role/ecsEventsRole","EcsParameters":{"TaskDefinitionArn":"arn:aws:ecs:REGION:ACCOUNT_NUMBER:task-definition/my-etl-task","TaskCount":1,"LaunchType":"FARGATE","NetworkConfiguration":{"awsvpcConfiguration":{"Subnets":["SUBNET_ID"],"SecurityGroups":["SECURITY_GROUP_ID"],"AssignPublicIp":"ENABLED"}}}}]'
Remember to set your own values here for region, account number, subnet and security group.
Use the subnets and security group that we got earlier. You can set multiple subnets.
Once you've run the command, the task is scheduled for 5 am every day, and you'll find it under Scheduled Tasks in the AWS console.
AWS Secrets Manager (Optional)
Keeping your Google credentials in the root folder isn't ideal, even if you've restricted the Google service account's access to your datasets.
Here we can add the option of moving these credentials to another AWS service and then accessing them from our container.
For this to work, you'll have to move the credentials file to Secrets Manager, tweak the code so it can fetch the secret to authenticate, and make sure the task role has permission to access AWS Secrets Manager on your behalf.
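A rough sketch of what that code tweak could look like is below. It assumes you've stored the whole JSON key as a secret named google-credentials (an arbitrary name I've picked) and that boto3 has been added to requirements.txt; adapt it to however you structure main.py.

# Sketch: fetch the service account key from Secrets Manager instead of a local file.
import json

import boto3
from google.cloud import bigquery
from google.oauth2 import service_account

secrets = boto3.client("secretsmanager", region_name="YOUR_REGION")
secret_value = secrets.get_secret_value(SecretId="google-credentials")
key_info = json.loads(secret_value["SecretString"])

credentials = service_account.Credentials.from_service_account_info(key_info)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)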
When you're done, you can simply push the updated Docker image to the ECR repo you set up before.
The End Result
Now you've got a very simple ETL pipeline running in a container on AWS on a schedule. The idea is that you add to it to do your own data transformations.
Hopefully this was a useful piece for anyone transitioning to running their long-running data transformation scripts on ECS in a simple, cost-effective and straightforward way.
Let me know if you run into any issues, in case there's something I missed.
❤