
ETL using AWS Glue

Basic Idea

ETL is the core task of data engineers: extracting data from various sources, whether streaming or historical, transforming it into a form that fits the business requirements, and finally loading it into a suitable storage location.
In this blog we will build a simple ETL pipeline using AWS Glue.

AWS GLUE

AWS Glue is a fully managed, serverless ETL service. Serverless means that developers can build and run applications without having to manage servers: the infrastructure is handled entirely by AWS and billed on a pay-as-you-use basis. AWS Glue is used to prepare and transform data for analytics and other processing tasks, and it simplifies data cleaning, transformation into the desired format, and so on.
Let's understand the architecture of AWS Glue.

AWS GLUE ARCHITECTURE



A data store can be anything depending on the use case, e.g. S3, Redshift, etc. Here we load our data into S3 for further processing. The crawler crawls through the folders of the data store, infers the schema, and stores the resulting metadata in the Glue Data Catalog. The Data Catalog is not limited to data stores; it can also hold metadata from other sources, including streaming data sources.

The data source step reads data from the data store using the metadata in the Glue Data Catalog, and the transformation step relies on that same metadata, which is why keeping the Data Catalog well maintained matters. The data target, where the transformed data is stored, does not directly depend on the Data Catalog, although the catalog can still provide some value when working with targets.
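As a quick illustration of what the Data Catalog holds, here is a minimal boto3 sketch that reads the schema a crawler has stored for a table. The database name sales_db and table name source are placeholder assumptions for this sketch, not values taken from the walkthrough.

import boto3

# Assumed names for illustration only: a database "sales_db" and a table
# "source" that a crawler has already registered in the Glue Data Catalog.
glue = boto3.client("glue")

response = glue.get_table(DatabaseName="sales_db", Name="source")

# The catalog stores the inferred schema as column name/type pairs.
for column in response["Table"]["StorageDescriptor"]["Columns"]:
    print(column["Name"], column["Type"])

This is the same metadata that the transformation step later relies on.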

Some Common Terminologies in Glue

Data Store

A data store is the storage location where the source data is kept.

Data Source

A data source is the dataset (or the data itself) within the data store that serves as input.

Data Target

Data target is where the final transformed data is stored.

Crawler

A crawler does exactly what its name suggests: it crawls the dataset in the data store, automatically infers the schema, identifies the data format and the relationships between datasets, and stores the metadata in the Data Catalog.

Simple ETL Pipelining Walkthrough in AWS Glue

Make source and destination folders in an S3 bucket.
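If you prefer to do this from code rather than the console, a minimal boto3 sketch is below; the bucket name my-glue-demo-bucket and the sample file name sales.csv are assumptions for illustration only.

import boto3

s3 = boto3.client("s3")

# Assumed bucket name for illustration.
bucket = "my-glue-demo-bucket"

# S3 has no real folders; zero-byte keys ending in "/" act as folder markers.
s3.put_object(Bucket=bucket, Key="source/")
s3.put_object(Bucket=bucket, Key="destination/")

# Upload a sample dataset into the source folder (hypothetical local file).
s3.upload_file("sales.csv", bucket, "source/sales.csv")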




Navigate to AWS Glue and create a crawler whose source data location is the folder we created above. A database should also be created while setting up the crawler, which is helpful for querying the dataset using SQL.
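The same crawler setup can also be scripted. The rough boto3 sketch below creates and starts a crawler, and then shows how the resulting catalog table could be queried with SQL through Athena; the crawler name, IAM role ARN, database name, and S3 paths are all placeholder assumptions.

import boto3

glue = boto3.client("glue")

# Placeholder names: crawler, IAM role, database, and S3 path are assumptions.
glue.create_crawler(
    Name="sales-source-crawler",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://my-glue-demo-bucket/source/"}]},
)
glue.start_crawler(Name="sales-source-crawler")

# Once the crawler finishes, the catalog table can be queried with SQL via Athena.
athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="SELECT * FROM sales_db.source LIMIT 10",
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://my-glue-demo-bucket/athena-results/"},
)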










Next, navigate to Jobs under AWS Glue Studio. For simplicity, select "Visual with Source and Target" and click Create.





Finally, we will see the following diagram:


Congrats, the simple ETL pipeline is made; however, we have not yet specified the source and target. When clicked, each block shows its respective properties, as shown below.



For the source block, select the Data Catalog table option, then continue to the target block and point it at the destination folder we made at the start.
The transformation block contains a default mapping only; we can adjust, add, or delete fields and perform many other tasks. A rough script equivalent of this visual job is sketched below.
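Under the hood, Glue Studio generates a PySpark script for the visual job. The simplified sketch below shows roughly what such a script looks like; the database, table, column mappings, and S3 path are placeholder assumptions, not the exact script Glue generates for this job.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Source block: read from the Data Catalog table created by the crawler
# (database and table names are assumptions for this sketch).
source = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="source"
)

# Transform block: the mapping step; two hypothetical columns are
# renamed/retyped here to illustrate adjusting the default mapping.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("id", "string", "id", "int"),
        ("amount", "string", "amount", "double"),
    ],
)

# Target block: write the transformed data to the destination folder in S3
# (bucket name is an assumed placeholder).
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-glue-demo-bucket/destination/"},
    format="parquet",
)

job.commit()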

Finally, click Run.


Now the transformed data can be seen in the destination S3 folder.
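To check this from code rather than the console, a quick boto3 listing of the destination folder (bucket and prefix names assumed as before) would look like this:

import boto3

s3 = boto3.client("s3")

# List the objects written by the Glue job (assumed bucket/prefix names).
response = s3.list_objects_v2(Bucket="my-glue-demo-bucket", Prefix="destination/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])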
 

Conclusion

Hence, AWS Glue is a powerful tool for building ETL pipelines. It can also be combined with various other services, such as Apache Hudi and AWS Lambda, to make it even more powerful.

