Data Engineering and Tools

What is Data?

Before jumping into Data Engineering, let's start with what data actually is. Well, in today's world data is everything and everything is data. While we scroll social media, play games online, or surf the internet, terabytes of data are being processed and stored. Data has become an asset, so it needs to be managed well to get as much benefit from it as we can.

Data Lifecycle

If we look at the impact of data in every gadget and piece of tech we use, we rarely stop to think about how that data reaches us in such a form, and from where. Data goes through many processes before it can be used in the real world: some of it is consumed as soon as it is created, as in real-time systems, while some is better used at certain time intervals.
First, data is collected from various sources, then it is ingested into a pipeline and stored. The data is then processed, i.e. converted into the form desired by its users, and used for analysis. The storing, processing, and analyzing steps are iterative, depending on the needs of the company and the business requirements.



fig. Diagram to visualize the data lifecycle.
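
To make the cycle concrete, here is a minimal Python sketch of the collect -> ingest -> store -> process -> analyze loop. It uses only the standard library's sqlite3 module, and the weather readings, table name, and file name are made up purely for illustration.

import sqlite3

def collect():
    # Collect: pretend these rows just arrived from a sensor or a public API.
    return [{"city": "Kathmandu", "temperature": 21.5},
            {"city": "Pokhara", "temperature": 24.0}]

def ingest_and_store(rows, db_path="weather.db"):
    # Ingest + store: land the raw rows in a SQLite table.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS readings (city TEXT, temperature REAL)")
    conn.executemany("INSERT INTO readings VALUES (:city, :temperature)", rows)
    conn.commit()
    return conn

def process_and_analyze(conn):
    # Process + analyze: derive a simple aggregate that an end user actually wants.
    return conn.execute("SELECT AVG(temperature) FROM readings").fetchone()[0]

conn = ingest_and_store(collect())
print("average temperature:", process_and_analyze(conn))

In a real pipeline each of these steps would be a separate, scheduled component, but the order of the stages stays the same.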

Data Engineering

Well, if we describe data engineering in simple terms, it is a discipline within data management that focuses on designing, constructing, and maintaining the systems and infrastructure necessary for the collection, storage, processing, and analysis of large volumes of data. As I already mentioned, this is the era of data, or should I say big data, so data engineers should be able to handle no less than terabytes of data, and depending on the requirements we have different tools to complete the task.
Some typical tasks involved in data engineering are listed below (a small data quality sketch follows the list):

  • Data Ingestion
  • Data Transformation
  • Data Storage
  • Data Processing
  • Data Quality
  • Performance optimization
  • Data Security
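
As a tiny illustration of the Data Quality task above, here is a minimal Python sketch that filters out bad rows before they reach storage. The sample records and the validation rules are assumptions made up for this example, not from the original post.

# Reject rows with a missing identifier or an out-of-range value.
records = [
    {"user_id": 1, "age": 29},
    {"user_id": None, "age": 35},   # missing identifier -> rejected
    {"user_id": 3, "age": -4},      # impossible value -> rejected
]

def is_valid(record):
    return record["user_id"] is not None and 0 <= record["age"] <= 120

clean = [r for r in records if is_valid(r)]
rejected = [r for r in records if not is_valid(r)]
print(f"kept {len(clean)} rows, rejected {len(rejected)} rows")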

It's quite amazing to see how data engineers play a vital role in deriving insights from data. Data engineers use various tools and techniques to develop everything from simple to complex data pipelines.

Tools

  • AWS Glue
  • AWS Kinesis
  • Apache Hadoop
  • Apache Airflow
  • Apache Kafka 
  • Apache Spark
  • Google Cloud
  • Snowflake
  • Microsoft Azure Data Factory
  • MySQL
Well, most of the tools I mentioned are free and open source; however, tools like Glue and Kinesis are paid and follow a pay-as-you-use policy. Each tool has overlapping as well as distinct purposes. For example, Airflow is used for scheduling and monitoring workflows or tasks, while Kafka is used to handle real-time streaming data. Tools like AWS Glue and AWS Kinesis are fully managed, serverless services provided by Amazon, where we can build data pipelines easily using other AWS services. Each tool could be a separate topic of discussion, as each has its own variety of use cases and working mechanisms.
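
To give a flavour of one of these tools, below is a minimal sketch of an Airflow DAG that schedules a daily extract-transform-load workflow. It assumes Apache Airflow 2.x is installed; the dag_id, task names, and print statements are illustrative placeholders, not from the original post.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw data from a source system")

def transform():
    print("clean and reshape the data")

def load():
    print("write the result to a warehouse table")

with DAG(
    dag_id="simple_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",   # Airflow triggers this workflow once a day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Run order: extract -> transform -> load
    extract_task >> transform_task >> load_task

Airflow then takes care of triggering the tasks once a day, retrying failures, and showing the run history in its web UI.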


Conclusion

Overall, we can see how data influences our day-to-day life and how much we depend on it. Imagine getting the weather data of each and every country on your mobile screen versus getting only the filtered weather data for your particular area. Which one is more convenient? This is a small example of why handling data in the correct way is so important, and all of this generally comes under Data Engineering.

